90/10 rule - phenomenon - When 90% of all the stuff management tells you to deploy is monitoring and orchestration software. The remaining 10% is actual make-us-money software.
If you've been following the development activity of Systembot, the bot I wrote to monitor my machines (physical as well as virtual), you've probably noticed that I changed a number of things around pretty suddenly. This is because the version of Systembot in question made some pretty incorrect assumptions about how things should work. For starters, I thought I was being clever when I wrote the temperature monitoring code and decided to use what the drivers thought were the high or critical values for sending "something is wrong" alerts. No math (aside from a Centigrade-to-Fahrenheit conversion), just a couple of values helpfully supplied by the drivers by way of psutil (which is a fantastic module, by the way; I don't play with it enough). This was hunky-dory until Leandra started running a backup job and her CPU temperature spiked to 125 degrees Fahrenheit while encrypting the data. 125 degrees isn't terribly hot as servers go, but the lm_sensors drivers seem to disagree. Additionally, my assumptions about how often to send the "high temperature" alerts (after every four cycles through the "do stuff" loop) were... naive? Optimistic?
Let's go with optimistic.
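If you're curious, the gist of that first approach looks something like this. It's a minimal sketch using psutil rather than Systembot's actual code, and it assumes a Linux box where the lm_sensors drivers expose those thresholds:

```python
#!/usr/bin/env python3
# Sketch of the original approach: trust whatever the drivers report as the
# high/critical thresholds and alert when the current reading crosses them.
# Not Systembot's actual code, just the shape of the idea. Linux-only.
import psutil

def c_to_f(celsius):
    return celsius * 9.0 / 5.0 + 32.0

for chip, readings in psutil.sensors_temperatures().items():
    for sensor in readings:
        current = sensor.current
        # The drivers supply these thresholds (they may be None), and as it
        # turns out they may also be a lot lower than you'd expect.
        if sensor.critical is not None and current >= sensor.critical:
            print(f"{chip}/{sensor.label}: {c_to_f(current):.1f} degrees F - CRITICAL")
        elif sensor.high is not None and current >= sensor.high:
            print(f"{chip}/{sensor.label}: {c_to_f(current):.1f} degrees F - too high")
```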
What it boiled down to was that I was getting hammered with "temperature is too high!" warning messages roughly six times a second. Some experiments with changing the delay were equally optimistic and futile. I bit the bullet and made the delay-between-alerts configurable. What I have yet to do is make the frequency of different kinds of warning events configurable, because right now they all use the same delay (defined in time_between_alerts). Setting this value to 0 disables sending warnings entirely. It's suboptimal at best, but it's not waking me up every few seconds, so I think it'll hold for a couple of days until I can break this logic out a little.
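The throttling itself amounts to something like the following sketch. The only name taken from the actual config is time_between_alerts; the helper functions and the way the alerts get delivered are stand-ins for the sake of illustration:

```python
#!/usr/bin/env python3
# Sketch of throttling warning messages with a single configurable delay.
# time_between_alerts is the real config option; everything else is a stand-in.
import time

time_between_alerts = 3600  # seconds between warnings; 0 disables them entirely
last_alert_sent = {}        # kind of warning -> timestamp of the last one sent

def send_alert(message):
    # Stand-in for however the warnings actually get delivered.
    print(message)

def maybe_send_alert(kind, message):
    """Send a warning only if enough time has passed since the last warning
    of the same kind. Every kind currently shares the same delay."""
    if time_between_alerts == 0:
        return
    now = time.time()
    if now - last_alert_sent.get(kind, 0) >= time_between_alerts:
        send_alert(message)
        last_alert_sent[kind] = now

# Inside the "do stuff" loop, calls look something like:
# maybe_send_alert("cpu_temperature", "Leandra: CPU temperature is too high!")
```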
The second assumption that came back to bite me (leaving aside the hardcoded values that got me here) was that alerting whenever a disk hits 80% of capacity, without any context, isn't necessarily a good idea. My media server at home was also chirping several times a second because one of the hard drives is currently at 85% of capacity. This seems reasonable at first blush, but when you dig a little deeper it's not. 85% of capacity in this case means that there are "only" 411 gigabytes of space left on a 4 terabyte hard drive. Stuff doesn't get added to that drive very often, so that 400+ gigs will last me another couple of months, at least. There's no reason to alert on this, so making this value a parameter in the config file buys me some time before I have to buy another hard drive.
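Sketched out, with a hypothetical disk_usage_threshold option standing in for whatever the parameter winds up being called in the config file, the check looks something like this:

```python
#!/usr/bin/env python3
# Sketch of the fix: pull the "how full is too full" threshold out of the
# config file instead of hardcoding 80%. The option name and value here are
# hypothetical.
import psutil

disk_usage_threshold = 95.0  # percent; set per-host in the config file

for partition in psutil.disk_partitions():
    try:
        usage = psutil.disk_usage(partition.mountpoint)
    except OSError:
        # Some mountpoints (optical drives, pseudo-filesystems) can't be statted.
        continue
    if usage.percent >= disk_usage_threshold:
        free_gb = usage.free / (1024 ** 3)
        print(f"{partition.mountpoint} is {usage.percent:.0f}% full "
              f"({free_gb:.0f} GB free)")
```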
A persistent risk of running a website is the possibility of somebody finding a vulnerability in the CMS and backdooring the code so that commands and code can be executed remotely. At the very least it means that somebody can poke around in the directory structure of the site without being noticed. At worst it would seem that the sky's the limit. In the past, I've seen cocktails of browser exploits injected remotely into the site's theme that try to pop everybody who visits the site, but that is by no means the nastiest thing that somebody could do. This raises the question: how would you detect such a thing happening to your site?
I'll leave the question of logfile monitoring aside, because that depends heavily on your hosting situation and everybody has their own opinions. What I wanted to discuss was the possibility of monitoring the state of every file of a website to detect unauthorized tampering. There are solutions out there, to be sure - the venerable Tripwire, the open source AIDE, and auditd (which I do not recommend - you'd need to write your own analysis software for its logs to figure out which files, if any, have been edited, plus it'll kill a busy server faster than drilling holes in a car battery). If you're in a shared hosting situation like I am, your options are pretty limited because you're not going to have the access necessary to install new packages, and you might not be able to compile and install anything to your home directory. However, you can still put something together that isn't perfect but is fairly functional and will get the job done, within certain limits. Here's how I did it:
Most file monitoring systems store cryptographic hashes of the files they're set to watch over. Periodically, the files in question are re-hashed and the outputs are compared. If the resulting hashes of one or more files are different from the ones in the database, the files have changed somehow and should be manually inspected. The process that runs the comparisons is scheduled to run automatically, while generation of the initial database is normally a manual process. What I did was use command line utilities to walk through every file of my website, generate a SHA-1 hash of each one, and store the hashes in a file in my home directory. (I know, SHA-1 is considered harmful these days, but my threat model does not include someone spending large amounts of computing time to construct a boobytrapped index.php file with the same SHA-1 hash as the existing one. Besides, I want to be a good customer and not crush the server my site is hosted on several times a day when the checks run.)
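In Python form (I actually used command line utilities, but the idea is the same) the whole thing boils down to something like this; the paths here are placeholders, not the real ones:

```python
#!/usr/bin/env python3
# Walk the site's directory tree, SHA-1 every file, and keep the results in a
# flat file in the home directory. Re-run in "check" mode to flag anything
# that changed. Paths are placeholders.
import hashlib
import os
import sys

SITE_ROOT = os.path.expanduser("~/public_html")
HASH_DB = os.path.expanduser("~/site-hashes.txt")

def hash_file(path):
    """Return the SHA-1 hash of a file, reading it in chunks."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            sha1.update(chunk)
    return sha1.hexdigest()

def build_database():
    """Hash every file under the site and write 'hash  path' lines."""
    with open(HASH_DB, "w") as db:
        for dirpath, dirnames, filenames in os.walk(SITE_ROOT):
            for name in filenames:
                path = os.path.join(dirpath, name)
                db.write(f"{hash_file(path)}  {path}\n")

def check_database():
    """Re-hash every file listed in the database and report changes."""
    with open(HASH_DB) as db:
        for line in db:
            stored_hash, path = line.rstrip("\n").split("  ", 1)
            if not os.path.exists(path):
                print(f"MISSING: {path}")
            elif hash_file(path) != stored_hash:
                print(f"CHANGED: {path}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "init":
        build_database()
    else:
        check_database()
```

Run it once in init mode to build the database, then run the check out of cron however often your hosting provider will tolerate.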