Systembot: Adventures in system monitoring.

Dec 28 2018

If you've been following the development activity of Systembot, the bot I wrote to monitor my machines (physical as well as virtual), you've probably noticed that I changed a number of things around pretty suddenly.  This is because the version of Systembot in question made some pretty incorrect assumptions about how things should work.  For starters, I thought I was being clever when I wrote the temperature monitoring code and decided to use what the drivers thought were high or critical values for sending "something is wrong" alerts.  No math (aside from a Centigrade-to-Fahrenheit conversion), just a couple of values helpfully supplied by the drivers by way of psutil (which is a fantastic module, by the way; I don't play with it enough).  This was hunky-dory until Leandra started running a backup job and her CPU temperature spiked to 125 degrees Fahrenheit (about 52 degrees Centigrade) while encrypting the data.  125 degrees isn't terribly hot as servers go, but the lm_sensors drivers seem to disagree.  Additionally, my assumptions about how often to send the "high temperature" alerts (after every four cycles through the "do stuff" loop) were... naive? Optimistic?
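To make the "let the drivers decide" approach concrete, here is a minimal sketch of that kind of check. It is not Systembot's actual code: the names Reading, c_to_f, check_temperature, and read_sensors are made up for illustration, and read_sensors assumes psutil is installed on a platform where psutil.sensors_temperatures() is available (Linux, for instance).

```python
from collections import namedtuple

# Hypothetical container mirroring the per-sensor fields psutil reports:
# a label, the current reading, and the driver's "high" and "critical" values.
Reading = namedtuple("Reading", ["label", "current", "high", "critical"])


def c_to_f(celsius):
    """Convert degrees Centigrade to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def check_temperature(reading):
    """Return "critical", "high", or None, trusting the driver's thresholds.

    Drivers sometimes report no threshold at all (None), in which case
    that comparison is skipped.
    """
    if reading.critical is not None and reading.current >= reading.critical:
        return "critical"
    if reading.high is not None and reading.current >= reading.high:
        return "high"
    return None


def read_sensors():
    """Collect readings via psutil (assumed installed; imported lazily)."""
    import psutil

    readings = []
    for chip, entries in psutil.sensors_temperatures().items():
        for entry in entries:
            # Fall back to the chip name when the sensor has no label.
            readings.append(
                Reading(entry.label or chip, entry.current,
                        entry.high, entry.critical)
            )
    return readings
```

The weak spot is visible in check_temperature: if a driver reports a low "high" value, a perfectly ordinary load spike counts as an alert, and nothing in the code can overrule it.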

Let's go with optimistic.

What it boiled down to was that I was getting hammered with "temperature is too high!" warning messages roughly six times a second.  Some experiments with changing the delay were equally optimistic and futile.  I bit the bullet and made the delay-between-alerts configurable.  What I have yet to do is make the frequency of different kinds of warning events configurable, because right now they all use the same delay (defined in time_between_alerts).  Setting this value to 0 disables sending warnings entirely.  This is less than optimal at best, but it's not waking me up every few seconds, so I think it'll hold for a couple of days until I can break this logic out a little.
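A configurable delay like this can be sketched in a few lines. This is an illustration under assumptions rather than the real implementation: the AlertThrottle class and its should_send method are hypothetical, the delay is assumed to be in seconds, and the only behavior taken from the description above is that one time_between_alerts value governs every alert and that 0 turns warnings off.

```python
import time


class AlertThrottle:
    """Allow at most one alert per time_between_alerts seconds.

    Hypothetical helper; a value of 0 (or less) disables alerts entirely,
    matching the behavior described in the post.
    """

    def __init__(self, time_between_alerts, clock=time.monotonic):
        self.time_between_alerts = time_between_alerts
        self.clock = clock        # injectable so the logic can be tested
        self.last_sent = None     # time of the last alert, if any

    def should_send(self):
        """Return True if an alert may go out now, and record that it did."""
        if self.time_between_alerts <= 0:
            return False
        now = self.clock()
        if (self.last_sent is None
                or now - self.last_sent >= self.time_between_alerts):
            self.last_sent = now
            return True
        return False
```

Because the state lives in the object, giving each kind of warning its own delay later would only mean creating one AlertThrottle per kind instead of sharing a single instance.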

The second assumption that came back to bite me (aside from hardcoding values until something like this happened) was that alerting on 80% of a disk being in use, without any context, is always a good idea.  My media server at home was also chirping several times a second because one of the hard drives is currently at 85% of capacity.  This seems reasonable at first blush, but when you dig a little deeper it's not.  85% of capacity in this case means that there are "only" 411 gigabytes of space left on a 4 terabyte hard drive.  Stuff doesn't get added to that drive very often, so that 400+ gigs will last me another couple of months, at least.  There's no reason to alert on this, so making this value a parameter in the config file buys me some time before I have to buy another hard drive.
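Turning the disk threshold into a parameter might look something like the sketch below. The function names and the threshold_percent argument are hypothetical, and how the value gets read out of the config file is left out because that depends on the file's format. The real-disk check uses shutil.disk_usage from the standard library.

```python
import shutil


def percent_used(total, used):
    """Return the percentage of a disk in use (0.0 for a zero-size disk)."""
    return 100.0 * used / total if total else 0.0


def disk_alert(total, used, threshold_percent):
    """Return True when usage meets or exceeds the configured threshold."""
    return percent_used(total, used) >= threshold_percent


def check_mount(path, threshold_percent):
    """Apply the threshold to a real mount point, sizes in bytes."""
    usage = shutil.disk_usage(path)
    return disk_alert(usage.total, usage.used, threshold_percent)
```

With the threshold at 80, a drive that is 85% full trips the alert; raise it to 90 for that one machine and the same drive stays quiet.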