Systembot: Adventures in system monitoring.

Dec 28 2018

If you've been following the development activity of Systembot, the bot I wrote to monitor my machines (physical as well as virtual), you've probably noticed that I changed a number of things around pretty suddenly.  This is because that version of Systembot made some pretty incorrect assumptions about how things should work.  For starters, I thought I was being clever when I wrote the temperature monitoring code to use whatever the drivers considered high or critical values as the triggers for "something is wrong" alerts.  No math (aside from a Centigrade-to-Fahrenheit conversion), just a couple of values helpfully supplied by the drivers by way of psutil (which is a fantastic module, by the way; I don't play with it enough).  This was hunky-dory until Leandra started running a backup job and her CPU temperature spiked to 125 degrees Fahrenheit while encrypting the data.  125 degrees isn't terribly hot as servers go, but the lm_sensors drivers seem to disagree.  Additionally, my assumptions about how often to send the "high temperature" alerts (after every four cycles through the "do stuff" loop) were... naive? Optimistic?

Let's go with optimistic.

What it boiled down to was that I was getting hammered with "temperature is too high!" warning messages roughly six times a second.  Some experiments with changing the delay were equally optimistic and futile.  I bit the bullet and made the delay between alerts configurable.  What I have yet to do is make the frequency of different kinds of warning events configurable, because right now they all use the same delay (defined in time_between_alerts).  Setting this value to 0 disables sending warnings entirely.  This is suboptimal at best, but it's not waking me up every few seconds, so I think it'll hold for a couple of days until I can break this logic out a little.
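To make the idea concrete, here is a minimal sketch of that kind of throttling.  It is not Systembot's actual code: only the name time_between_alerts comes from the bot, while the AlertThrottle class, its should_send() method, and the injectable clock argument are hypothetical names made up for illustration.  It assumes the delay is in seconds and that a value of 0 means "send nothing," as described above.

```python
import time


class AlertThrottle(object):
    """Remember when each kind of alert last went out (hypothetical helper)."""

    def __init__(self, time_between_alerts, clock=time.time):
        # Seconds to wait between alerts of the same kind.
        # 0 disables alerts entirely, matching the behavior described above.
        self.time_between_alerts = time_between_alerts
        # The clock is injectable so the logic can be tested without sleeping.
        self.clock = clock
        # Maps an alert kind (e.g. "temperature") to its last send time.
        self.last_sent = {}

    def should_send(self, kind):
        """Return True if an alert of this kind may be sent right now."""
        if self.time_between_alerts <= 0:
            return False
        now = self.clock()
        last = self.last_sent.get(kind)
        if last is not None and now - last < self.time_between_alerts:
            # Too soon since the last alert of this kind; stay quiet.
            return False
        self.last_sent[kind] = now
        return True
```

Because the last-sent time is tracked per kind of alert, a noisy temperature sensor wouldn't suppress a disk warning, even though every kind still shares the one delay.  Giving each kind its own delay would be a small step from here.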

The second assumption that came back to bite me (hardcoding values until something like this happened aside) was that alerting on 80% of a disk being in use without any context isn't necessarily a good idea.  My media server at home was also chirping several times a second because one of the hard drives is currently at 85% of capacity.  This seems reasonable at first glance, but when you dig a little deeper it's not.  85% of capacity in this case means that there are "only" 411 gigabytes of space left on a 4 terabyte hard drive.  Stuff doesn't get added to that drive very often, so that 400+ gigs will last me another couple of months, at least.  There's no reason to alert on this, so making this value a parameter in the config file buys me some time before I have to buy another hard drive.
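As a rough sketch of what a disk check with a bit more context could look like: the function below takes the percentage threshold as a parameter instead of hardcoding 80%, and optionally stays quiet while a minimum amount of free space remains.  The function name, both parameter names, and the free-space floor itself are assumptions for illustration, not something Systembot does.  The byte counts could come from psutil.disk_usage(), which reports total and used for a given path.

```python
def disk_needs_alert(total_bytes, used_bytes,
                     max_percent_used=80.0, min_free_bytes=None):
    """Decide whether a disk is full enough to warrant an alert.

    max_percent_used would come from the config file.  min_free_bytes is
    an optional, hypothetical extra: if set, no alert is raised while at
    least that much space is still free, however high the percentage is.
    """
    percent_used = 100.0 * used_bytes / total_bytes
    if percent_used < max_percent_used:
        return False

    free_bytes = total_bytes - used_bytes
    if min_free_bytes is not None and free_bytes >= min_free_bytes:
        # A high percentage, but plenty of room left in absolute terms.
        return False
    return True
```

On a large drive that fills slowly, the absolute amount of free space says more than the percentage does, which is the whole problem with a bare 80% rule.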

A similar problem came up with monitoring the amount of usable memory left on a system.  I'd been taking psutil at its word so that I'd have to do less math:

(env) [drwho@windbringer system_bot]$ python
Python 2.7.15 (default, Jun 27 2018, 13:05:28) 
[GCC 8.1.1 20180531] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import psutil
>>> psutil.virtual_memory().percent

74% of RAM is in use.  Seems pretty high, and it is when you take into account my web browser, Joplin, Atom with the source code for two bots open in separate windows... but a lesser-known fact is that cached information in memory can be freed up to make room for more processes.  Let's look at the output of the free(1) command:

[drwho@windbringer ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15943        9449         389        1986        6104        4179
Swap:             0           0           0
[drwho@windbringer ~]$ pcalc 15943*0.74
    11797.82            0x2e15                0y10111000010101
[drwho@windbringer ~]$ pcalc 15943-11798
    4145                0x1031                0y1000000110001

4,145 megs of RAM left.  That's about 26% of Windbringer's total RAM, and it is getting a little tight in here.  However, while technically you're not supposed to add the amounts of buffered and cache memory to the amount of available (or unallocated) memory anymore, the point remains that the Linux kernel's memory management subsystem can and will drop stuff from the former to add to the latter as needed.  Doing a little math that some people say you shouldn't do anymore...

[drwho@windbringer ~]$ pcalc 6104+4179
    10283               0x282b                0y10100000101011
[drwho@windbringer ~]$ pcalc \(10283/15943\)*100
    64.49852599887097    0x40                  0y1000000

10,283 megs of RAM available, or about 64.5%.  Using two Electron applications aside, Windbringer's not doing too badly, all things considered.
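The same arithmetic fits in a couple of small Python functions, fed with the numbers from the free -m output above.  The function names are made up for this example.  For comparison, psutil documents virtual_memory().percent as (total - available) / total * 100, which is why it lands at about 74% here.

```python
def percent_available(total_mb, available_mb):
    """Share of RAM in the 'available' column, as a percentage."""
    return 100.0 * available_mb / total_mb


def percent_reclaimable(total_mb, buff_cache_mb, available_mb):
    """The 'math you shouldn't do anymore': buff/cache plus available,
    as a percentage of total RAM.  An optimistic estimate, not a guarantee."""
    return 100.0 * (buff_cache_mb + available_mb) / total_mb


# Figures taken from the free -m output above, in megabytes.
total, buff_cache, available = 15943, 6104, 4179

print(round(percent_available(total, available), 1))
print(round(percent_reclaimable(total, buff_cache, available), 1))
```

The first figure is the complement of what psutil reports as in use.  The second is the more forgiving number worked out by hand above.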

You're probably wondering why I'm going to all this trouble writing a dinky little bot that does system monitoring.  It isn't as if there's a shortage of system monitoring applications out there.  The answer is... I've used performance monitoring software before.  Some of it I liked, some of it I didn't, some of it was way too expensive to keep around.  But something I hadn't yet done was try to write my own to get a good understanding of what goes into performance monitoring.  I'm very sure that I haven't come close to using any of the packages I've tried as thoroughly as I could have.  Also, configuring performance monitoring is notoriously difficult because it takes fairly in-depth knowledge of how a system works to do it correctly and get accurate information.  If nothing else, many monitoring packages drown you in stuff you'll never use and make it hard to find the stuff you really care about.  By writing a monitoring system myself, I figured that I'd get a better grasp of what everything does, and more importantly why something is or is not important.  That what I'm writing also happens to be an interactive bot is a bonus.