If you've been following the development activity of Systembot, the bot I wrote to monitor my machines (physical as well as virtual), you've probably noticed that I changed a number of things around pretty suddenly. This is because the version of Systembot in question made some pretty incorrect assumptions about how things should work. For starters, I thought I was being clever when I wrote the temperature monitoring code and decided to use whatever the drivers considered high or critical values as the triggers for "something is wrong" alerts. No math (aside from a Centigrade-to-Fahrenheit conversion), just a couple of values helpfully supplied by the drivers by way of psutil (which is a fantastic module, by the way; I don't play with it enough). This was hunky-dory until Leandra started running a backup job and her CPU temperature spiked to 125 degrees Fahrenheit while encrypting the data. 125 degrees isn't terribly hot as servers go, but the lm_sensors drivers seem to disagree. Additionally, my assumptions about how often to send the "high temperature" alerts (after every four cycles through the "do stuff" loop) were... naive? Optimistic?
Let's go with optimistic.
What it boiled down to was that I was getting hammered with "temperature is too high!" warning messages roughly six times a second. Some experiments with changing the delay were equally optimistic and futile. I bit the bullet and made the delay between alerts configurable. What I have yet to do is make the frequency of different kinds of warning events configurable, because right now they all use the same delay (defined in time_between_alerts). Setting this value to 0 disables sending warnings entirely. This is suboptimal at best, but it's not waking me up every few seconds, so I think it'll hold for a couple of days until I can break this logic out a little.
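To make the idea concrete, here is a minimal sketch of that kind of throttling. It is not Systembot's actual code: the AlertThrottle class and its should_send() method are names I made up for illustration, and only time_between_alerts (and the "0 means off" behavior) comes from the description above. It assumes Python 3 for time.monotonic.

```python
import time


class AlertThrottle(object):
    """Hypothetical helper: allow one alert per type every N seconds."""

    def __init__(self, time_between_alerts, clock=time.monotonic):
        # One shared delay for every kind of alert, as described above.
        self.time_between_alerts = time_between_alerts
        # The clock is injectable so the logic can be tested without sleeping.
        self.clock = clock
        # Maps an alert type (e.g. "temperature") to when it was last sent.
        self.last_sent = {}

    def should_send(self, alert_type):
        # A delay of 0 disables alerting entirely.
        if self.time_between_alerts <= 0:
            return False
        now = self.clock()
        last = self.last_sent.get(alert_type)
        # Too soon since the last alert of this type: stay quiet.
        if last is not None and now - last < self.time_between_alerts:
            return False
        self.last_sent[alert_type] = now
        return True
```

Tracking the last-sent time per alert type keeps a noisy temperature sensor from suppressing a disk warning, and it would be a small step from here to a per-type delay.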
The second assumption that came back to bite me (hardcoding values until something like this happened aside) was that alerting on 80% of a disk being in use without any context isn't necessarily a good idea. My media server at home was also chirping several times a second because one of the hard drives is currently at 85% of capacity. Alerting on that seems reasonable at first glance, but when you dig a little deeper it's not. 85% of capacity in this case means that there are "only" 411 gigabytes of space left on a 4 terabyte hard drive. Stuff doesn't get added to that drive very often, so that 400+ gigs will last me another couple of months, at least. There's no reason to alert on this, so making this value a parameter in the config file buys me some time before I have to buy another hard drive.
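A rough sketch of what a configurable disk check could look like follows. The function names and the max_percent_used parameter are hypothetical, not what Systembot calls them, and I'm using shutil.disk_usage from the standard library (Python 3) here instead of psutil so the example has no dependencies; psutil's own percentage may differ slightly because it accounts for reserved blocks.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Return disk usage as a percentage of total capacity."""
    if total_bytes <= 0:
        # Some pseudo-filesystems report a size of zero.
        return 0.0
    return 100.0 * used_bytes / total_bytes


def disk_over_threshold(path, max_percent_used):
    """True if the filesystem holding path is fuller than the threshold.

    max_percent_used is assumed to come from the config file rather
    than being hardcoded to 80.
    """
    usage = shutil.disk_usage(path)
    return percent_used(usage.total, usage.used) > max_percent_used
```

With the threshold in the config file, the media server's drive could be given a limit of 90 or 95 while a small root partition elsewhere keeps a stricter one.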
A similar problem came up with monitoring the amount of usable memory left on a system. I'd been taking psutil at its word so that I'd have to do less math:
(env) [drwho@windbringer system_bot]$ python
Python 2.7.15 (default, Jun 27 2018, 13:05:28)
[GCC 8.1.1 20180531] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import psutil
>>> psutil.virtual_memory().percent
74.0
74% of RAM is in use. Seems pretty high, and it is when you take into account my web browser, Joplin, Atom with the source code for two bots open in separate windows... but a lesser-known thing is that the amount of cached information in memory can be freed up to make room for more processes. Let's look at the output of the free(1) command:
[drwho@windbringer ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15943        9449         389        1986        6104        4179
Swap:             0           0           0
[drwho@windbringer ~]$ pcalc 15943*0.74
        11797.82        0x2e15          0y10111000010101
[drwho@windbringer ~]$ pcalc 15943-11798
        4145            0x1031          0y1000000110001
4,145 megs of RAM left. That's about 26% of Windbringer's total RAM, and it is getting a little tight in here. However, while technically you're not supposed to add the amounts of buffered and cache memory to the amount of available (or unallocated) memory anymore, the point remains that the Linux kernel's memory management subsystem can and will drop stuff from the former to add to the latter as needed. Doing a little math that some people say you shouldn't do anymore...
[drwho@windbringer ~]$ pcalc 6104+4179
        10283           0x282b          0y10100000101011
[drwho@windbringer ~]$ pcalc \(10283/15943\)*100
        64.49852599887097       0x40    0y1000000
10,283 megs of RAM available, or about 64.5%. Using two Electron applications aside, Windbringer's not doing too badly, all things considered.
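For comparison, here are both calculations as a small Python sketch. The function names are mine, purely for illustration, and the inputs are the numbers from the free -m output above rather than live readings. One caveat: as far as I understand it, the "available" column already includes an estimate of reclaimable cache, which is why adding buff/cache on top of it is the math you're told not to do, and why the second figure should be read as an optimistic upper bound.

```python
def percent_in_use(total_mb, available_mb):
    """Percentage of RAM in use, treating 'available' as what's left."""
    return 100.0 * (total_mb - available_mb) / total_mb


def percent_optimistically_free(total_mb, buff_cache_mb, available_mb):
    """The back-of-the-envelope figure: buff/cache plus available.

    This likely double-counts reclaimable cache, so it is an upper
    bound on what could be freed, not a precise measurement.
    """
    return 100.0 * (buff_cache_mb + available_mb) / total_mb


# Values taken from the free -m output shown earlier (megabytes).
total, buff_cache, available = 15943, 6104, 4179
print(round(percent_in_use(total, available), 1))
print(round(percent_optimistically_free(total, buff_cache, available), 1))
```

The first number lands within a fraction of a percent of what psutil reported, since the two readings were taken at slightly different moments; the second matches the pcalc result.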
You're probably wondering why I'm going to all this trouble writing a dinky little bot that does system monitoring. It isn't as if there's a shortage of system monitoring applications out there. The answer is... I've used performance monitoring software before. Some of it I liked, some of it I didn't, some of it was way too expensive to keep around. But something I hadn't yet done was try to write my own to get a good understanding of what goes into performance monitoring. I'm very sure that I never came close to using any of those packages as thoroughly as I could have. Also, configuring performance monitoring is notoriously difficult because it takes fairly in-depth knowledge of how a system works to do it correctly and get accurate information. If nothing else, many monitoring packages drown you in stuff you'll never use and make it hard to find the stuff you really care about. By writing a monitoring system myself, I figured that I'd get a better grasp of what everything does, and more importantly why something is or is not important. That what I'm writing also happens to be an interactive bot is a bonus.