Calculating entropy with Python.

Sep 13 2020

Fun fact: There is more than one kind of entropy out there.

If you've been through high school chemistry or physics, you might have learned about thermodynamic entropy, which is (roughly speaking) the amount of disorder in a closed system.  Alternatively, and a little more precisely, thermodynamic entropy can be described as the tendency of the heat in a volume of space to equalize throughout that volume.  But that's not the kind of entropy that I'm talking about.

Information theory has its own concept of entropy.  One way of explaining information theory is that it's the mathematical study of messages as they travel through a communications system (which you won't need to know anything about for the purposes of this article).  In the year 1948.ev, Claude Shannon (the father of information theory) wrote a paper called A Mathematical Theory of Communication in which he proposed that the amount of raw information in a message could be thought of as the amount of uncertainty (or perhaps novelty) in a given volume of bits (a message) in a transmission.  So, Shannon entropy could be thought of as asking the question "How much meaningful information is present in this message?"  Flip a coin and there's only one bit - heads or tails, zero or one.  Look at a more complex message and it's not quite so simple.  However, let's consider a computational building block, if you will:

One bit has two states, zero or one, or 2^1 possible states.  Two bits have four possible states: 00, 01, 10, and 11, or 2^2 possible states.  n bits have 2^n possible states.  Now we bring in logarithms, which we can think of in this case as asking "what exponent foo would we need in 2^foo to get the number of possible states of a message?"  That exponent foo is the number of bits of information the message can carry.
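To connect this back to the title: when the symbols in a message aren't all equally likely, Shannon's formula generalizes the logarithm idea by summing -p(x) * log2(p(x)) over every distinct symbol x, where p(x) is the probability of that symbol appearing.  A minimal sketch in Python (the function name is mine, not anything standard):

    import math
    from collections import Counter

    def shannon_entropy(message):
        # Count how many times each symbol appears in the message.
        counts = Counter(message)
        total = len(message)
        # Sum -p(x) * log2(p(x)) over every distinct symbol x.
        entropy = 0.0
        for count in counts.values():
            p = count / total
            entropy -= p * math.log(p, 2)
        return entropy

    # A fair coin flip carries exactly one bit of entropy.
    print(shannon_entropy("01"))        # 1.0
    # A message that never varies carries no information at all.
    print(shannon_entropy("aaaaaaaa"))  # 0.0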

Simple environment monitoring with spare parts.

Jul 04 2020

It's going on summer in the Bay Area, which means that it's warming up a bit both outside and inside (because air conditioning is Not A Thing out here).  That, coupled with the not inconsiderable research infrastructure I have at home, has left me wondering and worrying about just how hot my office gets during the day while I'm working.  Now, I could just put a simple little thermometer on my shelf (and I did), but my concerns are a bit bigger than that.  What happens if my office temperature reaches a critical point and servers start melting down on me?  I've dealt with heat damage in the past and don't particularly care to shell out a grand or so to replace parts that flatlined because I was away from the house and couldn't respond in time.  That, and (if I'm going to be honest) the fact that I need to keep my mind busy while I'm stuck in quarantine, are the reasons why I built yet another weird-assed exocortex project: a relatively simple hardware monitor connected to a Raspberry Pi, and a bot that listens for commands and responds with what it can detect of the local temperature or humidity when I send it a message.
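The sensor-reading half of such a project is only a few lines of code.  As a sketch (and it is only a sketch - I'm assuming a DHT22 temperature/humidity sensor wired to GPIO pin 4 and the Adafruit_DHT module, which may or may not match the actual build):

    import Adafruit_DHT

    # Assumed hardware: a DHT22 sensor on GPIO pin 4 of the Raspberry Pi.
    SENSOR = Adafruit_DHT.DHT22
    PIN = 4

    def read_environment():
        # read_retry() polls the sensor several times because single reads
        # from these cheap sensors fail fairly often.
        humidity, temperature = Adafruit_DHT.read_retry(SENSOR, PIN)
        if humidity is None or temperature is None:
            return None
        # The sensor reports Centigrade; convert to Fahrenheit.
        return (temperature * 9.0 / 5.0 + 32.0, humidity)

The bot part is then just a matter of calling something like this whenever a message arrives and formatting the result as a reply.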

Setting up a private Matrix server.

Jan 11 2020

EDIT - 20200804 - Updated the Nginx stanzas because the newer versions of Certbot do all the work of setting up SSL/TLS support for you, including the most basic Nginx settings.  If those settings are still in your config from an older install you'll run into trouble unless you delete them or comment them out.  Also, Certbot centralizes all of the appropriate SSL configuration and hardening settings into a single includable file (/etc/letsencrypt/options-ssl-nginx.conf) for ease of maintenance.
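To make that concrete, here's roughly the shape the post-Certbot Nginx stanza takes for proxying Synapse (the hostname is a placeholder, and 8008 is Synapse's default plain-HTTP listener; treat this as a sketch rather than a drop-in config):

    server {
        listen 443 ssl;
        server_name matrix.example.com;

        # Certbot generates and maintains these three lines; don't add your
        # own copies or you'll run into the trouble described above.
        ssl_certificate /etc/letsencrypt/live/matrix.example.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/matrix.example.com/privkey.pem;
        include /etc/letsencrypt/options-ssl-nginx.conf;

        # Hand Matrix traffic off to Synapse running on the same machine.
        location /_matrix {
            proxy_pass http://127.0.0.1:8008;
            proxy_set_header X-Forwarded-For $remote_addr;
        }
    }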

A couple of years ago I spent some time trying to set up Matrix, a self-hosted instant messaging and chat system that works a little like Jabber, a little like IRC, a little like Discord, and a little like Slack.  The idea is that anyone can set up their own server which can federate with other servers (in effect making a much larger network), and it can be used for group chat or one-on-one instant messaging.  Matrix also has voice and video conferencing capabilities so you could hold conference calls over the network if you wanted.  For example, one possible use case I have in mind is running games over the Matrix network.  You could even build more exotic forms of conferencing on top of Matrix if you wanted to.  Even more handy is that the Matrix protocol supports end-to-end encryption of message traffic between everyone in a channel as well as between private chats between pairs of people.  If you turn encryption on in a channel it can't be turned off; you'd have to delete the channel entirely (which would then cause the chat history to be purged).

Chat history is something that was a stumbling block in my threat model the last time I ran a Matrix server, somewhen in 2016.  Things have changed quite a bit since then.  For usability Matrix servers store chat history in their database, in part as a synchronization mechanism (channels can exist across multiple servers at the same time) and in part to provide a history that users can search through to find stuff, especially if they've just joined a channel.  For some applications, like collaboration inside a company, this can be a good thing (and in fact, may be legally required).  For other applications (like a bunch of sysadmins venting in a back channel), not so much.  This is why Matrix has three mechanisms for maintaining privacy: end-to-end encryption of message traffic (of entire channels as well as private chats), peer-to-peer voice and video using WebRTC (meaning that there is no server that can record the traffic, it merely facilitates the initial connection), and deleting the oldest chat logs from the back-end database.  While it is true that there is no guarantee that other servers are also rotating out their message databases, end-to-end encryption helps ensure that only someone who was in the channel would have the keys to decrypt any of it.  It also seems feasible to set up Matrix channels such that all of the users are on a single server (such as an internal chat), which means that the discussion will not be federated to other servers.  Channels can also be made invite-only to limit who can join them.  Additionally, who can see a channel's history, and how much of it, can be set on a per-channel basis.

For the record, on the server I built for writing this article the minimum lifetime of conversation history is one calendar day, and the maximum lifetime of conversation history is seven calendar days.  If I could, I'd set it to Signal's default of "delete everything before the last 300 messages," but Synapse doesn't support that, so I tried to split the difference between usability and privacy (maybe I should file a pull request?).  A maintenance mole crawls through the database once every 24 hours and deletes the oldest stuff.  I could probably make it run more frequently than that but I don't yet know what kind of performance impact that would have.
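For the curious, the stanza in Synapse's homeserver.yaml that implements the above looks something like this (a sketch from memory - double-check the Synapse documentation, because retention support has changed across versions):

    retention:
      enabled: true
      default_policy:
        # Keep conversation history for at least one day and at most seven.
        min_lifetime: 1d
        max_lifetime: 7d
      # The maintenance job that crawls through the database every 24 hours
      # and deletes the oldest stuff.
      purge_jobs:
        - interval: 1d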

One of the things I'm going to do in this article is gloss over the common fiddly stuff.  I'm not going to explain how to create an account on a server because I'm going to assume that you know how to look up instructions for doing that.  Hell, I google it from time to time because I don't do it often.  I'm also going to break this process up into a couple of articles.  This one will give you a basic, working install of Synapse (a minimum viable server, if you like).  I also won't go over how to install Certbot (the Let's Encrypt client) to get SSL certificates even though it's a crucial part of the process.  I will explain how to migrate Synapse's database off of SQLite and over to Postgres for better performance in a subsequent article.  For what it's worth I have next to no experience with Postgres, so I'm figuring it out as I go along.  Seasoned Postgres admins will no doubt have words for me.  After that I'll talk about how to make Matrix's VoIP functionality work a little more reliably by installing a STUN server on the same machine.  Later, I'll go over a simple integration of Huginn with a Matrix server (because you just know it's not a technical article unless I bring Huginn into it).

A piece of advice: Don't try to go public with a Matrix server all at once.  The instructions are complex and problematic in places, so this article is written from my notes.  Take your time.  If you rush it you will screw it up, just like I did.  Get what you need working, then move on to the next bit in a day or so.  There's no rush.

Summer vacation is rapidly coming to an end.

Aug 31 2019

It seems as if another summer is rapidly coming to an end.  The neighbors' kids are now back in school, school buses are now picking their way down the streets, and due to Burning Man coming up it's now possible to eat in a real restaurant in the Bay Area for the next couple of days.  I've been pretty quiet lately, not because I've been spending any amount of time offline but because I've been spending more time doing stuff and just not writing it up.  I've been tinkering with Systembot lately, adding functionality that I really have a need for at home, namely, remotely monitoring a wireless access point running OpenWRT in the same way that I watch the rest of my stuff.  Due to the extreme system constraints on your average high-end wireless access point (2 CPUs, 128 megs of storage, 512 megs of RAM) it's not feasible to install Python and a Halo checkout, so I had to figure out how to get the system stats I need remotely.  What I wound up doing was standing up another copy of the standard OpenWRT web server daemon and writing a bunch of tiny CGI scripts which run local commands and return the information to Systembot for processing and analysis.  It wound up being a fun exercise in working with tight constraints, though I think there are still some bugs to be shaken out.
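The Systembot side of that conversation is simple enough to sketch out.  Hypothetically (the hostname, endpoint name, and output format here are made up for illustration; they're not lifted from Systembot's actual code), it looks something like this:

    import requests

    # Hypothetical CGI endpoint on the access point; each tiny script runs
    # a local command and prints the raw output back over HTTP.
    url = "http://accesspoint.example.com/cgi-bin/loadavg"

    response = requests.get(url, timeout=10)
    response.raise_for_status()

    # The contents of /proc/loadavg look like: "0.15 0.10 0.05 1/89 1234"
    one, five, fifteen = response.text.split()[:3]
    print("Load averages: %s %s %s" % (one, five, fifteen))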

An annoying problem solved: Accessing JSON documents with an API.

Jul 28 2019

I spend a lot of time digging around in other people's data.  If I'm not hunting for anything in particular then it's a bit of a crapshoot, to be honest, if only because you never know what you're in for.  You can pretty much take it to the bank that if you didn't assemble it yourself, you can't count on it being complete, well-formed, or anything approximating the output of a human being (it usually came out of a database, but I think you see what I'm getting at).  Sometimes, if I'm really lucky, I'll just get hold of a JSON dump of the database, which to be fair is better than nothing when there isn't even an API to use.  From time to time I'll make an attempt at fitting the data into a database of some kind, sometimes MySQL, sometimes SQLite, or occasionally an API layer like Sandman2.  This is all well and good, but it winds up being more of an adventure than I'm looking for.  I'd much rather be Indiana Jones prowling around in the temple than Rambo going through a preparation montage, because Indy was actually getting stuff done.

Wow, this article went a little off the rails.  I was never good at writing intros to new code... anyway.
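For what it's worth, the "fit it into SQLite" approach I mentioned usually starts with something like the following minimal sketch (the filenames and field names are made up for the example; real dumps are never this tidy):

    import json
    import sqlite3

    # Load a JSON dump - assume a list of flat objects, which is the best case.
    with open("dump.json") as f:
        records = json.load(f)

    db = sqlite3.connect("dump.db")
    db.execute("CREATE TABLE IF NOT EXISTS records (name TEXT, value TEXT)")
    # Hypothetical field names, for the sake of the example.
    db.executemany(
        "INSERT INTO records VALUES (?, ?)",
        [(r.get("name"), r.get("value")) for r in records],
    )
    db.commit()

    # Now the dump is queryable, and a tool like Sandman2 can put a REST API
    # over the resulting database.
    for row in db.execute("SELECT name FROM records WHERE value IS NOT NULL"):
        print(row[0])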

Hacking around memory limitations in shared hosting.

May 30 2019

Longtime readers are aware that I've been a customer of Dreamhost for quite a few years now, and by and large they've done all right by me.  They haven't complained (much) about all the stuff I have running there, and I try to keep my hosted databases in good condition.  However, the server they have my stuff on is starting to act wonky.  Periodic outages mostly, but when my Wallabag installation started throwing all sorts of errors and generally not working right, that got under my skin in a fairly big hurry.  I reinstalled.  I upgraded to the latest stable release.  I installed the latest commit from the source code repository.  401 and 500 errors as far as the eye could see whenever I tried to do anything regardless of what I did.

In a misguided attempt to figure out what was going on, I bit the bullet and installed PHP on one of my servers, along with all of the usual dependencies, and tried to replicate my setup at Dreamhost.  While that was a bit tricky and took some debugging, I eventually got it to work.  It was getting my data out of the sorta-kinda-broken setup that proved troublesome.

Systembot: Adventures in system monitoring.

Dec 28 2018

If you've been following the development activity of Systembot, the bot I wrote to monitor my machines (physical as well as virtual) you've probably noticed that I changed a number of things around pretty suddenly.  This is because the version of Systembot in question had some pretty incorrect assumptions about how things should work.  For starters, I thought I was being clever when I wrote the temperature monitoring code when I decided to use what the drivers thought were high or critical values for sending "something is wrong" alerts.  No math (aside from a Centigrade-to-Fahrenheit conversion), just a couple of values helpfully supplied by the drivers by way of psutil (which is a fantastic module, by the way; I don't play with it enough).  This was hunky-dory until Leandra started running a backup job and her CPU temperature spiked to 125 degrees Fahrenheit while encrypting the data.  125 degrees isn't terribly hot as servers go, but the lm_sensors drivers seem to disagree.  Additionally, my assumptions of how often to send the "high temperature" alerts (after every four cycles through the "do stuff" loop) were... naive? Optimistic?

Let's go with optimistic.
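For reference, the naive version of that check looks roughly like this with psutil (a sketch of the pattern, not Systembot's actual code):

    import psutil

    # sensors_temperatures() returns a dict mapping driver names to lists of
    # per-sensor readings, each with current, high, and critical values.
    for driver, sensors in psutil.sensors_temperatures().items():
        for sensor in sensors:
            current_f = sensor.current * 9.0 / 5.0 + 32.0
            # The naive approach: trust whatever the driver thinks is "high."
            if sensor.high is not None and sensor.current >= sensor.high:
                print("%s/%s is running hot: %.1f degrees F"
                      % (driver, sensor.label, current_f))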

What it boiled down to was that I was getting hammered with "temperature is too high!" warning messages roughly six times a second.  Some experiments with changing the delay were equally optimistic and futile.  I bit the bullet and made the delay-between-alerts configurable.  What I have yet to do is make the frequency of different kinds of warning events configurable, because right now they all use the same delay (defined in time_between_alerts).  Setting this value to 0 disables sending warnings entirely.  This is only somewhat less suboptimal, but it's not waking me up every few seconds, so I think it'll hold for a couple of days until I can break this logic out a little.
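In skeleton form, the fix looks something like this (time_between_alerts is the real config option; everything else is named for illustration):

    import time

    # time_between_alerts comes from the config file; 0 disables warnings.
    time_between_alerts = 3600
    last_alert_sent = 0

    def maybe_send_alert(message):
        global last_alert_sent
        if not time_between_alerts:
            return
        # Only send another warning if the configured delay has elapsed
        # since the last one went out.
        if time.time() - last_alert_sent >= time_between_alerts:
            print(message)  # stand-in for messaging the bot's owner
            last_alert_sent = time.time()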

The second assumption that came back to bite me (leaving aside the hardcoded values until something like this happened) was that alerting on 80% of a disk being in use, without any context, is a good idea.  My media server at home was also chirping several times a second because one of the hard drives is currently at 85% of capacity.  That threshold seems reasonable at first glance, but when you dig a little deeper it's not.  85% of capacity in this case means that there are "only" 411 gigabytes of space left on a 4 terabyte hard drive.  Stuff doesn't get added to that drive very often, so that 400+ gigs will last me another couple of months, at least.  There's no reason to alert on this, so making this value a parameter in the config file buys me some time before I have to buy another hard drive.
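The fix is the same shape as the temperature one: pull the thresholds out of the config file and check them per mountpoint.  A minimal sketch (the mountpoints and thresholds are examples, not my actual config):

    import psutil

    # Hypothetical per-mountpoint thresholds loaded from the config file.
    disk_thresholds = {"/": 80.0, "/media/storage": 95.0}

    for partition in psutil.disk_partitions():
        usage = psutil.disk_usage(partition.mountpoint)
        threshold = disk_thresholds.get(partition.mountpoint, 80.0)
        if usage.percent >= threshold:
            print("%s is at %.1f%% of capacity"
                  % (partition.mountpoint, usage.percent))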

Simple things can be hard.

Sep 18 2018

As the title of this post implies, I've been working on some stuff lately that's been taking up enough compute cycles that I haven't been around to post much.  Some of this is due to work, because we're getting into the really busy time of year, and when I haven't been at work I've been relaxing.  Some of this is due to yet another run of dental work that, while it hasn't really been worth writing about, has resulted in my going to bed and sleeping straight through until the next day.  And some of it's due to my hacking on a new project that wound up being... not as hard as I'd imagined it would be, but there certainly has been a steep learning curve.

Parsing simple commands in Python.

Feb 04 2017

A couple of weeks ago I ran into some of the functional limits of my web search bot, a bot that I wrote for my exocortex which accepts English-like commands ("Send me top 15 hits for HAL 9000 quotes.") and runs web searches in response using the Searx meta-search engine on the back end.  This is to say that I gave my bot a broken command ("Send hits for HAL 9000 quotes.") and the parser got into a state where it couldn't cope, threw an exception, and crashed.  To be fair, my command parser was very brittle and it was only a matter of time before I did something dumb and wrecked it.  At the time I patched it with a bunch of if..then checks for truncated and incorrect commands, but if you look at all of the conditionals and ad-hoc error handling I probably made the situation worse, as well as much more difficult to maintain in the long run.  Time for a rewrite.

Back to my long-term memory field.  What to do?

I knew from comp.sci classes long ago that compilers use things called parsers and grammars to interpret code so that it can be converted into an executable.  I also knew that the parser Infocom used in its interactive fiction was widely considered to be the best anyone had come up with in a long time, and it was efficient enough to run on humble microcomputers like the C-64 and the Apple II.  For quite a few years I also ran and hacked on a MOO, which for the purposes of this post you can think of as a massive interactive fiction environment that the players can modify as well as play in; a MOO's command parser does pretty much the same thing as Infocom's IF parser but is responsive to the changes the users make to their environments.  I also recalled something called a parse tree, which I sort-of-kind-of remembered from comp.sci but, because I'd never actually done anything with them, I only recalled a low-res sketch.  At least I had someplace to start from, so I turned my rebooted web search bot loose with a couple of search terms and went through the results after work.  I also spent some time chatting with a colleague whose knowledge of the linguistics of programming languages is significantly greater than mine and bouncing some ideas off of him (thanks, TQ!)

But how do I do something with all this random stuff?
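To give away a little of where this is going, here's a toy version of the tokenize-and-walk approach (hypothetical code, not my bot's actual parser): split the command into tokens, then step through them looking for the parts you need, and bail out cleanly on anything malformed instead of throwing an exception.

    def parse_search_command(command):
        # Tokenize: lowercase, strip periods, split on whitespace.
        words = command.lower().replace(".", "").split()
        if not words or words[0] != "send":
            return None
        try:
            hits_index = words.index("hits")
        except ValueError:
            return None  # truncated command - bail out instead of crashing
        # "top <number>" is optional; fall back to a default number of hits.
        number_of_hits = 10
        if "top" in words[:hits_index]:
            try:
                number_of_hits = int(words[words.index("top") + 1])
            except (IndexError, ValueError):
                return None
        # Everything after "for" is the search term.
        if "for" not in words[hits_index:]:
            return None
        search_term = " ".join(words[words.index("for", hits_index) + 1:])
        return (number_of_hits, search_term)

    # Both the well-formed and the broken commands now parse or fail cleanly.
    print(parse_search_command("Send me top 15 hits for HAL 9000 quotes."))
    print(parse_search_command("Send hits for HAL 9000 quotes."))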