Have you tried turning it off and back on again?

03 June 2019

Disclaimer: The content of this post does not reflect my current employer, or any of my clients at present.  I've pulled details from my work history dating back about 20 years and stitched them into a more-or-less coherent narrative without being specific about any one company or client because, as unfashionable as it may be, I take my NDAs seriously.  If you want to get into an IT genitalia measuring contest please close this tab, I don't care and have no interest.

Time was, back in the days of the home 8-bit computers, we were very limited in what we could do in more than one way.  Without even a proper reset button or development tools other than the built-in BASIC interpreter, if something went wrong there was really no way that you could debug it.  If you happened to be hacking code in any serious way on the Commodore, chances are you'd shelled out good money for a debugger or disassembler and had at least a couple of reference books nearby.  If you were doing everything in BASIC then either you were growing your program a few lines at a time or using some code you got out of a magazine to do low level programming from inside of BASIC (an exercise fraught with frustration, let me tell you).  Even then, if something went sideways it was difficult to figure out where you went wrong and fix it.  The tools just weren't common at the time.  All you could really do was turn off the machine, wait a few seconds, turn it back on, and give it another shot in the hope that the machine wouldn't lock up on you again.

Now let's jump ahead a couple of decades to the twenty-first century.  We're writing just about everything server side in Java and deploying .jar files that are anywhere from a couple of tens of megabytes to a couple of hundred megabytes in size, counting the dozens of dependencies that have to be installed and/or are packaged with your application to make the bloody thing work.  Dozens upon dozens of libraries that do something so your devs don't have to tear their hair out re-inventing innumerable wheels yet again.  We have cloud hosting providers offering everything and the kitchen sink that you might decide to use for building whatever your enterprisey at-scale thingy is.  And because we can't leave well enough alone, we have to take perfectly good servers and dice them up into virtual machines, and then run clusters of containers inside of those virtual machines with virtual networks tying them all together.  And once you start getting really creative you have to figure out how to get everything communicating appropriately, which means managing configuration information and secrets.  And then throw in load balancers and collecting metrics to monitor it all.  It's a virtual Cambrian explosion of software components.

Eventually you've taken what seemed at first like a fairly straightforward system and sent its complexity spaceward.  And sometimes that's what you have to do - if you want tens of thousands of people using the same thing at the exact same second day in and day out you have to have enough resources for them to hammer on.  Gone are the days where you could stand up one database server, one web server, and one application server on the same subnet and have a product.

The thing about all of these dependencies - all of these databases, application servers, web servers, streaming pipelines, message queues, microservices, and whatever else may come - is that at some point you can't know for sure at any given moment what's going on anymore.  As these systems become more complex, emergent behaviors appear.  Some work in your favor, some are just there and you have to learn how to live with them because they're strange but not hurting anything, and some can bring the whole shebang crashing down around your ears faster than you can throw your work phone against the wall.  Sometimes it happens that there are very tight tolerances on timing or communication delays in between components and a few extra milliseconds (sometimes even mere microseconds are too long!) in between an event being sent and being received can cause multiple components to stop functioning correctly.  Or a third party application decides that it's going to stop responding, peg the CPUs, and cause the rest of the system to slam to a dead stop and you probably don't have any easy way of getting at its insides to see what's happening.  Or there's a bug in an undocumented API that one of your dependencies uses that your code tickled in just the right way to freeze it solid.

The hell of it is, there are no tools for debugging these sorts of emergent properties in a complex system.  It isn't any one component of the system that's acting up to cause trouble, it's the cumulative and sometimes chaotic interactions of every component with every other component resulting in a stunning demonstration of the principle of definitional equivalence.  Who knew that a job scheduler that didn't have anything to fire off for just two seconds would cause a database connector to crash and drop all of its active connections?  How is it that a service bus with a number of customer services that's an even power of 3 will start randomly throwing away messages before any of the customer services can grab them for processing?  Why does running an application inside a container cause it to crash when it gets too busy, but running it on its own virtual machine with an overabundance of system resources that it never makes use of is just fine, but it refuses to even start on the next virtual machine size down?  Who could have predicted that deep inside a third-party service is a leftover check to see if it's running as root (which you should never, ever do without a good reason) which caused it to crash without warning?

Nobody has any idea.  This stuff comes out of nowhere like a meteor on final approach.  You can spend days and weeks and months throwing every utility and trick in the book at it and you'll be no closer to understanding the reasons than you were when you started.  Audit the code all you like; eventually you'll stop noticing the bugs that'll come back to bite you the next time you're on call.  All you can do is log in, terminate the box with extreme prejudice, and let your provider automatically fire up a replacement.  Answers?  Zero if you're lucky.  But no outage, and no angry customers flaming you on Twitter about your SLA.

We've basically come full circle, as the industry tends to do.  We went from "pull the power, it's locked up again" to having tools available to diagnose and fix bugs.  And it was good for a while.  But then we somehow went all the way back around to "just reboot it, because we can't fix it."  And that's really all we can do because that complexity defines the baseline of everyday life.  At the very least, stuff comes back up most of the time.