The overall state of telecommunications.

Dec 08 2019

I'm writing this article well before the year 2020.ev starts, mostly because Twitter's search function is possibly the worst I've ever seen and this is probably my last chance to find the post in question to refer back to.

Late in November of 2019.ev a meme was going around birbsite: "Please quote this tweet with a thing that everyone in your field knows and nobody in your industry talks about because it would lead to general chaos."  Because I was really busy at work at the time I didn't have a chance to chime in, but then an old friend of mine (and, through strange circumstances, co-worker for a time) told an absolute, unvarnished truth of the telecom industry: "Telecommunications as a whole, which also encompasses The Internet, is in a constant state of failure and just in time fixes and functionally all modern communication would collapse if about 50 people, most of which are furries, decided to turn their pager off for a day."

I don't know of any words in the English language that adequately express how true this statement is.  He's as serious as the proverbial heart attack.  For a brief period of time (one solar year, almost to the minute, in fact) I worked for a telecommunications company in Virginia that no longer exists, for reasons that are equal parts fucked up and illegal.  The company was bought out and dismantled roughly a year after I escaped by Zander's employer at the time, and seeing as how this was about fifteen years ago as you read this, I guess I can talk about it in public.

tl;dr - If you value your physical and mental health, don't work in telecom.

Let's start with the hardware, the stuff you almost never see because it's packed away in data centers and unassuming, unmarked offices all over the place.  The core of the company's network infrastructure was a massive Cisco router, the make and model of which I don't recall (see also: fifteen years).  I think it took up 20U in a 44U rack, probably a Cisco 7609.  It was crap.  I think it fell off the back of a truck, both literally and figuratively.  The router in question would drop something like 10% of all the traffic going through it at any one time, which led to calls being inexplicably dropped or never completing.  Even though the company had a Cisco support contract and their field circus came out to the cage in the data center numerous times to troubleshoot (during which they swapped out every module in that router at least twice), they never managed to fix the problem.  That left the only component they hadn't replaced (and in point of fact could not replace without a complete company-down outage): the backplane, the chassis that actually got bolted into the rack and that everything slots into.  And they flat out refused to even have that conversation; they shut the discussion down each and every time.  By the time I left, that damned thing had woken me up at least once a night, all year, when my pager went off.

Now let's talk about the media gateway devices the company was using in production.  To make a long story short, these gizmos cost a shit-ton of money and convert one kind of telephony traffic (VoIP) into another (POTS).  These units worked well enough until it came to monitoring them.  At the time, SNMP (Simple Network Management Protocol, though just as often it stood for Some Nimrod Made Promises) was the gold standard for monitoring your hardware.  There was a bug in the firmware our units were running that would cause them to lock up solid some period of time (usually less than 24 hours) after you polled them for system stats.  By "solid" I mean that the media gateways wouldn't even respond to out-of-band management, which meant that an engineer (usually me) had to drive out to the data center, get into the cage, find the crashed unit, physically pull the power cable for ten minutes (no, I don't know why it took that long, only that it did), plug it back in, and coordinate with someone back at the office to make sure it came back up (which it wouldn't, one time out of twenty or twenty-one).  The manufacturer (I don't remember who it was, but I think they're out of business now; wonder why) told us that they were not going to fix this bug, even after we demonstrated it.  If we wanted metrics on our media gateways, we'd have to log in remotely and run a sequence of commands every five minutes, then copy-and-paste the output into an Excel spreadsheet with embedded macros that processed the data and, if things were too far out of whack, sent an e-mail to an internal mailing list.  This inevitably led to engineers trying to troubleshoot the media gateways without crashing them outright, and usually to the least sleep-deprived network engineer having to drive out to the data center and cycle power on all the units "just in case."
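The "run some commands, paste into Excel, let the macros yell" ritual is the kind of thing that begs to be scripted.  Here's a minimal sketch of the threshold-checking half of that workflow; the metric names and limits are made up for illustration, since I don't remember what the real spreadsheet actually watched:

```python
# Hypothetical sketch of the "process the stats and complain if something's
# out of whack" step the Excel macros handled.  Metric names and thresholds
# are invented for illustration only.

THRESHOLDS = {
    "active_calls": 2000,   # max concurrent calls before we start worrying
    "cpu_percent": 85,      # sustained CPU above this means trouble
    "dsp_errors": 10,       # DSP error count per polling interval
}

def parse_stats(raw):
    """Parse 'name: value' lines copied from the gateway's CLI output."""
    stats = {}
    for line in raw.splitlines():
        if ":" not in line:
            continue
        name, _, value = line.partition(":")
        try:
            stats[name.strip()] = float(value)
        except ValueError:
            pass  # ignore banner lines and anything non-numeric
    return stats

def out_of_whack(stats):
    """Return the names of metrics that exceed their thresholds."""
    return [name for name, limit in THRESHOLDS.items()
            if stats.get(name, 0) > limit]

raw = "active_calls: 1875\ncpu_percent: 97\ndsp_errors: 3"
print(out_of_whack(parse_stats(raw)))  # ['cpu_percent']
```

In real life the alerting step would fire off the e-mail to the internal list; the point is just that none of this needed a human with a spreadsheet every five minutes.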

The actual voice-over-IP stuff, the core functionality of the company, was implemented with a piece of enterprise telecom software from a company called Sylantro, which was acquired a few years later by some suckers who probably didn't know what a mistake they were making.  Sylantro would only sell the company clusters of four Sun V440 servers which they had built and vetted themselves, for a cost somewhere in the low hundreds of thousands of dollars.  The base configuration was two servers doing the actual telecom stuff (connecting calls, logging, the basic stuff) and two dedicated database servers built on top of Systems Directory Server from Sun Microsystems (which also no longer exists in any meaningful way), a piece of software so garbage that even Sun disavowed all knowledge of it.  The single biggest problem we had with SDS was that it would regularly corrupt itself, and maybe once a month it would corrupt itself entirely, for no reason we could discern, killing the entire cluster and resulting in an outage.  The company had to hire someone whose full-time, on-call job it was to use undocumented, don't-tell-Sylantro-or-they'll-cancel-their-joke-of-a-support-contract tools (most of which he wrote himself) to repair and maintain the LDAP server that Sylantro was grossly misusing.

I call it a cluster only to be polite: clustering implies that all of your servers are supporting each other at the same time.  Sylantro only did what would be called failover clustering, which means that one VoIP server was doing stuff and the other was hanging around picking its nose, doing nothing.  Same with the SDS database servers that held all the customer data.  All four servers were directly connected to each other, ostensibly to keep tabs on each other.  The thing is, one server would die and the other would take ten to twenty minutes to read in and cache the entire contents of the SDS datastore (rather than querying it as necessary via the back-end network) before it would start processing calls; the server that died would then become the backup server in the cluster.  Every time this happened it meant an outage for between 50,000 and 75,000 customers.  Needless to say, this does not make for happy customers.  We spent weeks running tcpdump on each cluster's back-end network and scouring system logs to determine what was causing the random failovers.  As near as we could tell, the servers just felt like failing over.  Sylantro-the-company found this amusing and promised us they'd fix it in the next release.  The rest of us, not so much.
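If you've never dealt with failover clustering, a toy model makes the pain obvious.  This is purely illustrative (the node names, the datastore, everything here is made up, not Sylantro's actual design), but it shows where the ten-to-twenty-minute window lives: the standby refuses to take calls until it has slurped the entire datastore into its cache.

```python
# Toy model of an active/standby ("failover") pair.  The slow part is
# called out in promote(): the standby reads and caches the whole
# datastore before it will serve, instead of querying it on demand.
# All names and structures here are invented for illustration.

class Node:
    def __init__(self, name):
        self.name = name
        self.role = "standby"
        self.cache = None

    def promote(self, datastore):
        # In real life this step took ten to twenty minutes, during
        # which no calls were being processed.  That's the outage.
        self.cache = dict(datastore)
        self.role = "active"

def failover(active, standby, datastore):
    """Swap roles: promote the standby, demote the dead node."""
    standby.promote(datastore)   # <-- customers are waiting right here
    active.role = "standby"
    active.cache = None
    return standby, active       # (new active, new standby)

a, b = Node("voip1"), Node("voip2")
a.promote({})                                    # voip1 starts as active
new_active, new_standby = failover(a, b, {"subscribers": 50000})
print(new_active.name, new_active.role)          # voip2 active
```

A real active/active cluster would have both nodes serving (and both already warm), so losing one degrades capacity instead of dropping every call on the floor.  That's the difference the marketing word "cluster" was papering over.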

The servers they sold the company were supposed to have hot-swap capable SCSI controllers.  In theory, when a hard drive failed you could send an engineer to the data center to pop the dead drive out, pop in a replacement, and the controller would rebuild the RAID with no one the wiser.  In practice, however, pulling a drive would crash the server back to the OpenBoot PROM's "ok " prompt, which had the side effect of basically locking the server up, resulting in yet another outage.  Because the company only scheduled hardware maintenance windows after a certain number of drives in a given rack had failed, this often resulted in having to reboot vast swaths of hung servers all at once, an activity just as likely to cause an even bigger outage because sometimes the Sylantro servers would never sync back up.  The only way to get Sylantro nodes to sync back up if they failed the first time was... you guessed it, rebooting the entire cluster again.  Given how long Solaris 8 took to boot, this could easily mean a multiple-hour outage (and did, at least twice when it was my turn in the hotseat).

Now let's talk about the hardware the company's customers had to put up with.  The company would ship subscribers ATAs - analog telephony adapters - into which they would plug their plain old landline phones (or the actually kind of nifty wireless phones the company threw in as a bonus), with the network jack plugged into their home LAN.  In theory the ATA would boot up, associate with the company's back-end VoIP infrastructure, and you could "just make phone calls."  In theory.  In practice the ATA units were utter garbage, with a failure rate approaching 50% within six months.  There were so many RMAs that the NOC staff kept a cardboard box that they would dump the returned units into, and whenever it got full they'd ship it back to the manufacturer and a crate of replacement units would arrive the next week.  For a time the company tried issuing travel-sized ATAs to some customers, units about the size of an ashtray (remember those?), or a bit bigger than a deck of cards.  They ran so hot that we called them travel irons, and there were at least two reported cases of the damned things catching on fire.  The customers sent them back to us, and sure as shit they had clear signs of fire damage emanating from the inside of the shitty plastic case.  We weren't able to tell which component(s) were responsible because the damage was too extensive.  Suffice it to say that particular project was cancelled with some rapidity.  However, and honesty bids me to state this clearly, the rest of the telephony market is far, far, far better off, and you will probably never see this happen because other telephone companies don't cheap out as badly as my former employer did.  As far as I can tell (and I'll probably hear some horror stories at the next lobbycon I attend), hardware catching on fire just isn't a thing.

Now, as for the bit about furries in telecommunications?  I can completely confirm it.  As far as I know I was the only one at the company in question (and it did come up as a topic of conversation a couple of times), and after running a couple of SELECTs on my contact database, I only seem to know one person out of 36 with a network engineering background who is not a furry.  If you ever wonder why you're having trouble getting anywhere in tech support at your telco of choice, you might want to google which furry con is being held that week.  There is also a saying which more folks should keep in mind: "Be polite to the furries you work with, because they have access to the data center and you do not."  Meaning, if the being in question is out sick (which happens distressingly often in telecom), "on vacation," or at a con, the one person you can escalate problems to is not going to be available, leaving you to swing in the wind or figure it out for yourself.  Now apply this principle to a very large IXP (Internet eXchange Point) which may have at least one (and probably more than one) transoceanic cable jacked into it, and you'll get an idea of the scope of this statement.
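For the curious, the sort of SELECT I mean looks something like this, sketched here against a throwaway in-memory SQLite database.  The schema and the contacts are invented for the example; my actual contact database is neither this simple nor this small:

```python
# Invented-for-illustration version of the contact-database query.
# The schema, names, and data are all made up.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE contacts (name TEXT, background TEXT, furry INTEGER)")
db.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [("alice", "network engineering", 1),
     ("bob",   "network engineering", 0),
     ("carol", "network engineering", 1),
     ("dave",  "web development",     0)],
)

# How many network engineers do I know who are NOT furries?
(non_furries,) = db.execute(
    "SELECT COUNT(*) FROM contacts"
    " WHERE background = 'network engineering' AND furry = 0"
).fetchone()
print(non_furries)  # 1
```

Swap in real data and the ratio speaks for itself.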

All said and done, modulo my ex-employer's customer hardware being a fire hazard, this is roughly average for the telecommunications industry, if not in the "not bad, actually" part of a scale ranging from, as Noelle once put it, "one" to "attacking Jackie Chan in a ladder factory while he's holding a baby after he's just said he doesn't want any trouble."  Day in and day out, 24 hours a day, seven days a week, 365 days a year, the infrastructure allowing you to read this blog post is completely, totally, and utterly fucked.  A handful of people are rebooting shit that they can't fix (and the manufacturers refuse to) because that's the only thing they can do.  Weird-ass routes run your traffic through three other tier-one providers in different parts of the country just so you can ping your next door neighbor.  There may be a brand-new college grad with a sleeping bag and an alarm clock camped out in a cage in the middle of nowhere because they have to cycle power on a mission-critical device every sixteen hours (no more, no less).  Someone may have stuffed a length of optical fibre through the wall of their cage in the data center, run it across the aisle of the colo, and through the screen of another cage that their suitemate from Midwest Fur Fest runs, so said buddy can plug it into his border router and provide a working link during an outage at an upstream provider.  Said outage could be due to one of the admins at tier-one ISP "A" having broken up with one of the admins at tier-one ISP "B" and flushed the routes between the two companies, resulting in a cascading outage that took out half a state.

(Yes, this has happened.)

And now, music.