Misadventures in IT.

25 January 2012

I don't ordinarily write much about work, mostly because it's not that interesting but also because it's a bad habit to get into, lest I let something critical slip and get in trouble. However, the last two days were sufficiently rough (and strange) that I feel I have to write something about them, if only to give my fellow BOFHs something to go on if they find themselves in the same position I was in. What I ran into over the past two days is by far the strangest problem I've ever encountered working in IT or information security.

Let's set up the scenario: We have a file server with eight CPUs, 4 GB of RAM, and approximately eight terabytes of data RAIDed across eight drives, all housed in an all-steel server-class case with a 750 watt power supply. It's running Linux. Early Tuesday morning I received a warning from Logwatch that one of the hard drives in the RAID had failed. The array was still functional because the failed drive's mirror twin was still operational, but it's always a good idea to replace a dead drive as soon as possible, lest the other die at the worst possible moment. So, I powered the server down, hauled it up to my office, and set to work. I figured that as long as I had the server offline I could install the set of hot-swappable drive bays I bought last year to save myself having to crack the case open every time I had to replace a drive.
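For anyone who wants to double-check a warning like that by hand, here is a minimal sketch in Python. It assumes the array is Linux software RAID (md), which is a guess for illustration and not a description of this server; hardware RAID controllers don't show up in /proc/mdstat at all. The function name degraded_arrays and the sample text are made up for the example.

```python
import re

def degraded_arrays(mdstat_text):
    """Return {array: (expected, active, failed_members)} for degraded md arrays.

    Assumes the usual /proc/mdstat layout: a header line such as
    'md0 : active raid1 sdb1[1] sda1[0](F)' followed by a status line
    containing '[expected/active]', e.g. '[2/1] [_U]'.
    """
    results = {}
    current = None
    failed = []
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w*)\s*:\s*(.*)$", line)
        if header:
            current = header.group(1)
            # Members the kernel has marked faulty carry an (F) suffix.
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(2))
            continue
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and counts:
            expected, active = int(counts.group(1)), int(counts.group(2))
            if active < expected:
                results[current] = (expected, active, failed)
            current = None  # ignore later lines until the next array header
    return results

# Hypothetical sample input, not output captured from the server in this post.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0](F)
      976630336 blocks super 1.2 [2/1] [_U]

md1 : active raid1 sdd1[1] sdc1[0]
      976630336 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a live box you would feed it the contents of /proc/mdstat instead of the sample string.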

Now let's lay out the symptoms...

After installing the hot-swap sleds, nothing would spin up. The fan in the power supply, the CPU fan, the exhaust fans, even the fans built into the sleds refused to power up. The hard drives refused to spin up as well, and nothing appeared on the display. Dead machine. I double checked the power cable, moved the power cable to a different outlet, checked the breaker box, checked the cutoff switch on the back of the power supply, and even checked the power cables inside to make sure they were all seated properly. No burst or bulging caps on the mainboard, either. I pulled all of the cables and tried again. Same thing. I waited fifteen minutes, then thirty, and tried again each time; same thing. I tore everything out of the machine to replace the power supply with one known to work; same result, or lack thereof.

For the purposes of this discussion you can safely assume that I swapped out everything I possibly could in the machine.

At the very end I decided to transplant the original guts of the server into a new chassis that I'd pulled down off the shelf because there was nothing else that I could reasonably replace. Much to my surprise, my initial smoke test (original power supply, mainboard, CPU, RAM, and graphics card) booted successfully. I then reconnected three of the drives and was able to boot once again. After a few more attempts, each with successively more components, I was able to successfully boot the server into single-user mode and begin rebuilding the array. I scrounged up a couple of single-drive hot-swappable drive sleds and used them for some of the hard drives because the (very expensive) drive bays I'd purchased last year are, as far as I can tell, bupkis. They don't work. There aren't enough exposed 5.25" drive bays in the new chassis for them, anyway, so I would have been able to use at most one of them if they did.

So, the $50kus question is, why in the hell would swapping out the chassis make a difference? I haven't the foggiest notion but I do have a few guesses. The first is that the power or the reset button on the front panel had failed somehow, the former jamming open or the latter jamming closed. Physical failures do happen after all, case in point the hard drive that started this debacle. The second is that there was a short between points on the bottom of the mainboard and the tray the mainboard was bolted to. When I pulled the board off I didn't see anything of the sort, but that doesn't mean there wasn't one. Going a bit farther afield, there may have been an obscure-to-the-coder electrical phenomenon at work, but you may as well chalk that up as speculation.

At any rate, if you ever happen to find yourself in the position of having to resuscitate a server that won't spin up at all (from the exhaust fans to the hard drives to the CPU itself), and you swap out the power supply and that doesn't fix it, try moving the guts into a different chassis and running restart tests in reasonably small increments to see if that fixes it.
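If it helps to see the incremental approach written down, here is a rough sketch in Python of the bookkeeping involved. It is an illustration only: the boots callback is hypothetical and stands in for a human pressing the power button and reporting whether the machine came up, and the component names are placeholders rather than an inventory of the actual server.

```python
def incremental_bringup(baseline, extras, boots, step=1):
    """Add components in small batches and stop at the first failed boot.

    baseline: components for the initial smoke test
    extras:   remaining components, in the order they should be added
    boots:    hypothetical callback; takes a list of components and
              returns True if the machine powered up with them installed
    step:     how many components to add per test

    Returns (last_good_config, failing_batch). failing_batch is empty
    if everything booted.
    """
    config = list(baseline)
    if not boots(config):
        # Even the bare smoke test failed; the fault is in the baseline.
        return [], list(baseline)
    for i in range(0, len(extras), step):
        batch = list(extras[i:i + step])
        if not boots(config + batch):
            return config, batch
        config = config + batch
    return config, []

if __name__ == "__main__":
    baseline = ["power supply", "mainboard", "CPU", "RAM", "graphics card"]
    extras = ["drive 1", "drive 2", "drive 3", "multi-drive bay", "drive 4"]

    # Pretend the multi-drive bay is the part that kills the boot.
    def pretend_boot(parts):
        return "multi-drive bay" not in parts

    good, bad = incremental_bringup(baseline, extras, pretend_boot)
    print("last good config:", good)
    print("failing batch:", bad)
```

Keeping the step small costs more reboots but narrows down the culprit faster when a test finally fails.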