keeping das Blinkenlights on

The Junkyard Server Farm has grown up. The primary data center has over 30 servers and serves a number of diverse users. As it has grown from a couple of old, junked machines on a desk, to taking up most of a walk-in closet, to a growing motley collection of machines on the floor, to getting a formal equipment rack (not a 19″ EIA rack; rack-mount equipment is still way too overpriced for us), I have had the opportunity to re-cable everything. Between times, as machines get added or replaced and new networking requirements (IPv6!) come into play, cables get added without the benefit of careful dressing through established channels, and things begin to look rather a mess.

That situation has worsened considerably since March, when I had to add high-reliability video teleconferencing to our mission at a time when I was working 12-hour days and trying to keep up with all the issues around the SARS-CoV-2 pandemic.

This week, I decided to take a break from direct patient care duties to pay off some technical debt and re-rack the primary data center server farm. The main issue was power management, which I’m afraid I’ve often treated as an afterthought. I designed a protocol for moving servers from the old rack to the new one in a way that would minimize downtime, but that first required completing the power wiring for the new rack, and doing that made me realize just how complex the power situation had become. I sent this email to my colleagues this morning explaining what I had been up to:

I migrated the first server yesterday. What, just the one? Well, yes, but it’s a big deal. The power wiring for the new rack had to be complete before the first server could move, and that’s complicated. There are a total of 38 (if I didn’t miss any) devices, mostly servers, that require power. Further, their power is divided into multiple tiers.

The first is the mains tier: that’s “just” the dedicated regular 115 VAC power from the building. It comes from solar when the sun is shining, from a Tesla Powerwall battery when the sun goes down, and, if the Powerwall is down to its reserve level but the grid is up, from SMUD. In an ideal world, that would be sufficient, but there are complications, so in reality the only things directly connected to the mains tier are three uninterruptible power supplies (UPSs), a Wi-Fi controlled power switch, and the Wi-Fi interface for the solar/Powerwall system.
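
(If it helps to see the fallback order spelled out, here’s a toy Python sketch of the source selection I just described. The function and its inputs are my own invention for illustration; the Powerwall gateway makes these decisions entirely on its own.)

```python
# Toy model of the mains-tier priority described above -- not Powerwall
# firmware or API code; the gateway handles all of this by itself.

def mains_source(sun_is_shining: bool, battery_pct: float,
                 reserve_pct: float, grid_is_up: bool) -> str:
    """Return which source is effectively feeding the mains tier."""
    if sun_is_shining:
        return "solar"
    if battery_pct > reserve_pct:
        return "Powerwall"
    if grid_is_up:
        return "SMUD (grid)"
    return "nothing -- only the UPS tier is still up"

# Example: after dark, battery drawn down to its reserve, grid still up
print(mains_source(False, battery_pct=20, reserve_pct=20, grid_is_up=True))
# -> SMUD (grid)
```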

The second tier is the shed[able] tier. That’s because, in the event of a prolonged power outage (blissfully rare around here), we can live without most of the servers. We need phones, faxes, texting, our websites, Sacdoc, and the domain name system [DNS]. We don’t need to be playing Minecraft or running our automated backups or doing system integrity checks or running the local status display or keeping our redundant failover systems running. So this tier can be remotely powered off. From a full Powerwall charge, we get roughly nine hours of backup if everything is running vs. almost 24 hours running the minimum (if the sun shines enough, we can go indefinitely). Now, I don’t keep the Powerwall fully charged all the time: it’s not good for it (and at $8,000 a pop I’d like it to last a while), and it’s wasteful, since all that stored energy could be used to save me electricity costs. So I constantly adjust the reserve in the Powerwall based on time of day, predicted weather, and predicted load (aircon messes everything up!) to use as much battery power as possible while still keeping enough reserve to last until the next sunny day should SMUD go away. So the ability to shed load is pretty critical, and it’s one of the main drivers of this move to the new server rack.
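
For the curious, the juggling is roughly this arithmetic. The numbers are mine and approximate (a Powerwall 2 stores about 13.5 usable kWh, and the nine-hour/24-hour runtimes above imply a full load of around 1.5 kW and a shed-down minimum of around 0.6 kW); the function is just a sketch of the reasoning, not anything the Tesla app actually runs:

```python
# Back-of-envelope reserve calculation -- my approximate numbers, my made-up function.

CAPACITY_KWH = 13.5        # roughly the usable capacity of one Powerwall 2
MINIMUM_LOAD_KW = 0.6      # phones, faxes, texting, websites, DNS only

def target_reserve_pct(hours_until_next_sun: float,
                       margin: float = 1.25) -> float:
    """Reserve needed to carry the minimum load until useful sun returns,
    padded a bit, with a small floor so the battery never hits zero."""
    needed_kwh = hours_until_next_sun * MINIMUM_LOAD_KW * margin
    return min(100.0, max(5.0, 100.0 * needed_kwh / CAPACITY_KWH))

# Example: storm forecast, 14 dark hours before the panels produce again
print(round(target_reserve_pct(14)))   # ~78 (percent)
```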

[More power management trivia: if the grid stays down, there is no sunshine, and I run the Powerwall completely down, then the solar array won’t start itself when the sun comes up. It needs a little energy in the Powerwall battery to prime the pump or I’m down until grid power comes back. So when there are nearby storms or fires (like now), I keep the battery completely charged since the likelihood of a long-term outage is increased.]

The third tier is the UPS tier. While the mains power shifts seamlessly between solar, grid, and Powerwall several times a day, in the event of an actual grid failure, the power can glitch enough before switchover to kill any connected computers, so we still need UPSs. This tier handles mission-critical servers (phones, websites, DNS, SMS texting, faxing, the fiber-optic cable interface, the data switches, the two routers, the premises alarm, and one of the Wi-Fi access points). So those services will keep running for about an hour even after the solar array, SMUD, and the Powerwall are all down.
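
The “about an hour” figure is nothing fancier than watt-hours divided by watts, derated for inverter losses and aging batteries. The capacities and loads below are placeholders for illustration, not readings from my actual UPSs:

```python
# UPS runtime is just stored energy over load, with a derating factor.
# These figures are placeholders, not measurements from my equipment.

UPS_BATTERY_WH = [600, 600, 400]   # usable watt-hours per UPS (hypothetical)
CRITICAL_LOAD_W = [450, 420, 300]  # load each UPS carries (hypothetical)
DERATING = 0.7                     # inverter efficiency plus tired batteries

for wh, load in zip(UPS_BATTERY_WH, CRITICAL_LOAD_W):
    minutes = 60 * (wh * DERATING) / load
    print(f"{load} W load on a {wh} Wh UPS: ~{minutes:.0f} minutes")
```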

Since you’re just dying to know… what happens after that hour? I have not one, but two gasoline-powered generators that can handle the server farm, though they have to be hauled out of storage and manually connected and started. Four times a year, I do a drill where I get them out and start them up just to stay on top of things. There’s a problem there, however. I only have about two gallons of gasoline in reserve, and I dislike storing any more. I used to rely on having gas I could siphon from the Honda Civic, and right now I can still do that, but if Matthew takes the car to SD, that option goes away. I’m working on a way to power a minimal server configuration from the Tesla Model S, which would give us several more days of reserve. I’m thinking that by that point, we’d have bigger worries than keeping our internet services up.

One last issue that probably keeps you up nights with worry but that I have covered: HVAC. If there’s a winter storm and a tree falls on a SMUD substation (even if nobody’s there to hear it), the servers can actually generate enough heat to keep the server room, if not comfortable, at least not breath-foggingly cold. But what if a midsummer heat wave brings down the grid? The servers (and the operator) will need some form of cooling. The Powerwall can’t independently power the building’s central air (at least not for long enough to matter), and things can stay pretty hot even after the sun goes down. So the server room actually has a small stand-alone air conditioner that can keep things from boiling over without using up all the power reserve.
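
If you want the arithmetic on that one too: essentially every watt the servers draw ends up as heat in the room, and rough numbers (mine, approximate) cover both the winter and the summer cases:

```python
# Rough thermal arithmetic -- approximate figures, not measurements.

SERVER_LOAD_W = 1500            # ballpark draw of the full rack
BTU_PER_HR_PER_WATT = 3.412     # 1 W of electricity becomes 3.412 BTU/hr of heat

heat_btu_hr = SERVER_LOAD_W * BTU_PER_HR_PER_WATT
print(f"Servers dump ~{heat_btu_hr:.0f} BTU/hr into the room")  # ~5,100 BTU/hr

# That's a decent space heater in January, and about what a small 5,000 BTU/hr
# portable unit can remove in July -- versus the several kilowatts a
# whole-house central air system would pull from the Powerwall.
SMALL_AC_W = 600                # typical draw for a small room air conditioner
CENTRAL_AIR_W = 3500            # typical central A/C compressor plus blower
print(f"Small AC: ~{SMALL_AC_W} W vs. central air: ~{CENTRAL_AIR_W} W")
```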

So that’s what it takes to keep the lights blinking and the data flowing.