The following is a true story from Derrick Miller, now a customer engineer at Google Cloud. We collected this and other true tales of survival in The Engineer’s Survival Guide: Expert advice for handling workload (and work-life) disasters, just published by Cockroach Labs. This free book, deftly and hilariously illustrated by Giovanni Cruz, offers top tips from experts for surviving your job, surviving the workplace, and surviving whatever comes next (which, these days, could be anything).

It was October 29, 2012. I was working for Fog Creek Software at 75 Broad Street in Lower Manhattan. Squarespace and Peer 1 Hosting were in the same building, and we all shared an on-prem data center.
Unfortunately this was our only data center at the time, because we had just decommissioned our DR location in an effort to move everything to the cloud. We had tested our code and infrastructure services on AWS, but we had yet to upload our vast amounts of data. Meanwhile, we’d been watching the news about the hurricane heading our way and trying to gauge how it was going to affect New York. There were a lot of unknowns: what’s gonna happen, how bad is it going to be? And the biggest, most concerning unknown was: what are we going to do if the power goes out and stays out?
If you are a developer reading this, you probably recognize Fog Creek as the software company that created Trello and Stack Overflow, among other products. So, software as a service, mainly, which is what was running in our data center. (On-prem installations were also possible, but in that case uptime is the customer’s responsibility.) As a SaaS provider, a lot of your reputation rides on the availability of your system, and people trusted us with their data and their systems.
The immediate revenue impact of an outage wasn’t our main concern. If our data center went down, we’d lose a little bit of data from after the last backups, but we’d still have most of it. No, the real worst-case scenario was the loss of reputation, the loss of customer trust, which can have a very real and long-term revenue impact.
So, yeah, it was terrible timing to have literally just shut down our disaster recovery facility. But we felt like things would probably be ok since the building had backup generators.
Day zero: October 29th
It became obvious that this was going to be a major hurricane. Airlines canceled flights, and New York City closed schools and suspended subway, bus, and commuter train services. Even the stock exchange closed.
We were all sent home from the office early that day, told to work if we were able, but mostly to make sure we were ready. We had a couple of meetings that day and that night, preparing for the worst, and we started an all-hands chat to get everybody in order: communicating updates about what had happened so far and what we were going to do about it.
And then Superstorm Sandy hit.
Day one: October 30th
Eventually, late that night, we heard that the building was compromised by the storm. Specifically, the lobby had filled with water, chest high.
So, yes, the building had backup generators – but they were in the basement, along with the tanks holding the diesel to fuel them. When the storm hit, lower Manhattan flooded and the basement was completely filled with water. Long story short, the generators couldn’t come online because they were under water.
When the storm passed, I was the first person to arrive at the building to help keep the data center alive. The water had gone down enough by then that we could actually go about a half story down into the basement. But there were still several feet of water mixed with diesel fuel. The smell was absolutely horrific. Just the fumes, it was so bad you could barely breathe. Like, nobody light a match, for real. No way these generators were going online any time soon.
But: the data center was still online! The reason it survived was that it had its own backup generator, and that generator was not typical: it had its own small fuel tank. This was an incredible piece of luck.
The folks operating the data center had been there 24/7 since before the storm hit, sleeping on the floor. They were communicating with all of us DC users about how much fuel was left. Everybody was trying to lower usage in the data center to preserve fuel and extend the running time as much as possible, but even so the generator was running low on fuel. It was only a matter of time. But then a miracle happened. Actually, a couple of miracles.
The first was that somebody from one of the other big customers in this colo data center was somehow able to secure a load of diesel fuel, in the middle of a destructive hurricane, when diesel is like gold. They just put it on their personal Amex for $25,000. The second miracle was that, though the streets were still flooded in lower Manhattan, the water had gone down just enough that the truck could actually make it to our building. I guess there was actually a third miracle: they were also able to round up some empty 55-gallon drums. And so the truck came and unloaded all the diesel fuel into these drums.
Enter the bucket brigade
All these things were incredible pieces of luck. But that is when we realized the unlucky part: the data center generator was at the top of the building, on the roof above the 17th floor.
Somehow we had to get the diesel out of those drums and up to the generator. Needless to say, the elevator was not an option. The only thing we could do was empty the drums into 5-gallon buckets, like the kind you get at Home Depot, then carry the buckets up 18 flights of stairs to the data center and pour the fuel into the generator’s tank. Then go back down and do it again. And again. And again.
By now more people had come in to help, including engineers from Stack Overflow. We set up shifts of people doing this bucket brigade, nonstop, for the next 24 hours.
A gallon of diesel fuel weighs about seven pounds. So you were carrying two buckets, each weighing 30 to 40 pounds, up 18 flights of stairs. We didn’t have any lids for them, so the fuel slopped out, and the stairwell got very slippery and filled with diesel fumes.
Keep in mind that power was still out in lower Manhattan. We had lights barely working in the data center, but the main part of the building was zombie town, completely empty. It was eerie. The stairwells were pitch dark, apocalypse-style. The only light was your headlamp or flashlight. After my first shift was done I managed to get a cab home to the Upper East Side. Another miracle, since most transit still wasn’t running. I could have walked all the way if I had to, but I was covered in diesel fuel, feeling incredibly sick from the fumes, and I couldn’t wait to get a shower. When she opened the door my wife was like, we are not even going to try washing those clothes. So we double-bagged them and threw them in the trash.
Since it had not flooded this far up the island we still had power and hot water. So during that time and into the next day we became a refuge for the people who lived in the lower part of Manhattan. People my wife worked with at Stack Overflow, people I worked with at Fog Creek. We had folks come to our place so they could shower and charge their phones in between bucket brigade shifts, even do a little bit of work or at least help out with everything else going on. I ended up putting in a second shift in the data center. Started late and worked through the night. I told myself it was like a free Crossfit workout. Except of course for the toxic fumes.
Meanwhile, back at the data center…
While the bucket brigade was keeping the data center alive, our Fog Creek team also had our last-ditch emergency backup strategy going. There was no way to know how long the power grid would stay down, or how long we could keep the data center alive if it didn’t come back soon. We already had a contract with AWS for hosting the backup. However, this was a vast amount of data, which you can’t upload quickly over a typical internet connection. The only solution was to manually copy all this data to external USB drives and then take them, in person, to an Amazon data center in Virginia. It literally had to be in person, because their policy was that the only way they would receive these disks was via courier. No shipping.
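To see why driving the drives made sense, here’s a back-of-envelope sketch. The story doesn’t give the actual data volume or uplink speed, so the numbers below (10 TB of backups, a 100 Mbps uplink, a six-hour drive to Virginia) are purely hypothetical assumptions:

```python
# Back-of-envelope sneakernet math. All numbers are hypothetical assumptions,
# not figures from the story.
TB = 10**12                    # bytes in a terabyte (decimal)

data_bytes = 10 * TB           # assumed total backup size
uplink_bps = 100 * 10**6       # assumed 100 Mbps office uplink
drive_hours = 6                # rough NYC -> Northern Virginia drive time

# Uploading over the wire
upload_days = (data_bytes * 8) / uplink_bps / 86400
print(f"Upload over a 100 Mbps link: ~{upload_days:.1f} days")

# Driving USB drives down instead
effective_gbps = (data_bytes * 8) / (drive_hours * 3600) / 10**9
print(f"Carrying the drives by car:  ~{effective_gbps:.1f} Gbps effective throughput")
```

Even with generous assumptions about the connection, the car wins by a wide margin, which is why the courier run was the only realistic option.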
So next we had to figure out, first, who’s got a car? Not many people in Manhattan have one. Then, how much gas does the car have, how far can you go? Sandy struck the whole East Coast, and the gas stations that weren’t closed were mostly sold out.
Lessons learned
The story has a happy ending. The situation gradually resolved over the next day or two. The data center stayed alive through one of the most destructive storms in history. We got the data drives on their way to Virginia. Eventually the basement got pumped out and everything in the building came back on and into working order. Even so, I think it’s embarrassing that we put ourselves in an emergency situation to begin with by leaving that gap in our disaster recovery planning, cutting our DR center too early for cost savings. Embarrassing to have to keep sending warning messages out to our customers: Hey, we might be going down.
As for me, I learned a big lesson about having an active disaster recovery plan at all times. A lot of the more modern DevOps practices have this thinking built in already. Define what you want your user experience to be in the event of an outage. What are your goals for latency, traffic, errors, and saturation? Then you can add structure, or design your application architecture, to be resilient and meet those requirements. My advice is: definitely don’t think of availability later. Build it in. Otherwise, what’s gonna happen if the worst-case scenario does play out? Are you ready to take that risk? Because you never know when the next superstorm is coming. Also: if you are ever buying five-gallon buckets, make sure to get the lids.
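To make that advice a little more concrete, here is a minimal sketch of what writing those goals down might look like. Every name and threshold here is hypothetical and illustrative, not anything Fog Creek actually used:

```python
# A minimal, purely illustrative sketch of recording availability goals:
# the four golden signals plus recovery objectives for a DR plan.
# All names and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class AvailabilityTargets:
    p99_latency_ms: float   # worst acceptable p99 request latency
    max_error_rate: float   # max fraction of failed requests
    max_saturation: float   # max fraction of capacity in use
    rpo_minutes: int        # max acceptable data loss (time since last backup)
    rto_minutes: int        # max acceptable time to restore service elsewhere

TARGETS = AvailabilityTargets(
    p99_latency_ms=500,
    max_error_rate=0.001,
    max_saturation=0.8,
    rpo_minutes=15,
    rto_minutes=60,
)

def violated(observed: dict) -> list[str]:
    """Return which targets the observed golden-signal metrics breach."""
    checks = {
        "p99_latency_ms": observed["p99_latency_ms"] <= TARGETS.p99_latency_ms,
        "error_rate": observed["error_rate"] <= TARGETS.max_error_rate,
        "saturation": observed["saturation"] <= TARGETS.max_saturation,
    }
    return [name for name, ok in checks.items() if not ok]

# Example: latency and saturation are over budget, errors are fine.
print(violated({"p99_latency_ms": 850, "error_rate": 0.0004, "saturation": 0.95}))
# -> ['p99_latency_ms', 'saturation']
```

The point isn’t the code itself; it’s that the targets exist in writing before the storm arrives, so the DR plan and its cost can be designed against them instead of improvised with buckets.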
Feature photo: Associated Press