What's been missing from the otherwise excellent coverage of yesterday's 365 Main power outage is that this is not the first major power outage that 365's customers have had to endure. What makes their suddenly-missing press release even more inappropriate is that last February there was a total power outage in Colo 8...and that one could not be blamed on PG&E. One of the commenters on TechCrunch mentions it...I was actually in the colo when it happened.
I had just locked our cage and was heading out of the colo when all the lights went out for about half a second. They came back fast enough that at first I thought I was imagining things - power doesn't go out in a colo, right? But the sudden beeping of alarms and the unmistakable sound of servers rebooting made me realize that, yes, the entire colo had lost power.
Two months later I left the company whose cage I'd been in; during that time, 365 had not provided any compensation for the downtime. Their excuse - that a single controller board on a Hitec power unit had failed due to some input frequency error - was hard to understand, given the things they told us when we toured the facility. The most expensive part of our data center package was the power - it was never supposed to go out, and in the rare case that something could go wrong, we were assured that our cage was fed from three different power infrastructures, so at worst we would only lose 1/3 of our servers.
It is more than a bit disturbing that they could issue a press release touting two years of uptime with Red Envelope when they knew full well that the only reason they'd achieved it was that Red Envelope had the good fortune not to be in Colo 8.
When I asked the head of operations at 365 to sit down with us to map out which servers were getting power from where, so as to mitigate the damage from "the next such incident", I was assured that there would be no "next such incident", so no meeting would be necessary.
The team at 365 has a long way to go to restore credibility. They need to be less enamored of the supposed infallibility of the hardware one sees on their impressive tours (and they are impressive...especially the generator room) and focus instead on a more reliable approach to mitigating outages. Until they can show that they have done so, I'll not be putting any servers anywhere near their facility.
Of course, at Mashery we're avoiding the need to own and operate servers in any colo - that's why we make use of services such as Amazon EC2 and S3.