"Everything fails all the time."
- Amazon's VP and CTO Werner Vogels at the NextWeb conference in Amsterdam about two weeks ago
"Werner was right."
- Me, this past weekend
This past Friday night around midnight, Mashery began to experience what appeared to be some intermittent connectivity issues with some of our Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instances.
This gave us an opportunity to test our "fail over" infrastructure at Limelight. It worked.
It also gave us an opportunity to test AWS's brand spankin' new "Gold" support package. It worked too - we were put in immediate touch with people who knew what they were doing and could work collaboratively with us to isolate the cause and solve the problem. Anyone who uses AWS for critical infrastructure and is not on the Gold plan should sign up today.
But given Mashery's reliance on Amazon Web Services infrastructure for what we do, and the strong relationship we've been able to build with the AWS team, I thought I'd share a bit on what happened, why, and how everyone reacted to it. Everything fails. Minimizing the frequency, duration and impact of those failures, and learning from the failures that do happen, is of the utmost importance. After this weekend's problems, I remain a big fan of AWS, but I know that they, and we, learned some valuable lessons and gained an important data point that will help us minimize future outages.
Cloud computing - it's great, but it's evolving
For those of you not schooled in EC2 and "cloud computing", a brief primer:
Elastic Compute Cloud is just what it sounds like - a "cloud" of virtual servers that are available to us on an ad-hoc basis. With EC2, we are able to add and subtract virtual servers ("instances", in their parlance) as needed to meet demand in real time.
This can be a great advantage when load spikes, such as when our client Reuters was fortunate enough to get slashdotted - we brought ten additional servers on line in a few minutes just in case we needed more capacity, kept them online during the period of peak load, and returned them to the cloud when things had subsided.
Each piece of physical hardware (a server or, as AWS calls it, a "host") can be, in essence, divided into multiple virtual servers. So one computer with 8 gig of RAM can be made to function as 4 virtual machines, each of which has a little less than one fourth of the processing power and a little less than 2 gig of RAM (since some of the memory and processor cycles are needed to run the software that makes the virtual machine magic work). Network infrastructure between physical (and therefore virtual) machines is shared, and everything costs less.
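To put rough numbers on that split, here is a minimal Python sketch. The function name and the 10% overhead figure are assumptions for illustration only; they are not AWS numbers, and real virtualization overhead varies.

```python
def split_host(total_ram_gb, total_cpu, instances, overhead_fraction=0.10):
    """Divide a physical host's resources evenly among virtual instances.

    overhead_fraction is a hypothetical share reserved for the software
    that runs the virtual machines; it is not a published AWS figure.
    """
    if instances <= 0:
        raise ValueError("instances must be positive")
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in [0, 1)")
    usable = 1.0 - overhead_fraction
    return {
        "ram_gb": total_ram_gb * usable / instances,
        "cpu_share": total_cpu * usable / instances,
    }


# One host with 8 gig of RAM, treated as 1.0 unit of CPU, split four ways.
per_instance = split_host(total_ram_gb=8, total_cpu=1.0, instances=4)
print(per_instance)
```

With any overhead above zero, each instance ends up with a little under 2 gig of RAM and a little under a quarter of the processor, which is the point of the example above.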
And, of course, we get to do all of this on a "pay as you go" model - our entire cost for the increased infrastructure during the Reuters load spike couldn't buy you a latte at Starbucks; our entire cumulative cost of servers, storage and bandwidth since we founded the company has been a small fraction of the salary we'd have paid one of the two or three sysadmins we'd have to have hired to run our server cage if we had a cage.
We've been able to build and scale Mashery while taking in less than half of the capital we'd have needed if we were buying and operating our own infrastructure.
Sounds great in theory. It's pretty great in practice, too - many companies (Mashery included) are able to share infrastructure in a way that allows each of us to use the resources we need when we need them, and let others use them when we don't. Cheaper for us, better for the environment - all good.
One of the biggest challenges in virtualization, though, is inter-instance security and resource allocation. In theory, instances - even those on a shared host - are completely isolated from each other. One virtual instance should not be able to interfere with or impact the performance of another - even if they are on the same hardware server, and even if one instance is getting hammered by traffic or other resource-intensive work. The closer your virtual server solution can get to perfection in this realm, the better and more secure your service. Solving this challenge needs to be one of the core competencies of any company (like Amazon) providing cloud computing infrastructure as a service.
It's like living in an apartment building - if you share the same kitchen vent, steps need to be taken to make sure that my pungent dinner doesn't leave your apartment smelling of garlic. And though we might share a lobby, the outside visitor should not really notice; to your visitor, the lobby is clearly yours, and to mine, it's mine. When I take a 30 minute shower in the morning, I shouldn't use up all of your hot water. And so on.
Given the complexity of data center security, the extent to which AWS is able to prevent inter-instance issues is impressive. Impressive, but, as it turns out, not perfect.
What caused the outage?
Well, as Amazon said in the forum where it provided some information on the outage:
This performance issue, affecting a small number of instances in a single availability zone, was the result of a customer applying a very large set of firewall rules while simultaneously launching a very large number of instances. The high volume of firewall rule changes, combined with an unusual rule configuration, exposed a performance degradation bug in the distributed firewall that lives on the physical hosts. The issue has been resolved. In addition, we are also increasing the density of our monitoring to detect and isolate issues in this area of our infrastructure more rapidly.
What does that really mean? Each "host" is a physical machine that hosts several instances. Since it's one physical machine, many resources, such as the network interface that connects the machine to the internal network and, ultimately, the Internet, are shared by all the instances on that machine. Since each instance is performing a different set of tasks for a different customer, each customer will configure that instance differently. One essential configuration of any server (physical or virtual) is its firewall - not a physical firewall appliance, but the software-defined rules that dictate what will and will not be allowed into that particular machine. Each of the instances on the host is allowed to set its own firewall rules, and the common network interface will implement those rules for the traffic coming to that instance.
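As a rough illustration of per-instance firewall rules on a shared host, here is a small Python sketch. The class names, instance IDs and the default-deny behavior are assumptions made for the example; this is a toy model, not how AWS's distributed firewall is implemented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    protocol: str  # e.g. "tcp"
    port: int      # destination port this rule allows


@dataclass
class Host:
    # Maps an instance ID to the set of rules that instance's owner configured.
    rules_by_instance: dict = field(default_factory=dict)

    def set_rules(self, instance_id, rules):
        self.rules_by_instance[instance_id] = set(rules)

    def allows(self, instance_id, protocol, port):
        # Default deny: traffic passes only if the destination instance
        # has a matching rule. Other instances' rules are never consulted.
        rules = self.rules_by_instance.get(instance_id, set())
        return Rule(protocol, port) in rules


host = Host()
host.set_rules("instance-a", [Rule("tcp", 80), Rule("tcp", 443)])
host.set_rules("instance-b", [Rule("tcp", 22)])

print(host.allows("instance-a", "tcp", 80))  # True
print(host.allows("instance-b", "tcp", 80))  # False
```

The rules stay separate per customer, but the component that evaluates them is shared by everything on the host. That shared piece is where one customer's unusual workload can, in principle, slow things down for everyone else.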
So some other AWS customer with whom we were sharing one or more hosts decided to do a load test of some sort. The test involved changing a bunch of firewall rules while starting up and configuring a bunch of new instances (probably to simulate how they would react to a spike in traffic that would need more instances to handle) and throwing a lot of traffic at the instances that were up. That combination exposed an issue that AWS had not tested for before, and the result was that the other instances on some of the hosts (several of which were being used by Mashery) found their network interface intermittently unable to connect to the network.
"Availability zones" is AWS-speak for "physically separate data centers", so they are saying that this particular test only impacted a portion of one of the several EC2 data centers.
How was the root cause diagnosed?
In response to many customer requests and a major EC2 outage a couple weeks ago, AWS recently rolled out a "Service Health Dashboard" which was intended to provide customers with a snapshot of current and recent service outages or degradation. The Red/Yellow/Green presentation is meant to be simple, which it is. But its simplicity severely limits its utility. Looking at whether an entire service is up, or even an entire availability zone is up, will miss instance-level issues like the one we experienced, and provides no visibility into root cause, estimated time to resolve, or steps one can take to work around the outage. So when we noticed that we were having issues, the Service Health Dashboard still showed "all green".
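To see why a service-wide light can stay green while individual instances are in trouble, consider this minimal Python sketch. The thresholds, function names and instance IDs are hypothetical; this is not how the AWS Service Health Dashboard is actually computed.

```python
def service_status(instance_health, yellow_at=0.05, red_at=0.25):
    """Roll per-instance health up into one Red/Yellow/Green light.

    The thresholds are made-up fractions of failing instances.
    """
    if not instance_health:
        return "green"
    failing = sum(1 for healthy in instance_health.values() if not healthy)
    fraction = failing / len(instance_health)
    if fraction >= red_at:
        return "red"
    if fraction >= yellow_at:
        return "yellow"
    return "green"


def failing_instances(instance_health):
    """Instance-level view: list exactly which instances are unhealthy."""
    return sorted(i for i, healthy in instance_health.items() if not healthy)


# 1,000 instances, three of which cannot reach the network.
health = {f"i-{n:04d}": True for n in range(1000)}
for n in (7, 42, 311):
    health[f"i-{n:04d}"] = False

print(service_status(health))     # green
print(failing_instances(health))  # ['i-0007', 'i-0042', 'i-0311']
```

Three failing instances out of a thousand is well under any sensible service-wide threshold, so the rollup reports green. If those three happen to be yours, the dashboard tells you nothing.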
As it turns out, Amazon's internal monitoring did not detect the issue either. We reviewed recent configuration changes and releases, and determined that none had been made in the time leading up to the issue. We all wanted to be extra-careful - it is easy to point fingers during an outage, but the fact that no one had posted about the outage on the AWS forums or on any blog made it appear more likely to be a Mashery issue than an AWS issue. But as I learned many years ago, if you systematically rule out everything that is impossible, then whatever's left, however improbable, has to be the issue. Only after the AWS and Mashery teams went through server log files (our instance-level logs and AWS's host-level and network logs) and the data from Mashery's own 24x7 monitoring, powered by Webmetrics, was the problem discovered. Just as Clay and the Mashery team reached the conclusion that it had to be an AWS issue, the AWS team contacted us and let us know that they had isolated the cause and were applying a fix.
The major lesson learned
Werner's right - everything fails all the time.
Although the root cause of this particular issue was a resource contention issue between instances, things like that are going to continue to happen. There may now be a fix for this particular edge case, but there are undoubtedly others that will crop up over time.
The real failure here was a failure of monitoring, and a failure of transparency.
Amazon.com's own massive architecture is built on a lot of redundant hardware and software, with redundancies built into redundancies. Things break; wounds are routed around and ultimately heal themselves. This model is much easier to build when you have visibility into, and control of, network infrastructure. But when all you control are the individual servers and the DNS records that point to them, you are operating at a distinct disadvantage. AWS's new Elastic IP Addresses are a step in the right direction - a big improvement from what we had before, as I wrote previously - but it is a small step on the path toward a fully outsourced web-application-infrastructure-as-a-service. By definition, such a service needs to provide control over network infrastructure that is analogous to the "root access" we have to the EC2 instances themselves.
Until (and even after) we reach network control nirvana, AWS needs to add three elements to their monitoring:
- Amazon needs to pay attention to monitoring on the instance level, rather than only on the host or network level, because even though a host might look "healthy", one of its instances may be having difficulty. Amazon appears to be recognizing this need - that would be the "increasing the density of our monitoring" mentioned in their statement.
- Amazon needs to recognize that even an anomaly like the one from this weekend, which only affected a few customers in one part of one availability zone, is an "outage". As commenter "mikklr" noted, "Why doesn't the Service Health Dashboard reflect these failures?" Good question.
- Amazon needs to give their customers more visibility into what is happening in the network, and in fact in the hosts that are running our instances. It is not enough for them to say "well, if you're having a problem with an instance, shoot it, and spin up a new one" when we don't have the network control to achieve immediate and reliable cutover. And it's not enough for them to say "well, if you have a critical task being performed by a particular server, run that as an Extra Large Instance so it is not sharing hardware with other instances" when an extra-large instance carries an 8x price tag over a standard instance. For every instance we run, we need to not only be able to monitor the health and load of our instance, but also monitor the same for the host at large, so we can see if there is something happening on our physical host that might be causing an issue. It's simple - when diagnosing a problem, more data and more visibility are better.
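Here is a minimal Python sketch of the kind of diagnosis that host-level visibility would make possible. Everything in it is hypothetical: AWS did not expose host metrics to us, and the metric names and thresholds are invented for the example.

```python
def likely_cause(instance_metrics, host_metrics,
                 cpu_busy=0.80, error_rate_high=0.02):
    """Guess where a problem lives by comparing our instance with its host.

    instance_metrics: {"cpu": fraction of our instance's CPU in use}
    host_metrics: {"network_error_rate": fraction of failed host network ops}
    Both dictionaries and both thresholds are assumptions for illustration.
    """
    instance_busy = instance_metrics["cpu"] >= cpu_busy
    host_unhealthy = host_metrics["network_error_rate"] >= error_rate_high

    if host_unhealthy and not instance_busy:
        return "shared host (possible neighbor or host-level issue)"
    if instance_busy and not host_unhealthy:
        return "our instance (load or configuration)"
    if instance_busy and host_unhealthy:
        return "unclear: both instance and host look stressed"
    return "no obvious cause in these metrics"


# Our instance is nearly idle, but the host's network is dropping traffic.
print(likely_cause({"cpu": 0.10}, {"network_error_rate": 0.15}))
```

With only the instance-level half of that picture, a quiet instance that keeps losing its network connection looks like our own bug. With both halves, the shared host becomes the obvious first suspect.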
On Mashery's side, we will continue to work with all of our infrastructure vendors - AWS, Limelight, Webmetrics and Rightscale - to make sure that we're taking advantage of all the opportunities they can provide to make our service as redundant and fault-tolerant as possible. In other words, we leverage the power of virtualization and cloud computing to allow a small startup to provide the kind of reliability that previously required the infrastructure, in-house expertise and massive capital outlay that only large companies could afford.
All four of these companies have been very responsive. AWS mobilized top engineering, support and management resources to find and fix the problem, and then took the time to listen and react to our questions and concerns. Limelight has brought additional POPs online for us when we had some concerns over implementation speed. Webmetrics helped us reconfigure our monitoring in a way that will help isolate and identify issues similar to the one we had this weekend. And Rightscale, our newest partner, is helping us automate some of the failover, redundancy, load balancing and backup tasks that we used to either do manually or have to maintain our own scripts to do.
AWS and cloud computing allow us to provide a scalable, economical service that meets our customers' ever-increasing load and need for mission-critical reliability with a solution that combines our API and domain expertise and 24x7 service with the best of the expertise, capitalization and sophistication of a company like Amazon. Everyone provides what they're best at, and the customers win. It's a great business.