Suppose your service is hosted on AWS and was affected by last week's outage. If I am reading Amazon's SLA correctly, you would probably be eligible for a refund of 10% of your April bill. Is this supposed to make you feel better? Go read Heroku's excellent post-mortem writeup. It's a refreshingly candid assessment of the impact on Heroku's staff and customers, the response required, and the steps they'll be taking to avoid a repeat. Here are some highlights:
- (Many? Most?) Heroku-based applications were down for 16 to 60 hours. "In short: this was an absolute disaster... we never want to put our customers, our users, or our engineering team through this again." This has to be a major black eye for Heroku in the view of many of their customers.
- Heroku engineers were working around the clock for days to restore service. Anyone who has been in a similar situation knows that this takes a major toll on the team; other Heroku projects will be set back significantly.
- As a direct result of this incident, Heroku will be undertaking significant architectural changes, such as spreading data across multiple AWS regions. This is likely to be a major undertaking, again detracting from other projects.
Does a 0.8% discount on your annual infrastructure bill (10% of one month's bill is roughly 0.8% of a year's spend) compensate for this? Obviously not. The PR damage alone probably costs Heroku more than their entire annual infrastructure bill.
[Note: I do not mean any criticism of Heroku; they were dealing with a major disruption to their underlying infrastructure. The disruption was made significantly worse by the fact that it crossed AWS Availability Zones, something Amazon implies should not happen barring large external events.]
So, Amazon's SLA is not strong enough to adequately compensate for the impact of a major outage. Is the problem unique to Amazon? Let's look around. Under the ("proposed draft") SLA for Google's App Engine, I believe this incident would have fallen in the "99.0% - 95.0%" uptime bucket, yielding a 25% credit for the month. The "Windows Azure Compute Service Level Agreement" (sheesh! Microsoft still pays their writers by the word, I see) also appears to call for a 25% monthly credit. The Rackspace Cloud SLA is more aggressive; the credit for an incident of this scale could be 100%, depending on exactly how it's interpreted. However, that's 100% of a single month's infrastructure fees; still small potatoes, compared to an honest evaluation of the overall impact on Heroku.
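To make the numbers concrete, here's a quick back-of-the-envelope sketch of how an outage of this length maps onto those uptime buckets, and what the resulting credits amount to as a share of annual spend. The credit percentages are simply my reading of the SLAs as quoted above; treat them as illustrative, not authoritative.

```python
# Back-of-the-envelope: outage length -> monthly uptime, and SLA credit ->
# share of annual spend. Credit percentages are my reading of the SLAs
# discussed above, not official figures.

HOURS_IN_MONTH = 30 * 24  # ~720 hours in a 30-day month

def monthly_uptime_pct(outage_hours):
    return 100.0 * (HOURS_IN_MONTH - outage_hours) / HOURS_IN_MONTH

def credit_as_pct_of_annual(monthly_credit_pct):
    # A credit against one month's bill, expressed as a share of a year's spend.
    return monthly_credit_pct / 12.0

for outage_hours in (16, 36, 60):
    print(f"{outage_hours}h outage -> "
          f"{monthly_uptime_pct(outage_hours):.1f}% monthly uptime")

for provider, credit in [("AWS, 10% of the month", 10),
                         ("App Engine / Azure, 25%", 25),
                         ("Rackspace best case, 100%", 100)]:
    print(f"{provider} -> {credit_as_pct_of_annual(credit):.1f}% of annual spend")
```

Even the most generous case, a full month's credit, works out to about 8% of a year's infrastructure spend; the ceiling on these credits is an order of magnitude below the real cost of an incident like this.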
Suppose a hosting provider offered a full year's worth of credit after an incident like this. Would that suffice? It would certainly take some of the sting out of an outage. But I don't think it's a practical approach. First, it's not clear that even a year's infrastructure credit really compensates for an incident that may infuriate a large swath of your customers. Second, it's simply too much risk for the hosting provider. If Rackspace had to refund a year's revenue, it might sink them. If I'm hosted on Rackspace, having them go out of business after an outage might carry an appealing scent of justice, but it certainly doesn't make my life any easier.
If not SLAs, then what?
If SLAs are hopeless as a mechanism for judging the robustness of a service provider, what are we left with? Two things: track records and transparency.
The best way of evaluating a service's robustness is by looking at its track record. If it has been stable for several years, that's a good indication that it will continue to be stable. Of course, this approach isn't much use for evaluating young services... but that's practically a feature: young services almost always have some kinks to be worked out, so the lack of a track record correctly tells you not to trust them too much.
Proper evaluation of a service's track record takes some care. Common metrics such as "mean latency" or "uptime percentage" can mask all sorts of problems. For example, an incident causing latency to spike 5x for 10% of a service's customers for three days might have a huge impact on those customers, yet mean latency for the service as a whole would rise only about 4% for that month. Better would be detailed histograms and percentile measurements of a variety of metrics, over a variety of time scales, reaching back a year or more. I'm not aware of any major hosting provider that publishes such information today. (http://status.aws.amazon.com/, for instance, provides little more than green/yellow/red icons per service per day, and only goes back 35 days. http://code.google.com/status/appengine is significantly better -- there are actual graphs, and the data seems to go back fairly far. But still no percentiles, no histograms, and no ability to roll up data across a long time range.)
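To see just how much a mean can hide, here's an illustrative simulation of the hypothetical above; none of the numbers are measured from a real service.

```python
import numpy as np

# Hypothetical scenario from above: 1,000 customers, a 30-day month, 100 ms
# baseline latency. During a 3-day incident, 10% of customers see 5x latency.
baseline_ms = 100.0
latency = np.full((1000, 30), baseline_ms)  # rows = customers, columns = days
latency[:100, 10:13] *= 5                   # 10% of customers, days 10-12

monthly_mean = latency.mean()
print(f"service-wide mean for the month: {monthly_mean:.0f} ms "
      f"({100 * (monthly_mean / baseline_ms - 1):.0f}% above baseline)")

# Per-customer monthly means, viewed as percentiles, tell a very different story.
per_customer_mean = latency.mean(axis=1)
for p in (50, 95, 99):
    print(f"p{p} of per-customer monthly mean: "
          f"{np.percentile(per_customer_mean, p):.0f} ms")
```

The service-wide mean rises about 4%, while the 95th- and 99th-percentile customers spend the month 40% above baseline (and 5x above baseline during the incident itself). That's the sort of detail a single aggregate number buries.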
Another critical mechanism is transparency -- service providers should explain how their service works under the hood, so that customers can evaluate the likelihood of various scenarios. Lack of transparency was one of the major problems in the recent AWS outage. Amazon states that Availability Zones are "engineered to be insulated from failures in other Availability Zones", but it appears that this insulation failed in last week's incident. Why? We don't know. It may be that, if Amazon had published more information about their service architecture, customers like Heroku would have been able to make a more informed decision as to the safety of relying on a single AWS region.
The clearest statement I'm aware of regarding the isolation of Amazon's Availability Zones relates to physical infrastructure: "Each availability zone runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone." (Quote from the EC2 FAQ.) In our modern age of complex multi-tenant services, most outages stem from software bugs, load surges, configuration errors, or other "soft" problems, not hardware failure. So, merely knowing that Availability Zones have independent hardware doesn't tell us much about the probability of correlated failures.
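The arithmetic behind that concern is worth spelling out. Here's a toy model with invented numbers; the failure rate and the correlated share below are assumptions for illustration, not anything Amazon has published.

```python
# Toy model: how much does correlation between Availability Zones matter?
# All numbers here are invented for illustration.
p_zone = 0.001            # assumed chance a given zone has an outage in a given hour
region_wide_share = 0.2   # assumed share of outages that are region-wide ("soft" failures)

# A two-zone deployment is down if a region-wide event hits, or if both zones
# happen to suffer independent outages at the same time.
p_region_wide = region_wide_share * p_zone
p_isolated = (1 - region_wide_share) * p_zone
p_both_down = p_region_wide + p_isolated ** 2

p_both_if_independent = p_zone ** 2

print(f"fully independent zones: {p_both_if_independent:.1e}")  # 1.0e-06
print(f"20% correlated outages : {p_both_down:.1e}")            # ~2.0e-04
```

Even a modest share of correlated, region-wide failures makes simultaneous loss of both zones a couple of orders of magnitude more likely than the independence assumption suggests, which is why the isolation needs to cover software and operations, not just generators and cooling.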
What can you do?
When evaluating a service or hosting provider, push hard on them to provide you with detailed data on their track record for availability and performance. Ask them to publish it openly.
Assuming your provider has an SLA, ask them why they believe they can meet it. How is their service architected? How does that architecture support the SLA?
Gather as much data as you can regarding the infrastructure you're using, and publish it. Leaving aside the benefit to the community, you'll be surprised at how much you'll learn about your own systems and how to improve them. (I've done some work in this regard, which I've blogged about in the past. If you want to poke around, go to http://amistrongeryet.com/dashboard.jsp.)
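The data-gathering side doesn't have to be elaborate. Here's a minimal sketch of the kind of probe that could sit behind such a dashboard; the URL, file name, and polling interval are placeholders, not anything my dashboard actually uses.

```python
import csv
import time
import urllib.request

PROBE_URL = "https://example.com/health"  # placeholder; point this at your own endpoint
LOG_PATH = "probe_log.csv"                # keep raw samples so percentiles can be computed later
INTERVAL_SECONDS = 60

with open(LOG_PATH, "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        start = time.time()
        try:
            with urllib.request.urlopen(PROBE_URL, timeout=10) as resp:
                status = resp.status
        except Exception:
            status = 0  # record any failure (timeout, DNS error, HTTP error) as an error
        latency_ms = (time.time() - start) * 1000
        writer.writerow([int(start), status, round(latency_ms, 1)])
        f.flush()
        time.sleep(INTERVAL_SECONDS)
```

The point is to keep the raw samples: a log like this lets you compute histograms and percentiles over any time range after the fact, rather than locking you into whatever aggregate you thought to record at the time.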
A modest proposal
Expanding on the previous point... imagine if a large sampling of AWS customers (and App Engine customers, Rackspace customers, etc.) were to publish latency and error rate statistics. Imagine further if there were a central repository aggregating and analyzing this data. We'd no longer be dependent on the hosting services themselves for information.
This would be a nontrivial undertaking, to say the least, but the benefits could be huge. A topic for a future post.