The Good, the OK & the Ugly of Cloud Architecture

The four-day Amazon Web Services (s amzn) outage was no doubt a traumatic experience for many customers, but it didn’t have to be. Although hundreds of sites were affected for prolonged periods — from unknown small companies to Reddit to the New York Times (s nyt) — there were also plenty that stayed up, were minimally affected, or acted quickly to resolve the problem.

By now, it should have been drilled into everyone’s heads that they need to architect for failure if they want high availability from their cloud computing efforts, so here are a few stories that might provide some food for thought on how to do that. Chief among the advice: use Elastic Block Storage at your own peril.

The Good

Twilio. Twilio explained how its system is designed to withstand the failure of a single host, and then elaborated specifically on its use of Elastic Block Storage, the service most affected by the AWS outage. According to Twilio CTO Evan Cooke:

We use EBS at Twilio but only for non-critical and non-latency sensitive tasks. We’ve been a slow adopter of EBS for core parts of our persistence infrastructure because it doesn’t satisfy the “unit-of-failure is a single host principle.” If EBS were to experience a problem, all dependent services could also experience failures. Instead, we’ve focused on utilizing the ephemeral disks present on each EC2 host for persistence.

SmugMug. SmugMug’s Don MacAskill detailed how SmugMug designed its AWS infrastructure to be as distributed and redundant as possible, and, like Twilio, how it avoids EBS because of potential performance problems inherent in the service. But one of SmugMug’s biggest aces in the hole was that it’s not entirely cloud-based yet and, as MacAskill noted, all the data that might have been affected by the EBS issues is still hosted locally “where we can provide predictable performance.” That was a blessing in disguise — this time around:

This has its own downsides – we had two major outages ourselves this week (we lost a core router and its redundancy earlier, and a core master database server later). I wish I didn’t have to deal with routers or database hardware failures anymore, which is why we’re still marching towards the cloud.

Netflix. Netflix (s nflx) hasn’t officially explained why it was able to stay up and running during the outage (it promises a detailed blog post this week), but Netflix cloud master Adrian Cockcroft did confirm on Twitter last week that the company is spread across multiple Availability Zones, which is the baseline strategy for high availability within AWS. Although this outage downed multiple Availability Zones in the US-EAST region, a couple were back online relatively quickly, leaving users confined to the one problem zone reeling for days. And in light of Reddit’s prolonged EBS issues, it’s worth noting that Cockcroft explained the perils of EBS a few days before the outage and detailed Netflix’s strategy for using EBS as effectively and as safely as possible.
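To make the math behind the multi-zone strategy concrete, here’s a minimal Python sketch — purely illustrative, not Netflix’s actual placement logic, with example zone names — showing why spreading a fleet evenly across N zones means a single-zone failure costs only 1/N of capacity:

```python
# Illustrative only: round-robin placement across Availability Zones.
# The zone names are examples; this is not any company's real deployment code.

def place_instances(instance_ids, zones):
    """Assign instances to zones round-robin so the fleet is spread evenly."""
    placement = {zone: [] for zone in zones}
    for i, instance_id in enumerate(instance_ids):
        placement[zones[i % len(zones)]].append(instance_id)
    return placement

def surviving_capacity(placement, failed_zone):
    """Fraction of the fleet still running after one zone fails."""
    total = sum(len(ids) for ids in placement.values())
    lost = len(placement.get(failed_zone, []))
    return (total - lost) / total

zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
placement = place_instances(["i-%04x" % n for n in range(9)], zones)
# Losing any one of three zones leaves 2/3 of the fleet running:
print(surviving_capacity(placement, "us-east-1a"))
```

The same arithmetic explains why multi-region deployment is the next step up: a zone-level spread only helps if the failure really is contained to one zone, which this outage showed isn’t always the case.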

The All Right

Zencoder. Zencoder, a video-encoding service, was down for much of the day on April 21 because of its reliance on EC2 and EBS, but it recovered in a hurry thanks to some quick thinking. Namely, it stood up a backup application environment outside of AWS. As CEO Jon Dahl wrote:

We redeployed our application to a new set of servers, outside of EC2, and helped transition customers to this backup environment. We didn’t immediately point our core application to this new environment, to ensure that we had application and data integrity, but we got many of our customers up and running on this new environment mid-afternoon yesterday. When we could, we brought our main site back online using these new servers.

In the end, Zencoder didn’t lose any data, so the biggest problem its customers faced was high latency in completing pending jobs. Going forward, though, Zencoder is moving its API off of AWS to ensure operations if AWS experiences problems (although it still will use EC2 for encoding), is setting up a backup environment that it can migrate to seamlessly, and is generally improving its “internal processes and planning when dealing with a catastrophic outage.”
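The failover Zencoder performed by hand (repointing traffic at a standing backup environment) can be sketched as an ordered health check over a list of endpoints. The endpoint names and health-check function below are hypothetical, not Zencoder’s actual tooling:

```python
# Hypothetical endpoints and health check: a sketch of ordered failover,
# not Zencoder's actual recovery code.

def first_healthy(endpoints, is_healthy):
    """Return the first endpoint that passes the health check, or None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None

endpoints = ["api.primary.example.com", "api.backup.example.com"]
healthy = {"api.backup.example.com"}  # simulate the primary being down
print(first_healthy(endpoints, healthy.__contains__))
# prints: api.backup.example.com
```

The hard part, as Zencoder’s account makes clear, isn’t the selection logic — it’s keeping the backup environment provisioned and its data consistent enough that pointing customers at it is safe.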

DotCloud. For DotCloud, spreading its Platform-as-a-Service offering across multiple Availability Zones helped prevent a complete disaster despite the company’s reliance on EBS. As Jérôme Petazzoni explained on April 21, only about 25 percent of DotCloud users were affected by the outage, but the company decided to proactively pull its API service-wide as it assessed the situation. The problem, Petazzoni noted, was that users and scripts trying to recover from the outage were “flooding API endpoints” with requests. On April 22, DotCloud’s service was back up and running except for the customers whose EBS volumes were hosted in the still-affected Availability Zone. DotCloud CEO Solomon Hykes vowed a new strategy to protect against this type of situation in the future, which one has to assume will involve decreased reliance on EBS and perhaps multi-region, not just multi-Availability-Zone, deployments.
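The retry storms Petazzoni described — recovering clients “flooding API endpoints” — are commonly shed with per-client rate limiting rather than pulling the API entirely. The token-bucket sketch below is a generic illustration of that technique, not DotCloud’s actual mitigation:

```python
import time

# A generic token-bucket rate limiter, illustrating one way to shed
# retry floods at an API endpoint. Not DotCloud's code.

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        """Admit a request if a token is available; otherwise reject it."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)  # 5 requests/sec, bursts of 10
results = [bucket.allow() for _ in range(20)]
# Roughly the first 10 calls pass (the burst allowance); the rest are shed
# until tokens refill at the steady rate.
```

Rejected requests would get an HTTP 429-style response telling clients to back off, which keeps a recovery stampede from consuming the capacity needed to actually recover.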

PagerDuty. PagerDuty actually represents both good and all-right practices for deploying applications in the cloud. PagerDuty is a cloud service that monitors customers’ systems and alerts them immediately when problems arise, and according to a blog post by John Laban, the company had sent alerts to 36 percent of its customer base as of April 22 — all related to the AWS outage. Whether they choose PagerDuty or something else, cloud users need a service or product that monitors the performance of their cloud-based systems and alerts them when an issue hits. The sooner a business knows about a problem, the sooner it can start working on a solution to mitigate the damage.
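Whatever the monitoring product, its core is a probe loop that alerts after repeated failures. This toy sketch is not PagerDuty’s implementation; the probe and alert callables are placeholders for a real HTTP check and a real paging channel:

```python
# Toy external-monitoring loop: probe a service, alert after several
# consecutive failures so one dropped probe doesn't page anyone.
# check_fn and alert_fn stand in for a real HTTP probe and a real
# notification channel (SMS, phone call, email).

def monitor(check_fn, alert_fn, failures_before_alert=3):
    """Consume probe results until check_fn returns None (toy sentinel).

    Fires alert_fn once each time failures_before_alert consecutive
    probes fail; returns the number of alerts fired.
    """
    consecutive = 0
    alerts = 0
    while True:
        ok = check_fn()
        if ok is None:            # probe stream exhausted (toy only)
            return alerts
        if ok:
            consecutive = 0       # any success resets the streak
        else:
            consecutive += 1
            if consecutive == failures_before_alert:
                alert_fn("service unreachable")
                alerts += 1

# Simulate a healthy service that then goes down:
probes = iter([True, True, False, False, False, None])
fired = []
monitor(lambda: next(probes), fired.append)
print(fired)  # ['service unreachable']
```

The consecutive-failure threshold is the knob that trades alert latency against false pages; during an event like this outage, it’s also what keeps a flapping network from triggering an alert storm of its own.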

But PagerDuty is actually an AWS customer, too, and it appears to go the extra mile with its architectural practices. In fact, the company is so dedicated to providing its service that it might move off AWS entirely just to ensure availability when problems like this happen and its customers start flooding AWS with new instances, thus affecting everyone’s capacity. Wrote Laban in the comments section of his post:

We actually host on AWS as well. We use multiple availability zones, have multiple replicas and extra redundancy, we over-provision, etc, etc.

To be honest, we’re thinking of moving off of AWS just because of the proportion of our customers that use it: when AWS has issues with even one availability zone, it means that most of their customers will spin up instances in the other availability zones, which removes capacity for everyone else, including us. This just happens to be the very time when we’re at our peak request rate, again, because AWS is having issues and causing problems for our customers, who in turn call us.

RightScale. RightScale suffered from the outage, and, as CTO Thorsten von Eicken explained in a post Monday morning, it did take some time for the cloud-management service to determine what was going on and then act accordingly. Failover time also needs to be improved, von Eicken noted. But RightScale was able to recover relatively quickly once the EBS issues were contained to one Availability Zone, and, at the very least, it has designed its service to withstand many outage scenarios (although von Eicken notes that the design of the EBS service did create some unforeseen problems). Also worth considering are the best practices against outages explained in RightScale’s Support FAQs.

The Ugly

Anonymous heart-monitoring company. On Friday, an anonymous company that monitors “hundreds of cardiac patients at home” posted on the AWS forums that it had been unable to read electrocardiogram measurements for its patients for more than a day. The post offered little detail about how the company architected its AWS infrastructure, but that’s not really necessary. Anyone running a truly mission-critical or, in this case, life-or-death application in the cloud has to take absolutely every step necessary to ensure it remains available. Maybe that means running redundant copies of applications and data in AWS regions across the country, or perhaps forgoing AWS in favor of an “enterprise-grade” cloud such as Terremark’s (s vz) aptly named Enterprise Cloud. Maybe that means not using a public cloud at all, if only to head off potential negligence lawsuits should an outage occur and the decision to use cloud computing be called into question. These options will almost certainly be more expensive and require more legwork, but those are small prices to pay should something go wrong.

For everyone else who uses the cloud, whether they were affected by the outage or not, the AWS catastrophe should be a learning opportunity. Assuming no real damage was done to data or business, figure out what you can do to mimic those that came out of this outage unscathed, and do it. Cloud outages of this scale don’t happen often, but they’re inevitable.

Image courtesy of Flickr user stevendamron.