Amazon blames human error for Xmas Eve outage; Netflix vows better resiliency

Amazon Web Services (s amzn) has issued a postmortem of its Christmas Eve cloud computing outage that took many services, most notably Netflix (s nflx), offline for a portion of the night. The cause, according to AWS: A developer accidentally deleted Elastic Load Balancer state data in Amazon’s US-East region that the service’s control plane needs in order to manage load balancers in that region.

All told, the outage (which began at 12:24 p.m. PT) lasted 23 hours and 41 minutes and, at its peak, crippled 6.8 percent of load balancers in the region while leaving others running — albeit unable to scale or be modified by users. The Elastic Load Balancer team didn’t realize the root cause of the problem for several hours, at which point it began the challenging process of attempting to restore the state data to a point in time just before its accidental deletion. At 12:05 p.m. PT on Dec. 25, AWS announced that all affected load balancers had been restored to working order.
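The stated duration can be checked against the start and end times above. This is a minimal sketch, assuming the outage began on Dec. 24, 2012 and that both timestamps are in Pacific time (the variable names are illustrative only):

```python
from datetime import datetime

# Times from the article, both Pacific time. The start date (Dec. 24, 2012)
# is an assumption drawn from the "Christmas Eve" description.
outage_start = datetime(2012, 12, 24, 12, 24)
outage_end = datetime(2012, 12, 25, 12, 5)

duration = outage_end - outage_start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} hours and {minutes} minutes")  # prints "23 hours and 41 minutes"
```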

AWS says it has taken multiple steps to ensure this situation doesn’t repeat itself, or at least can be resolved faster should something similar occur. The first fix, and likely the easiest, was to impose stricter access controls on production data of the type that had been deleted. According to the AWS report, such access is typically restricted, but the company “had authorized additional [Elastic Load Balancer] access for a small number of developers to allow them to execute operational processes that are currently being automated.”

On the technological side, the company had the following to say:

We have also modified our data recovery process to reflect the learning we went through in this event. We are confident that we could recover ELB state data in a similar event significantly faster (if necessary) for any future operational event. We will also incorporate our learning from this event into our service architecture. We believe that we can reprogram our ELB control plane workflows to more thoughtfully reconcile the central service data with the current load balancer state. This would allow the service to recover automatically from logical data loss or corruption without needing manual data restoration.
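The reconciliation idea in that statement can be pictured roughly as follows. This is only an illustrative sketch, not AWS’s implementation: the `LoadBalancerState` record, the `reconcile` function and the sample data are all hypothetical. It assumes the control plane can ask running load balancers for their current configuration and use that to rebuild records missing from the central store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadBalancerState:
    """Hypothetical record of one load balancer's configuration."""
    name: str
    listener_ports: tuple
    instance_ids: frozenset


def reconcile(central, observed):
    """Rebuild missing central records from what running load balancers report.

    central:  dict of name -> LoadBalancerState held by the control plane
    observed: dict of name -> LoadBalancerState reported by live load balancers
    Returns the repaired central store and a list of actions taken.
    """
    repaired = dict(central)
    actions = []
    for name, live in observed.items():
        record = repaired.get(name)
        if record is None:
            # Central record was lost: restore it from the live state.
            repaired[name] = live
            actions.append(("restore", name))
        elif record != live:
            # Both exist but disagree: flag for review instead of guessing.
            actions.append(("flag_mismatch", name))
    return repaired, actions


# Example: the central store has lost one of two records.
web = LoadBalancerState("web", (80, 443), frozenset({"i-aaa", "i-bbb"}))
api = LoadBalancerState("api", (443,), frozenset({"i-ccc"}))
repaired, actions = reconcile({"web": web}, {"web": web, "api": api})
print(actions)  # [('restore', 'api')]
```

In this sketch a mismatch is only flagged, not overwritten, since deciding which side is correct is the hard part that a real control plane would have to solve.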

More shocking than the AWS outage itself (they’ve happened before and will almost certainly happen again) is that the Christmas Eve outage actually took down Netflix, which is often cited as the most-advanced AWS user around. It has a host of homegrown tools built specifically for the purpose of monitoring, managing and adding reliability to its AWS-based infrastructure. There’s a reason even President Obama’s tech team relied on the company’s best practices in order to keep its campaign applications up and running during election crunch time.

Adrian Cockroft (center) at Structure 2012
(c)2012 Pinar Ozger

In a blog post on Monday, Netflix cloud guru Adrian Cockroft acknowledged the outage’s effects on the company’s streaming service and explained how it affected different devices in different ways. Cockroft also provided a mea culpa of sorts, explaining that while Netflix has an impressive track record when AWS outages are confined within Availability Zones, challenges still remain when outages affect significant portions of AWS regions as this one did.

Indeed: Netflix recently survived an outage in October, but was hit by a July outage in which a cascading bug spread across Availability Zones in the US-East region. “We are working on ways of extending our resiliency to handle partial or complete regional outages,” Cockroft wrote.

However, he cautioned, figuring out how to do it correctly will take some work given the complexity of cloud computing infrastructure:

We have plans to work on this in 2013. It is an interesting and hard problem to solve, since there is a lot more data that will need to be replicated over a wide area and the systems involved in switching traffic between regions must be extremely reliable and capable of avoiding cascading overload failures. Naive approaches could have the downside of being more expensive, more complex and cause new problems that might make the service less reliable.
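To make the “cascading overload” concern concrete, here is a minimal sketch of a capacity-aware failover plan. It is not Netflix’s design: the `plan_failover` function, the region names other than US-East, and all of the load and capacity figures are hypothetical. The idea is that traffic from a failed region is handed to healthy regions only up to their spare capacity, and anything beyond that is shed instead of being allowed to overload them.

```python
def plan_failover(load, capacity, failed_region):
    """Split a failed region's load across healthy regions by spare capacity.

    load, capacity: dicts of region name -> requests per second (hypothetical units)
    Returns (extra load assigned to each healthy region, load that must be shed).
    """
    displaced = load[failed_region]
    # Spare capacity in each healthy region, never negative.
    headroom = {
        region: max(capacity[region] - load[region], 0)
        for region in load
        if region != failed_region
    }
    total_headroom = sum(headroom.values())
    # Never accept more than the healthy regions can absorb.
    accepted = min(displaced, total_headroom)
    plan = {
        region: (accepted * room / total_headroom if total_headroom else 0.0)
        for region, room in headroom.items()
    }
    return plan, displaced - accepted


# Hypothetical numbers: US-East fails while carrying 100 units of traffic.
load = {"us-east": 100, "us-west": 40, "eu-west": 30}
capacity = {"us-east": 120, "us-west": 80, "eu-west": 60}
plan, shed = plan_failover(load, capacity, "us-east")
print(plan, shed)  # {'us-west': 40.0, 'eu-west': 30.0} 30
```

A naive approach would send all 100 units to the surviving regions and push both past capacity, which is the kind of new failure Cockroft warns about.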

At this point, though, if anyone can figure out how to build reliable cross-region services on Amazon’s cloud platform, it’s probably Netflix. That said, AWS and other cloud providers will certainly undertake their own work to improve reliability across global data centers, thus making themselves all the more appealing to potential customers.

We’ll have to wait to see how the latest in a string of 2012 outages for AWS affects CIO sentiment toward the cloud or if, like Cockroft, they’ll take the view that “it is still early days for cloud innovation” and there’s plenty of time to fix these difficult problems.

Feature image courtesy of Shutterstock user Zastolskiy Victor.