AWS on Outage: Network Overhaul and Service Credits Coming

Amazon Web Services (AMZN) released its official statement and apology regarding the four-day outage that began on April 21, vowing, as expected, to improve the design of its Elastic Block Store service and generally improve availability across the AWS platform. I predicted as much last week, but I missed the mark elsewhere: Despite no contractual duty to do so, AWS is giving affected customers service credits that extend beyond even the duration of the outage.

AWS’s explanation is heavy on details (more than 5,000 words of them), but the gist is that a standard networking change was carried out incorrectly, which kicked off a chain reaction that resulted in a “stuck” EBS cluster within a single Availability Zone. As has been speculated over the past week, the design of the EBS control plane exacerbated the problem, leading to a situation that “affected the ability to create and manipulate [EBS volumes]” across Availability Zones.

Or, if we think about the AWS network as a highway system, the effects of the outage were like those of a traffic accident. Not only did it result in a standstill on that road, but it also backed up traffic on the on-ramps and slowed down traffic on other roads as drivers looked for alternate routes. The accident is contained to one road, but the effects are felt on nearby and connected roads, too.

AWS promises it will re-architect the EBS service to avoid a recurrence of the factors that led to this prolonged outage. It also will expand the multi-Availability-Zone option across all its services and will begin a series of customer-education events on architecting for maximum availability. On top of taking technical responsibility for the outage, AWS also swallowed some pride, acknowledging that its communications with customers could have been better throughout the outage and vowing to improve in this area.

The most surprising concession might be that AWS is giving 10-day service credits to “customers with an attached EBS volume or a running RDS database instance in the affected Availability Zone in the U.S. East Region at the time of the disruption, regardless of whether their resources and application were impacted or not.” The credits will be “equal to 100 percent of their usage of EBS Volumes, EC2 Instances and RDS database instances that were running in the affected Availability Zone.”

With some closure at last, perhaps AWS and its customers can make haste on the more important work of improving their operations and fixing their architectural shortcomings. I met yesterday with Chris Pinkham, one of the primary architects of Amazon EC2 and now co-founder and CEO of private-cloud startup Nimbula, and he explained that all anyone affected by the outage can do is learn from their mistakes.

These types of situations are bound to happen to anyone working on the cutting edge of technology, he said, and he’s confident that AWS can fix the problem. Even Nimbula was affected because of a registration service hosted on AWS. Pinkham said it’s up to his company to figure out why it was relying so heavily on a single set of cloud resources when they’re so inexpensive and so easy to procure and design for high availability.