AWS Claims ‘Significant Progress’ on Hours-long Outage

Amazon Web Services (AMZN) says it is making significant progress in restoring a large number of EC2, Elastic Block Store (EBS) and Relational Database Service (RDS) instances that went down after a “networking event” in the early-morning hours, an outage that affected a number of popular web sites.

According to its most recent updates on EC2 and EBS:

10:26 a.m. PDT We have made significant progress in stabilizing the affected EBS control plane service. EC2 API calls that do not involve EBS resources in the affected Availability Zone are now seeing significantly reduced failures and latency and are continuing to recover. We have also brought additional capacity online in the affected Availability Zone and stuck EBS volumes (those that were being remirrored) are beginning to recover. We cannot yet estimate when these volumes will be completely recovered, but we will provide an estimate as soon as we have sufficient data to estimate the recovery. We have all available resources working to restore full service functionality as soon as possible. We will continue to provide updates when we have them.

Regarding RDS, where recovery had been lagging earlier, AWS reports:

10:35 a.m. PDT We are making progress on restoring access and IO latencies for affected RDS instances. We recommend that you do not attempt to recover using Reboot or Restore database instance APIs or try to create a new user snapshot for your RDS instance – currently those requests are not being processed.

It’s a long outage, especially by AWS standards, and it underscores the importance of having disaster recovery and failover plans in place when using public cloud resources. In many ways, cloud computing providers are still learning as they go, too: new kinds of problems will get resolved, but the fixes might take some time.