Amazon said the S3 team was working on an issue that was slowing down the S3 billing system. According to Amazon, the outage began at 9:37 a.m. Pacific, when “an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
The removed servers supported other S3 “subsystems,” one of which was responsible for all metadata and location information in the Northern Virginia data centers. Amazon had to restart these systems and complete safety checks, a process that took several hours. In the interim, those subsystems could not serve requests. Other AWS services that relied on S3 for storage were also affected.
About three hours after the issues began, parts of S3 started to function again. By about 1:50 p.m. Pacific, all S3 systems were back to normal. Amazon said it had not fully restarted these S3 systems for several years, and the service has grown extensively since then, so the restart took longer than expected.
Amazon said it is making changes as a result of this event, promising to speed up recovery time of S3 systems. The company also created new safeguards to ensure that teams don’t take too much server capacity offline when working on maintenance issues like the S3 billing system slowdown.
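Amazon did not publish the tooling behind these safeguards. As a rough illustration only, a check of this kind could refuse any request that would drop capacity below a floor and split approved removals into small batches. Every name and threshold in the sketch below (plan_removal, min_required, max_per_step) is hypothetical and not taken from Amazon's systems.

```python
class CapacityError(Exception):
    """Raised when a removal request would leave too little capacity."""


def plan_removal(active, requested, min_required, max_per_step):
    """Return the batch sizes for removing `requested` servers.

    All parameters are hypothetical: `active` is the number of servers
    currently in service, `min_required` is the floor the subsystem needs,
    and `max_per_step` caps how many servers a single step may remove.
    """
    if requested < 0 or max_per_step <= 0:
        raise ValueError("requested must be >= 0 and max_per_step > 0")
    # Reject the whole request up front if it would breach the floor.
    if active - requested < min_required:
        raise CapacityError(
            f"removing {requested} of {active} servers would fall below "
            f"the minimum of {min_required}"
        )
    # Otherwise remove slowly, in batches no larger than max_per_step.
    batches = []
    remaining = requested
    while remaining > 0:
        step = min(max_per_step, remaining)
        batches.append(step)
        remaining -= step
    return batches
```

Under these assumptions, a mistyped input that asks for far more servers than intended fails before anything is taken offline, instead of relying on the operator to catch the error.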
Amazon is also making changes to its service health dashboard, which is designed to track AWS issues. The outage knocked out the dashboard for several hours, and AWS had to distribute updates via its Twitter account and by adding text to the top of the page. In its message, Amazon said it has changed the dashboard to run across multiple AWS regions.
Amazon concluded its explanation with this message:
Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.
Several observers surveyed by GeekWire pointed to the need for redundancy in cloud storage as a key takeaway from the outage. Redundancy in this case can mean spreading data across multiple regions, so that an outage in one area doesn’t cripple an entire site, or using multiple cloud providers.
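On the application side, that kind of redundancy often comes down to trying a second copy of the data when the first is unreachable. The following is a minimal sketch of the idea, not any provider's API: read_with_fallback and the two stand-in fetch functions are hypothetical, and a real deployment would plug in actual storage clients and keep the copies in sync.

```python
def read_with_fallback(key, replicas):
    """Try each replica in order; return (name, data) from the first success.

    `replicas` is a list of (name, fetch) pairs, where `fetch(key)` returns
    the object's bytes or raises an exception if that region or provider
    is unavailable. The names and fetch functions are placeholders.
    """
    errors = []
    for name, fetch in replicas:
        try:
            return name, fetch(key)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all replicas failed: " + "; ".join(errors))


# Stand-in fetchers that simulate one region being down.
def fetch_primary(key):
    raise ConnectionError("region unavailable")


def fetch_secondary(key):
    return b"contents of " + key.encode()


if __name__ == "__main__":
    replicas = [("us-east-1", fetch_primary), ("us-west-2", fetch_secondary)]
    print(read_with_fallback("logo.png", replicas))
```

The order of the list sets the preference, so the same function covers a second region or a second cloud provider.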
Anand Hariharan, vice president of products for Mountain View, Calif.-based Webscale Networks, noted that Amazon’s retail website didn’t go down during the outage Tuesday because it didn’t put all its eggs in one cloud basket.
As AWS’ incredibly disruptive outage this week showed, every major public cloud provider has experienced – or will experience – downtime. In fact, more and more of our customers – particularly those running e-commerce businesses – recognize that they can’t just rely on one cloud provider, or one region. Amazon themselves stayed live and fast because they do exactly this – spread their infrastructure across multiple regions. Hours – and really just minutes – of downtime are a lifetime for businesses. Downtime costs not only revenue, but brand reputation and consumer trust, so companies need to consider their multi-region/multi-cloud strategies today.
The internet reacted in a pretty jovial manner to the outage Tuesday, with many taking the outage as a chance for a “digital snow day.” Amazon’s explanation of the outage earned praise from some for the company’s transparency and scorn from others.