Amazon explains big AWS outage, says employee error took servers offline, promises changes

March 03, 2017

Amazon has released an explanation of the events that caused Tuesday’s big outage of its Simple Storage Service, also known as S3, which crippled significant portions of the web for several hours.

Amazon said the S3 team was working on an issue that was slowing down its billing system. According to Amazon, the outage began at 9:37 a.m. Pacific, when “an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”

Those servers affected other S3 “subsystems,” one of which was responsible for all metadata and location information in the Northern Virginia data centers. Amazon had to restart these systems and complete safety checks, a process that took several hours. In the interim, it became impossible to complete network requests with these servers. Other AWS services that relied on S3 for storage were also affected.

About three hours after the issues began, parts of S3 started to function again. By about 1:50 p.m. Pacific, all S3 systems were back to normal. Amazon said it had not had to fully restart these S3 systems for several years, and the service had grown extensively in that time, which made the restart take longer than expected.

Amazon said it is making changes as a result of this event, promising to speed up recovery time of S3 systems. The company also created new safeguards to ensure that teams don’t take too much server capacity offline when working on maintenance issues like the S3 billing system slowdown.
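To illustrate the kind of safeguard described above, here is a minimal sketch of a check that refuses a removal command when it would take too many servers offline. It is not Amazon's tooling: the names `RemovalPolicy`, `check_removal` and `CapacityGuardError`, and both threshold values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal request would cut capacity too deeply or too fast."""


@dataclass(frozen=True)
class RemovalPolicy:
    # Both thresholds are illustrative assumptions, not values published by Amazon.
    min_fraction_remaining: float = 0.8  # keep at least 80% of the fleet online
    max_batch_size: int = 5              # remove at most 5 servers per command


def check_removal(total_servers: int, requested: int,
                  policy: RemovalPolicy = RemovalPolicy()) -> int:
    """Return the approved removal count, or raise if the request is unsafe."""
    if requested <= 0 or requested > total_servers:
        raise ValueError("requested must be between 1 and total_servers")

    # Guard 1: cap how many servers a single command may remove.
    if requested > policy.max_batch_size:
        raise CapacityGuardError(
            f"refusing to remove {requested} servers in one command "
            f"(limit is {policy.max_batch_size})"
        )

    # Guard 2: never drop the fleet below its minimum capacity.
    remaining = total_servers - requested
    if remaining < policy.min_fraction_remaining * total_servers:
        raise CapacityGuardError(
            f"removal would leave {remaining} of {total_servers} servers online"
        )

    return requested
```

With these assumed limits, a mistyped input that asks for 50 servers out of 100 is rejected before anything is taken offline, while a request for 3 servers goes through.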

Amazon is also making changes to its service health dashboard, which is designed to track AWS issues. The outage knocked out the dashboard for several hours, and AWS had to distribute updates via its Twitter account and by manually adding text at the top of the page. In its explanation, Amazon said it has changed the dashboard to run across multiple AWS regions.

Amazon concluded its explanation with this message:

“Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.”

Several observers surveyed by GeekWire pointed to the need for redundancy in cloud storage as a key takeaway from the outage. Redundancy in this case can mean spreading data across multiple regions, so that an outage in one area doesn’t cripple an entire site, or using multiple cloud providers.

Anand Hariharan, vice president of products for Mountain View, Calif.-based Webscale Networks, noted that Amazon’s retail website didn’t go down during the outage Tuesday because it didn’t put all its eggs in one cloud basket.

“As AWS’ incredibly disruptive outage this week showed, every major public cloud provider has experienced – or will experience – downtime. In fact, more and more of our customers – particularly those running e-commerce businesses – recognize that they can’t just rely on one cloud provider, or one region. Amazon themselves stayed live and fast because they do exactly this – spread their infrastructure across multiple regions. Hours – and really just minutes – of downtime are a lifetime for businesses. Downtime costs not only revenue, but brand reputation and consumer trust, so companies need to consider their multi-region/multi-cloud strategies today.”
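The multi-region redundancy the observers describe can be sketched as a read path that tries each region in turn and falls back when one is unavailable. This is a simplified illustration, not any provider's API: `read_with_fallback`, the two stand-in fetchers and the region labels are all assumptions, and a real deployment would also need the data replicated to each region beforehand.

```python
from typing import Callable, Dict, Mapping, Sequence, Tuple


class AllRegionsFailedError(Exception):
    """Raised when no configured region could serve the request."""


def read_with_fallback(
    key: str,
    fetchers: Mapping[str, Callable[[str], bytes]],
    order: Sequence[str],
) -> Tuple[str, bytes]:
    """Try each region in `order`; return (region, data) from the first success."""
    errors: Dict[str, Exception] = {}
    for region in order:
        try:
            return region, fetchers[region](key)
        except Exception as exc:  # any failure means: move on to the next region
            errors[region] = exc
    raise AllRegionsFailedError(f"no region could serve {key!r}: {errors}")


# Hypothetical stand-ins for per-region storage clients.
def failing_region(key: str) -> bytes:
    raise ConnectionError("region unavailable")


def healthy_region(key: str) -> bytes:
    return b"data for " + key.encode()


region, data = read_with_fallback(
    "logo.png",
    {"us-east-1": failing_region, "us-west-2": healthy_region},
    order=["us-east-1", "us-west-2"],
)
print(region, data)  # us-west-2 b'data for logo.png'
```

The ordering list keeps the preferred region first, so the fallback is only used when the primary fails.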

The internet reacted in a pretty jovial manner to the outage Tuesday, with many taking the outage as a chance for a “digital snow day.” Amazon’s explanation of the outage earned praise from some for the company’s transparency and scorn from others.

Text by Geek
