Thursday 2 March 2017

Amazon mystery solved: A typo took down a big chunk of the Internet

The major outage that hit tens of thousands of websites using Amazon's AWS cloud computing service on Tuesday turns out to have been the result of a simple typo: just one incorrectly entered command.
The four-hour outage at Amazon Web Services' S3 system, a giant provider of backend services for close to 150,000 websites, caused disruptions, slowdowns and failure-to-load errors across the United States.
Amazon's Simple Storage Service (S3) lets companies use the cloud to store files, photos, video and other information they serve up on their websites. It contains literally trillions of these items, known as "objects" to programmers.
When the system was down, websites could not access the photos, logos, lists or data they normally would have pulled from the cloud. While most of the sites didn't go down, many had broken links and were only partly functional.
On Tuesday morning, an Amazon team was investigating a problem that was slowing down the S3 billing system.
At 9:37 am Pacific time, one of the team members executed a command that was meant to take a few of the S3 servers offline.
"Unfortunately," Amazon said in its posting, one part of that command was entered incorrectly — i.e. it had a typo.
That mistake caused a larger number of servers to be taken offline than they'd wanted. Two of those servers ran some important systems for the whole East Coast region, such as the ones that let all those trillions of files be placed into customers' websites.
All of this wasn't just affecting Amazon's S3 customers; it was also hitting customers of other Amazon cloud services, because it turns out those services use S3, too.
While Amazon says it designed its system to work even if big parts failed, it also acknowledged that it hadn't actually done a full restart on the main subsystems that went offline "for many years."
During that time, the S3 system had gotten a whole lot bigger, so restarting it, and doing all the safety checks to make sure its files hadn't gotten corrupted in the process, took much longer than expected.
It wasn't until 1:54 pm Pacific time, four hours and 17 minutes after the mistyped command was first entered, that the entire system was back up and running.
To make sure the problem doesn't happen again, Amazon has rewritten its software tools so its engineers can't make the same mistake, and it's doing safety checks elsewhere in the system.
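To give a rough idea of what a safeguard of this kind might look like, here is a minimal sketch in Python. It is not Amazon's actual tooling: the `Fleet` class, the `plan_removal` function, the 5% per-command cap and the `min_required` floor are all hypothetical names and values chosen for illustration. The idea is simply that the tool checks a removal request before acting on it, so a mistyped number gets rejected instead of taking servers offline.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request would take too much capacity offline."""


@dataclass
class Fleet:
    # All fields are hypothetical, for illustration only.
    name: str
    active_servers: int
    min_required: int  # floor needed to keep the subsystem serving requests


def plan_removal(fleet: Fleet, requested: int, max_fraction: float = 0.05) -> int:
    """Validate a removal request and return how many servers may be removed.

    Raises UnsafeRemovalError instead of silently removing too much.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")

    # Check 1: cap how much a single command may remove (assumed 5% here).
    per_command_limit = int(fleet.active_servers * max_fraction)
    if requested > per_command_limit:
        raise UnsafeRemovalError(
            f"{fleet.name}: asked to remove {requested} servers, "
            f"but one command may remove at most {per_command_limit}"
        )

    # Check 2: never drop the fleet below its minimum required capacity.
    remaining = fleet.active_servers - requested
    if remaining < fleet.min_required:
        raise UnsafeRemovalError(
            f"{fleet.name}: removal would leave {remaining} servers, "
            f"below the required minimum of {fleet.min_required}"
        )

    return requested
```

With a check like this, a request for 10 servers out of a fleet of 1,000 goes through, while a mistyped request for 100 is refused before anything is taken offline.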
Amazon apologized to its customers for the event, saying it "will do everything we can to learn from this event and use it to improve our availability even further."
THIS REPORT BY ME (SURAJ)
