Amazon Says Cloud Problems Caused by Configuration Change

On Friday, Amazon released a post-mortem summary, more than 5,700 words long, explaining why a portion of its cloud services embarrassingly went down two weeks ago.

The long and the short of it is that a configuration change meant to upgrade the capacity of Amazon's network was executed incorrectly, which set off a cascade of further failures.

Amazon's summary explanation states:

"At 12:47 AM PDT on April 21st, a network change was performed as part of our normal AWS [Amazon Web Services] scaling activities in a single Availability Zone in the US East Region. The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS [Elastic Block Store] network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving. As a result, many EBS nodes in the affected Availability Zone were completely isolated from other EBS nodes in its cluster. Unlike a normal network interruption, this change disconnected both the primary and secondary network simultaneously, leaving the affected nodes completely isolated from one another."

The rest of the summary describes what the node isolation caused (including a "re-mirroring storm"), the task Amazon faced in trying to fix the cascading problems, and what it has done and intends to do to keep this from happening again.
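
The misrouted traffic shift in Amazon's explanation comes down to moving load onto a link that could not carry it. The sketch below is only an illustration of that idea, not Amazon's tooling: the `Link` class, the function names, and every capacity figure are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """A network path with a fixed capacity (all figures here are hypothetical)."""
    name: str
    capacity_gbps: float
    load_gbps: float = 0.0

    def headroom(self) -> float:
        # Spare capacity left on this link.
        return self.capacity_gbps - self.load_gbps


def can_absorb(target: Link, shifted_gbps: float) -> bool:
    """Return True if the target can take the shifted traffic without saturating."""
    return shifted_gbps <= target.headroom()


def shift_traffic(source: Link, target: Link) -> None:
    """Move all load from source to target, refusing if the target would saturate."""
    if not can_absorb(target, source.load_gbps):
        raise ValueError(
            f"{target.name} cannot absorb {source.load_gbps} Gbps from {source.name}"
        )
    target.load_gbps += source.load_gbps
    source.load_gbps = 0.0
```

With two invented 100 Gbps primary routers each carrying 40 Gbps, `shift_traffic` from one router to the other succeeds, while the same shift onto an invented 10 Gbps secondary network raises `ValueError` instead of saturating it.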

Amazon may also want to remind its customers to plan for future outages and to configure their web sites accordingly.
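
One simple form of planning for outages is to keep more than one deployment and fall back when the preferred one stops responding. The following is a rough sketch under that assumption; `first_healthy`, the endpoint labels, and the health probe are hypothetical and not part of any Amazon API.

```python
from typing import Callable, Iterable, Optional


def first_healthy(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint whose health probe passes, or None if all fail.

    `is_healthy` is a caller-supplied probe (hypothetical); a probe that
    raises an exception is treated the same as an unhealthy endpoint.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A failing probe counts as "down"; move on to the next endpoint.
            continue
    return None
```

A site might call this with its endpoints listed in order of preference, for example one per availability zone or region, and send traffic to whichever one is returned.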

Amazon says it will provide an automatic 10-day service credit as compensation for the outage, although some customers are grousing that it is not enough, according to an article in the Wall Street Journal. The company hasn't said how it plans to compensate customers whose data it could not recover, or how much the compensation (and the effort to resolve the outage) is costing. I guess we will have to wait for Amazon's quarterly report to find out.



Apologizes again, promises to do better
