There are several news reports today, such as those at the BBC, MSNBC and ZDNet, about problems plaguing Amazon's Elastic Compute Cloud (EC2) web hosting service that have brought down or disrupted scores of web sites, including Foursquare, Quora, Reddit and Hootsuite.
Amazon said on its web services health dashboard that the problem began around 0141 PDT at its EC2 site in Northern Virginia, with what it described as "latency and error rates with EBS [Elastic Block Store] volumes and connectivity issues reaching EC2 instances in the US-EAST-1 region."
The problem then apparently cascaded into Amazon's Relational Database Service in Northern Virginia as well as Amazon Web Services (AWS) Elastic Beanstalk, which is a way for developers to quickly deploy and manage applications in the AWS cloud.
At 0854 PDT, Amazon said:
"We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them."
Amazon has posted no more recent messages about when service will be completely restored, although some of the affected web sites say that things are getting back to normal. A Bloomberg News report at 1244 EST, however, says that "scant headway" is being made in fixing the problems.
Expect this event to raise more questions about the reliability of cloud computing.
Robert N. Charette is a Contributing Editor to IEEE Spectrum and an acknowledged international authority on information technology and systems risk management. A self-described “risk ecologist,” he is interested in the intersections of business, political, technological, and societal risks. Charette is an award-winning author of multiple books and numerous articles on the subjects of risk management, project and program management, innovation, and entrepreneurship. A Life Senior Member of the IEEE, Charette was a recipient of the IEEE Computer Society’s Golden Core Award in 2008.