Well, Amazon seems to have finally cleared almost all of its cloud computing problems from last week. According to a post on its web services service health dashboard last night at 1939 PDT:
"As we posted early this morning, RDS [relational database service] is now operating normally in all Availability Zones for all APIs [application programming interfaces]. In addition, access to the vast majority of affected database instances has now been restored. We're in the process of contacting a limited number of customers who have database instances that are not yet accessible and will continue to work hard on restoring access to this small number of remaining database instances...."
"We are digging deeply into the root causes of this event and will post a detailed post mortem."
Amazon's post mortem will no doubt be highly scrutinized given the visibility and impact of the outage. That said, as pointed out by this nice post at Data Center Knowledge, some Amazon web services customers, like Netflix and SmugMug, did not go down because they apparently read and took heed of Amazon's warnings and advice (see PDF) about hosting web applications in its cloud, which says to "... make sure that there are provisions for migrating single points of access across Availability Zones in the case of failure." Other customers who went down for the count, like EveryBlock, admitted that they hadn't done so and, in retrospect, should have.
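To make that advice a little more concrete, here is a minimal Python sketch of the idea of moving a single point of access to another Availability Zone when its home zone fails. It is only an illustration under stated assumptions: the `Endpoint` class, the `pick_endpoint` function, the zone names, and the addresses are all hypothetical, and it does not use any Amazon API or describe how Netflix or SmugMug actually built their systems.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    """One copy of a service running in one availability zone (hypothetical)."""
    zone: str
    address: str


def pick_endpoint(
    endpoints: Sequence[Endpoint],
    is_healthy: Callable[[Endpoint], bool],
) -> Optional[Endpoint]:
    """Return the first endpoint, in priority order, that passes the health check.

    Returns None when every zone fails, so the caller can decide how to degrade.
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical deployment: the same service running in two zones.
ENDPOINTS = [
    Endpoint(zone="zone-a", address="10.0.1.10"),
    Endpoint(zone="zone-b", address="10.0.2.10"),
]

# Simulate an outage in zone-a; a real health check would probe the service itself.
failed_zones = {"zone-a"}
chosen = pick_endpoint(ENDPOINTS, lambda e: e.zone not in failed_zones)
print(chosen)
```

The point of the sketch is simply that the choice of where traffic goes is made by a health check rather than hard-coded to one zone, so losing a single zone does not take the whole application down.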
Sony's PlayStation Network, which also went down last week, is still down and may be for quite some time. Sony said late Friday in a post at its PlayStation blog site:
"An external intrusion on our system has affected our PlayStation Network and Qriocity services. In order to conduct a thorough investigation and to verify the smooth and secure operation of our network services going forward, we turned off PlayStation Network & Qriocity services on the evening of Wednesday, April 20th. Providing quality entertainment services to our customers and partners is our utmost priority. We are doing all we can to resolve this situation quickly, and we once again thank you for your patience. We will continue to update you promptly as we have additional information to share."
Over the weekend, Sony went on to say:
"We sincerely regret that PlayStation Network and Qriocity services have been suspended, and we are working around the clock to bring them both back online. Our efforts to resolve this matter involve re-building our system to further strengthen our network infrastructure. Though this task is time-consuming, we decided it was worth the time necessary to provide the system with additional security.
We thank you for your patience to date and ask for a little more while we move towards completion of this project. We will continue to give you updates as they become available."
Questions, however, are being raised about why Sony has been so closed-mouthed about the situation, why it doesn't seem to have a recovery plan in place, and even whether the intrusion explanation is the full story.
One thing is for sure, though: the longer the outage lasts, the more revenue Sony is going to lose, and it will probably lose more than a few of its 70 million gamers to Microsoft as well.
Robert N. Charette is a Contributing Editor to IEEE Spectrum and an acknowledged international authority on information technology and systems risk management. A self-described “risk ecologist,” he is interested in the intersections of business, political, technological, and societal risks. Charette is an award-winning author of multiple books and numerous articles on the subjects of risk management, project and program management, innovation, and entrepreneurship. A Life Senior Member of the IEEE, Charette was a recipient of the IEEE Computer Society’s Golden Core Award in 2008.