It being the Thanksgiving holiday week in the United States, I was tempted to write once more about the LA Unified School District’s MiSiS turkey of a project, which the LAUSD Inspector General fully addressed in a report [pdf] released last week. If you like your IT turkey burnt to a crisp, over-stuffed with project management arrogance, served with heapings of senior management incompetence, and topped off a ladleful of lumpy gravy of technical ineptitude, you’ll feast mightily on the IG report. However, if you are a parent of the over 1,000 LAUSD school district students who still have not received a class schedule nearly 40 percent of the way into the academic year—or a Los Angeles taxpayer for that matter—you may get extreme indigestion from reading it.
However, the winner of the latest IT Hiccup of the Week award goes to Microsoft for the intermittent outages that hit its Azure cloud platform last Wednesday, disrupting an untold number of customer websites along with Microsoft Office 365, Xbox Live , and other services across the United States, Europe, Japan, and Asia. The outages occurred over an 11-hour (and in some cases longer) period.
According a detailed postby Microsoft Azure corporate vice president Jason Zanderon, the outage was caused by “a bug that got triggered when a configuration change in the Azure Storage Front End component was made, resulting in the inability of the Blob [Binary Large Object] Front-Ends to take traffic.”
The configuration change was made as part of a “performance update” to Azure Storage, that when made, exposed the bug, and “resulted in reduced capacity across services utilizing Azure Storage, including Virtual Machines, Visual Studio Online, Websites, Search and other Microsoft services.” The bug, which had escaped detection during “several weeks of testing,” caused the storage Blob Front-Ends to go into an infinite loop, Zander stated. “The net result,” he wrote, “was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues.”
Once the error was detected, the configuration change was rolled backed immediately. However, the Blob Front-Ends needed a restart to halt their infinite looping, which slowed the recovery time, Zander wrote.
The effects of the bug could have been contained, except that Zander indicated someone apparently didn’t follow standard procedure in rolling out the performance update.
“Unfortunately the issue was wide spread, since the update was made across most regions in a short period of time due to operational error, instead of following the standard protocol of applying production changes in incremental batches.”
Zander apologized for the “inconvenience” and says that it is going to “closely examine what went wrong and ensure it never happens again.”
In Other News…
Robert N. Charette is a Contributing Editor to IEEE Spectrum and an acknowledged international authority on information technology and systems risk management. A self-described “risk ecologist,” he is interested in the intersections of business, political, technological, and societal risks. Charette is an award-winning author of multiple books and numerous articles on the subjects of risk management, project and program management, innovation, and entrepreneurship. A Life Senior Member of the IEEE, Charette was a recipient of the IEEE Computer Society’s Golden Core Award in 2008.