We go on an IT Hiccups hiatus for a week and wouldn’t you know it, Facebook does a worldwide IT face plant for thirty minutes while mobile phone users of two of the three largest telecom providers in Australia, Optus and Vodafone, coincidentally suffer concurrent nationwide network outages for hours on the same day. Microsoft follows that with back-to-back Office 365-related outages, each lasting more than six hours. In addition, there were system operational troubles in Finland, India and New York, to name but a few. So, we decided to focus this week’s edition of IT problems, snafus and snarls on the recent outbreak of reported service disruptions that happened around the world, as well as those sincere-sounding but ultimately vacuous apologies that now always accompany them.
Our operational oofta review begins last Tuesday, when Microsoft’s Exchange Online was disrupted for some users from around 0630 until almost 1630 East Coast time, leaving those affected without email, calendar and contact information capability. The disruption was somewhat embarrassing for Microsoft, which likes to tout that its cloud version of Office 365 is effectively always available (or at least 99.9% of the time).
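To put that 99.9% figure in perspective, here is a rough back-of-the-envelope sketch of how much downtime such a target leaves room for. It assumes availability is simply uptime divided by total time for an affected user, and it uses a 30-day month and a 365-day year; how Microsoft actually measures its figure is not stated here, so treat the function name and the periods as illustrative assumptions.

```python
# Downtime permitted by an availability target over a given period.
# Assumption: availability = uptime / total time for an affected user.
# This is an illustration, not Microsoft's actual measurement method.

def allowed_downtime_hours(availability: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period at the given availability."""
    return (1.0 - availability) * period_hours

HOURS_PER_YEAR = 365 * 24   # 8760 hours
HOURS_PER_MONTH = 30 * 24   # 720 hours, assuming a 30-day month

# Roughly 0630 to 1630 East Coast time, as reported above.
outage_hours = 10.0

yearly_budget = allowed_downtime_hours(0.999, HOURS_PER_YEAR)
monthly_budget = allowed_downtime_hours(0.999, HOURS_PER_MONTH)

print(f"yearly budget:  {yearly_budget:.2f} h")
print(f"monthly budget: {monthly_budget * 60:.1f} min")
print(f"outage exceeds yearly budget: {outage_hours > yearly_budget}")
```

Under those assumptions, a user who was locked out for the full stretch on Tuesday would have used up more than a year's worth of the downtime that a 99.9% target allows.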
PC Advisor reported that Microsoft's investigation into the outage “determined that a portion of the networking infrastructure entered into a degraded state. Engineers made configuration changes on the affected capacity to remediate end-user impact.” Microsoft later explained that the failure uncovered a “previously unknown error” that took some time to correct. Microsoft has chosen to remain mum, however, about how many Exchange users were affected by the interruption of service.
Redmond Magazine published Microsoft’s upbeat apology for the disruption which stated, “We sincerely apologize to customers for any inconvenience this incident may have caused and continuously strive to improve our service and using these opportunities to drive even greater excellence in our service delivery.”
One service improvement suggestion Exchange users vigorously made to Microsoft during the outage was to actually indicate on its service health dashboard that there was, in fact, an outage occurring—something that apparently wasn’t effectively done. Microsoft later admitted sheepishly that the reason for lack of timely notice of the outage was that there was also a problem with the publishing process related to its dashboard. Users also strongly suggested that Microsoft should refrain from using the words “delay” and “opportunities” together in describing future day-long outages since those words didn’t seem to fit with the experiences of users.
Microsoft’s problems on Tuesday were preceded on Monday by “issues” experienced in North America (and reportedly by some outside NA) by users of its Lync instant messaging service, which is also part of the Office 365 suite (Lync, we should note, is also available as a standalone product, as is Exchange). Microsoft indicated that the issues involved its “network routing infrastructure.” The outage began at about 0700 East Coast time, with service disruptions still being reported into early Monday evening.
Interestingly, no one seemed to report on Microsoft’s apology for the Lync outage, which may be because the Exchange outage came so quickly on its heels, or perhaps there are so few users solely dependent upon Lync for their communications that there was really no one around to apologize to.
Microsoft’s dual outages were themselves preceded by a global outage of Facebook that took place the previous week on Thursday, 19 June. That oofta took place at about 0400 East Coast time, and lasted for only about 30 minutes. However, even that short period of time apparently left some of its 1.2 billion users “frustrated,” at least according to London’s Daily Mail.
Facebook initially sent out the expected pro-forma apology, “We're sorry for any inconvenience this may have caused.” Later, however, Facebook spokesman Jay Nancarrow expanded on Facebook's “Sorry, something went wrong” website message users encountered during the service interruption:
“Late last night, we ran into an issue while updating the configuration of one of our software systems. Not long after we made the change, some people started to have trouble accessing Facebook. We quickly spotted and fixed the problem... This doesn't happen often, but when it does we make sure we learn from the experience so we can make Facebook that much more reliable.”
Also on that same Thursday and in a weird coincidence, first Vodafone and then Optus mobile phone customers in Australia reported that they were unable to make calls or send text messages for most of the day. According to a story at the Sydney Morning Herald, Vodafone’s problem began Thursday morning in Western Australia and then soon spread across the country. The paper said that the service disruption stemmed from a combination of a faulty repeater on the primary fiber link connecting Western Australia to the rest of Australia and a back-up cable failing as well. It took until 1830 AEST for the outage to be completely fixed, the Herald reported.
According to a story in the Financial Review, the Optus problem involved an issue with the IMEI (International Mobile Equipment Identity) numbers, which are the distinctive identifiers given each mobile phone. The Review stated that beginning at about 1300 AEST, “the unique IMEI numbers of several thousand mobile handsets were accidentally deemed to be stolen” in the Optus central database, thus, “blocking them from using Optus's mobile network for several hours.” The problem wasn’t completely sorted until after 2100 AEST. Unfortunately for us curious types, how the IMEI error occurred was not explained.
Optus said it was “sorry” for the problem and indicated that the company would give a service credit to compensate customers for their troubles. Similarly, Vodafone’s Chief Technology Officer said he was “sorry” for the telecom's service oofta and indicated that the company's customers would receive unlimited data usage within Australia for a weekend as its compensation.
And in even more of a fluke, a hardware failure that same Thursday afternoon in the Melbourne region of Australia took out Telstra's ADSL service for a couple of hours as well. This outage meant that the three largest telecom companies in Australia were experiencing some sort of newsworthy IT-related failure simultaneously.
Returning to last week, the Finnish broadcasting company (Yle) news program Yle Uutiset reported Tuesday that the Finland-wide government passport system fault had been resolved. Yle stated that the fault involved “upgrades to the encryption system over midsummer [that] had prevented the passage of information between police and the Population Register Centre.” While Finnish officials at first feared that the fault would prevent the issuance of new passports for possibly weeks, Yle reported that a solution to the problem had been found within a day.
Then last Thursday morning, the India Times reported that the “technical problems” that knocked out the websites of several Indian government ministries, including finance and defense as well as the Prime Minister’s Office, for several hours on Wednesday evening had been fixed. There were no details in the Times as to the root cause of the outage, other than an indication that it was equipment-related and not a cyber-attack.
Finally, Time Warner Cable announced last week that the email problem that affected many of its customers in Central New York State for several weeks was finally resolved. The problem was eventually discovered to be in a “database used by its email servers.” Even though Time Warner early on designated the issue a “top priority” to be fixed, a solution proved frustratingly harder to implement than first thought.
Time Warner spokesman Scott Pryzwansky said in a statement to its customers that, “We apologize if you have been affected by this inconvenience. Providing you with reliable service is our top priority.”
Time Warner may want to work on improving the sincerity of its apology, though. Pryzwansky neglected to mention that Time Warner was going to embrace the opportunity the weeks-long service oofta offered to improve its customers’ service delivery à la Facebook and Microsoft.
Well, maybe next time.
Robert N. Charette is a Contributing Editor to IEEE Spectrum and an acknowledged international authority on information technology and systems risk management. A self-described “risk ecologist,” he is interested in the intersections of business, political, technological, and societal risks. Charette is an award-winning author of multiple books and numerous articles on the subjects of risk management, project and program management, innovation, and entrepreneurship. A Life Senior Member of the IEEE, Charette was a recipient of the IEEE Computer Society’s Golden Core Award in 2008.