Outages Galore: Microsoft, Facebook, Oz Telecom Users are Unhappy Lot

Illustration: Randi Klett

IT Hiccups of the Week

We go on an IT Hiccups hiatus for a week and wouldn’t you know it, Facebook does a worldwide IT face plant for thirty minutes while mobile phone users of two of the three largest telecom providers in Australia, Optus and Vodafone, coincidentally suffer nationwide network outages for hours on the same day. Microsoft follows that with back-to-back Office 365-related outages, each lasting more than six hours. In addition, there were system operational troubles in Finland, India and New York, to name but a few. So we decided to focus this week’s edition of IT problems, snafus and snarls on the recent outbreak of reported service disruptions around the world, as well as on the sincere-sounding but ultimately vacuous apologies that now always accompany them.

Our operational oofta review begins last Tuesday, when Microsoft’s Exchange Online was disrupted for some users from around 0630 until almost 1630 East Coast time, leaving those affected without email, calendar and contact information capability. The disruption was somewhat embarrassing for Microsoft, which likes to tout that its cloud version of Office 365 is effectively always available (or at least 99.9% of the time).
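For a sense of scale, here is a back-of-the-envelope sketch of how a roughly ten-hour outage compares with the downtime that a 99.9% availability figure would allow. The measurement windows (a 30-day month and a 365-day year) and the function name are assumptions made for illustration; how Microsoft actually measures availability is not stated here.

```python
def allowed_downtime_hours(availability_pct: float, window_hours: float) -> float:
    """Hours of downtime permitted in a window at the given availability."""
    return window_hours * (1 - availability_pct / 100)


# Roughly 0630 to 1630 East Coast time, per the reported outage window.
OUTAGE_HOURS = 10.0

# Assumed measurement windows: a 30-day month and a 365-day year.
month_budget = allowed_downtime_hours(99.9, 30 * 24)
year_budget = allowed_downtime_hours(99.9, 365 * 24)

print(f"Monthly downtime budget at 99.9%: {month_budget:.2f} hours")
print(f"Yearly downtime budget at 99.9%: {year_budget:.2f} hours")
print(f"Outage exceeds the yearly budget: {OUTAGE_HOURS > year_budget}")
```

Under those assumptions, 99.9% leaves room for well under an hour of downtime per month and a bit under nine hours per year, so a user who was down for the full ten hours would have seen more than a year's allowance in a single day. Since only some users were affected, the service-wide figure would look better than this per-user view.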

PC Advisor reported that Microsoft's investigation into the outage “determined that a portion of the networking infrastructure entered into a degraded state. Engineers made configuration changes on the affected capacity to remediate end-user impact.” Microsoft later explained that the failure uncovered a “previously unknown error” that took some time to correct. Microsoft has chosen to remain mum, however, about how many Exchange users were affected by the interruption of service.

Redmond Magazine published Microsoft’s upbeat apology for the disruption, which stated, “We sincerely apologize to customers for any inconvenience this incident may have caused and continuously strive to improve our service and using these opportunities to drive even greater excellence in our service delivery.”

One service improvement suggestion Exchange users vigorously made to Microsoft during the outage was to actually indicate on its service health dashboard that there was, in fact, an outage occurring, something that apparently wasn’t effectively done. Microsoft later admitted sheepishly that the reason for the lack of timely notice was a separate problem with the publishing process for its dashboard. Users also strongly suggested that Microsoft refrain from using the words “delay” and “opportunities” together in describing future day-long outages, since those words didn’t seem to fit their experiences.

Microsoft’s problems on Tuesday were preceded on Monday by “issues” experienced by users in North America (and reportedly by some outside it) of its Lync instant messaging service, which is also part of the Office 365 suite (Lync, we should note, is also available as a standalone product, as is Exchange). Microsoft indicated that the issues involved its “network routing infrastructure.” The outage began at about 0700 East Coast time, with service disruptions still being reported into early Monday evening.

Interestingly, no one seemed to report on Microsoft’s apology for the Lync outage, which may be because the Exchange outage came so quickly on its heels, or perhaps there are so few users solely dependent upon Lync for their communications that there was really no one around to apologize to.

Microsoft’s dual outages were themselves preceded by a global outage of Facebook the previous week, on Thursday, 19 June. That oofta began at about 0400 East Coast time and lasted only about 30 minutes. However, even that short period apparently left some of its 1.2 billion users “frustrated,” at least according to London’s Daily Mail.

Facebook initially sent out the expected pro-forma apology, “We're sorry for any inconvenience this may have caused.” Later, however, Facebook spokesman Jay Nancarrow expanded on the “Sorry, something went wrong” website message that users encountered during the service interruption:

“Late last night, we ran into an issue while updating the configuration of one of our software systems. Not long after we made the change, some people started to have trouble accessing Facebook. We quickly spotted and fixed the problem...  This doesn't happen often, but when it does we make sure we learn from the experience so we can make Facebook that much more reliable.”

Also on that same Thursday and in a weird coincidence, first Vodafone and then Optus mobile phone customers in Australia reported that they were unable to make calls or send text messages for most of the day. According to a story at the Sydney Morning Herald, Vodafone’s problem began Thursday morning in Western Australia and then soon spread across the country. The paper said that the service disruption stemmed from a combination of a faulty repeater on the primary fiber link connecting Western Australia to the rest of Australia and a back-up cable failing as well. It took until 1830 AEST for the outage to be completely fixed, the Herald reported.

According to a story in the Financial Review, the Optus problem involved an issue with the IMEI (International Mobile Equipment Identity) numbers, which are the distinctive identifiers given to each mobile phone. The Review stated that beginning at about 1300 AEST, “the unique IMEI numbers of several thousand mobile handsets were accidentally deemed to be stolen” in the Optus central database, thus “blocking them from using Optus's mobile network for several hours.” The problem wasn’t completely sorted until after 2100 AEST. Unfortunately for us curious types, how the IMEI error occurred was not explained.

Optus said it was “sorry” for the problem and indicated that the company would give a service credit to compensate customers for their troubles. Similarly, Vodafone’s Chief Technology Officer said he was “sorry” for the telecom's service oofta and indicated that the company's customers would receive unlimited data usage within Australia for a weekend as its compensation.

And in even more of a fluke, a hardware failure that same Thursday afternoon in the Melbourne region of Australia took out Telstra's ADSL service for a couple of hours as well. This outage meant that the three largest telecom companies in Australia were experiencing some sort of newsworthy IT-related failure simultaneously.

Returning to last week, the Finnish broadcasting company (Yle) news program Yle Uutiset reported Tuesday that the Finland-wide government passport system fault had been resolved. Yle stated that the fault involved “upgrades to the encryption system over midsummer [that] had prevented the passage of information between police and the Population Register Centre.” While Finnish officials at first feared that the fault would prevent the issuance of new passports for possibly weeks, Yle reported that a solution to the problem had been found within a day.

Then last Thursday morning, the India Times reported that the “technical problems” that knocked out the websites of several Indian government ministries, including finance and defense, as well as that of the Prime Minister’s Office, for several hours on Wednesday evening had been fixed. There were no details in the Times as to the root cause of the outage, other than an indication that it was equipment-related and not a cyber-attack.

Finally, Time Warner Cable announced last week that the email problem that had affected many of its customers in Central New York State for several weeks was finally resolved. The problem was eventually traced to a “database used by its email servers.” Even though Time Warner early on designated the issue a “top priority” to be fixed, a solution proved frustratingly harder to implement than first thought.

Time Warner spokesman Scott Pryzwansky said in a statement to its customers that, “We apologize if you have been affected by this inconvenience. Providing you with reliable service is our top priority.”

Time Warner may want to work on improving the sincerity of its apology, though. Pryzwansky neglected to mention that Time Warner was going to embrace the opportunity the weeks-long service oofta offered to improve its customers’ service delivery à la Facebook and Microsoft.

Well, maybe next time.

In Other News…

Australian Westpac Subsidiary St George Bank Loses Online Banking Services

GLONASS April Outage Explained

Intercontinental Exchange Glitches Twice Suspend Trading

Illinois Driver License Facilities Shut Due to Mainframe Issues

UK Energy Regulator Tells Npower Fix Persistent Billing Problems or Else

Computer Problems Delay Flights From Israel’s Ben-Gurion Airport

Tennessee Department of Children’s Services IT System Still Suffering Costly Glitches

Problems Linger in New Hampshire's Medicaid Billing Computer System

Target Experiences Nationwide Checkout Problem

Nebraska Racetrack Overpays $6,000 on Wager

Sprint Cable Cut Takes out Emergency 911 in Parts of Oregon and Washington State

Computer Glitch Frees 25 Inmates from Dallas Jail

UK Tax Agency Computer System Riddled with Errors

Dutch Safety Agencies Warn About Potential ILS Interference with Autopilots

Drone Glitch and Bad Decisions Led to Drone Crashing into US Navy Ship

GM Recalls 425,000 SUVs and Trucks for Software Fix

USA-Germany World Cup Match Overloads ESPN Streaming Services

Veteran Affairs Computer Insists Veteran is Dead; He Disagrees


Risk Factor

IEEE Spectrum's risk analysis blog, featuring daily news, updates and analysis on computing and IT projects, software and systems failures, successes and innovations, security threats, and more.

Contributor
Willie D. Jones
 