If you had plans to travel on Delta Air Lines today or tomorrow, you had plans. At this point, you might want to consider another carrier, a rental car, a bus, or even a bicycle with a basket on the handlebars. That’s because, according to Delta, a power outage that wreaked havoc on its hub in Atlanta brought down the airline’s computer system. And one expert says it’s just the latest data point in a string of airline industry IT problems.
The outage wasn’t just local: Delta flights all across the globe were affected by the computer system shutdown, which lasted for around six hours and ended just before 9 a.m. Eastern time this morning. Flights already in progress were completed as scheduled, but the computer snafu meant that, in the words of an old commercial for roach motels, planes could check in, but they couldn’t check out.
Delta officials told CNN that flight departures have resumed on a limited basis, but warned that “customers heading to the airport should expect delays and cancellations.” Thus far, the world’s second-largest airline has canceled hundreds of its 15,000 daily flights. The reverberations from those cancellations are bound to last for days.
The airline’s advice to check its website for up-to-the-minute information “was something of a joke,” says Robert Charette, a self-described “risk ecologist” who is an internationally acknowledged authority and pioneer in risk management, information systems and technology, and systems engineering. Charette adds, “As Minnesota Public Radio said this morning, the information on the website was incorrect because the computer system was down. I don’t know how long it will take, now that the computer system is back online, for all the worldwide flight data to be brought up to date. I definitely wouldn’t want to be booked on a Delta flight today or tomorrow.”
Who, us? Hacked?
Delta’s spokespeople came right out of the gate (pun intended) trying to get ahead of any concern that the computer outage resulted from the work of hackers. Their reasoning must have been: The damage to the carrier’s reputation—and its bottom line—that would occur if travelers suspected that there are security concerns would be far larger than the fallout from stranding hundreds of thousands of passengers because of what it considers an uncontrollable event.
But the airline still doesn’t get a pass. Today’s incident is but the latest in a string of “isolated incidents” that together reveal a pattern that is cause for concern. Charette puts today’s outage in context:
There have been several reservation system outages that have hit worldwide airline ops with distressing regularity over the past few years. Southwest Airlines had one just a few weeks ago. (It had another big one in June of 2013 and another in October 2015.) What you’ll see in reviewing them is recurring problems with infrastructure (i.e., power, networks, routers, servers, etc.) that seem to keep surprising the airlines. In every case I can recall, there were backup systems in place, but they failed, which is another recurring theme. The Southwest CEO claimed that the last outage, caused by a router, was equivalent to a 1000-year flood. Not only was that a comical overstatement, but it also shows the thinking that is probably [leading to the airlines] skimping on contingency management preparations.
Charette knows whereof he speaks. For years, he was the lead contributor to IEEE Spectrum’s Risk Factor blog, where he catalogued IT failures big and small, foreseeable and inexplicable. Last October, he authored the special report “Lessons From a Decade of IT Failures,” which examined the takeaways from the blog’s tracking of the big IT debacles of the previous decade.
Based on recent events, it’s clear that 1000-year floods occur far more frequently than they used to.
Willie Jones is an associate editor at IEEE Spectrum. In addition to editing and planning daily coverage, he manages several of Spectrum's newsletters and contributes regularly to the monthly Big Picture section that appears in the print edition.