The Making of "Lessons From a Decade of IT Failures"

Photo: Randi Klett

In the fall of 2005, IEEE Spectrum published a special report on software that explored the question of why software fails and possible ways to prevent such failures. Not long afterwards, we started this blog, which documents IT project and operational failures, ooftas, and other technical hitches from around the world.

The tenth anniversary of that report seems to be an appropriate time to step back, look through 1750 blog posts, and give an overall impression of what has and hasn’t changed vis-à-vis software crashes and glitches. Obviously, the rise in both the frequency and cost of cybercrime is one major change, but we decided at least for now to concentrate our limited resources on past unintended IT system development and operations failures.

I deliberated at length with Spectrum’s former senior interactive editor Josh Romero, who’s responsible for the data visualizations in our interactive survey “Lessons From a Decade of IT Failures,” about which IT project and operational failures to include. This was a non-trivial task considering that many Risk Factor blog posts were roundups that discuss multiple project failures and operational problems.

To make our work manageable, we decided to include those IT projects and systems that experienced significant trouble. For development projects, this meant being cancelled, suffering a major cost or schedule blowout, or delivering far less than promised. For operational IT systems, suffering a major disruption of some kind qualified them for consideration.

To help winnow down the number of possibilities further, we concentrated on those project failures or incidents where there was reliable documentation of what happened, why it happened, and the consequences, in terms of cost and/or people affected. If there is one characteristic that hasn’t changed in terms of IT project development or operational failures, it is the lack of reliable and detailed incident data publicly available. We will discuss this particular issue more thoroughly in a future blog post.

This highlights another aspect of the data we’re using in “Lessons From a Decade of IT Failures.” The data is skewed not only because of our choices of what to include and not, but because of which data about incidents actually makes it into the public domain. The majority of project failures shown are government projects because they tend to be visible thanks to government accountability mechanisms. Private companies tend to bury their IT failures, so except for the rare lawsuit, their operational failures rarely make it into the news unless they impact a significant number of their customers or government regulators become involved. It should also be obvious that the data is skewed by our dependence upon English-language news reporting of project failures and operational meltdowns.

Even given the limitations of the data, the lessons we draw from them indicate that IT project failures and operational issues are occurring more regularly and with bigger consequences. This isn’t surprising as IT in all its various forms now permeates every aspect of global society. It is easy to forget that Facebook launched in 2004, YouTube in 2005, Apple’s iPhone in 2007, or that there have been three new versions of Microsoft Windows released since 2005. IT systems are definitely getting more complex and larger (in terms of data captured, stored and manipulated), which means not only are they increasingly difficult and costly to develop, but they’re also harder to maintain. Further, when an operational IT system experiences an outage, many more people are affected now than ever before, sometimes “inconveniencing” (to borrow from the lexicon of PR types who have to try to explain these messes) millions or even tens of millions of people globally, a magnitude of technological carnage that prior to 2005 was a relatively rare event.

On top of that, during the past decade we have seen major IT modernization efforts in the airline, banking, financial and healthcare industries, and especially in government, generally aimed at replacing legacy IT systems that went live in the 1980s and 1990s, if not earlier. Many of these efforts have sought to replace multiple disparate IT systems with a single system, which has typically proven to be much more technically and managerially difficult, let alone expensive, than imagined.

There isn’t one right way to look at the interactive graphs and charts we’ve crafted from the documentation we have available. We suggest you just wander through them and then follow the links to more detailed explanations as the mood strikes you. You may be surprised by how many major IT failures you have never heard about or, surprisingly, have forgotten. Let us know if you think we should add other development or operational failures in future releases, or if you have better data relating to a project’s cost or impact. We will be releasing more charts and graphs over the next few weeks that will provide other perspectives on IT failures and ooftas gleaned from the Risk Factor blog archives.


Risk Factor

IEEE Spectrum's risk analysis blog, featuring daily news, updates and analysis on computing and IT projects, software and systems failures, successes and innovations, security threats, and more.

Willie D. Jones