There was a small story in ComputerWorld over the weekend reporting that a card went bad at about 1515 EDT Friday in a Verizon peering router in New York City, one that handles traffic for Verizon's DSL and FiOS services between its network and the Internet. This caused an outage in some areas of New York and the Northeast US that lasted about 40 minutes before it was fixed.
What caught my eye was a paragraph in the ComputerWorld story pointing to a Twitter post by Verizon Senior Vice President Eric Rabe, in which he wrote:
"When routers have problems, they are designed to report that they are sick. Internet traffic is rerouted to adjacent routers automatically and sent around the trouble spot. In this case, that didn't happen. The router went into a hung state and did not appear to the rest of the network as though it was having problems.
That meant that some user traffic from the northeast continued to flow to the stalled router, but couldn't be processed. Presto - an outage for those users."
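The failure mode Rabe describes, a device that is down but never says so, is one reason networks often pair self-reported status with an independent liveness check, in the spirit of the keepalive and hold-timer mechanisms used by routing protocols. The sketch below is only an illustration of that idea and says nothing about how Verizon's network is actually built; the `Neighbor` class, the router names, and the 3-second timeout are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Neighbor:
    """A hypothetical peer router as seen by an adjacent node."""
    name: str
    last_heartbeat: float              # time (seconds) the last heartbeat arrived
    self_reported_healthy: bool = True  # what the router claims about itself


def is_usable(neighbor: Neighbor, now: float, timeout: float = 3.0) -> bool:
    """Trust a neighbor only if it claims to be healthy AND has been heard
    from recently. A hung router never reports a fault, but it also stops
    sending heartbeats, so it fails the second check."""
    if not neighbor.self_reported_healthy:
        return False
    return (now - neighbor.last_heartbeat) <= timeout


def pick_next_hop(neighbors: List[Neighbor], now: float,
                  timeout: float = 3.0) -> Optional[str]:
    """Return the name of the first usable neighbor, or None if none qualify."""
    for neighbor in neighbors:
        if is_usable(neighbor, now, timeout):
            return neighbor.name
    return None


# Illustrative scenario: the first router is hung. It still claims to be
# healthy, but its last heartbeat is 10 seconds old.
neighbors = [
    Neighbor("router-a", last_heartbeat=100.0),
    Neighbor("router-b", last_heartbeat=109.5),
]
chosen = pick_next_hop(neighbors, now=110.0)
print(chosen)
```

In this scenario traffic is steered to `router-b`, because `router-a` has gone silent for longer than the timeout even though it never reported a problem. The design choice is simply that silence counts as failure, rather than waiting for the sick device to announce itself.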
An analyst quoted in the ComputerWorld story said that it was "highly unusual" that one router could take down large portions of a network. He pointed out that outages of this kind happen more often when someone accidentally cuts a cable, as has occurred twice in the past 18 months in Australia.
Reading about the router problem also reminded me a bit of the DC Metro crash. There, the US National Transportation Safety Board (NTSB) discovered that a spurious signal generated by a track circuit module transmitter mimicked a valid signal and bypassed the rails via an unintended signal path. The module receiver sensed the spurious signal, and as a result a train was not detected when it stopped in the track circuit where the accident occurred.
Unfortunately here, "presto" resulted in a crash.
Expecting the unexpected is a fundamental engineering principle, yet it is one we seem to need constant reminding of.
Or as the Greek philosopher Heraclitus wrote some 2,500 years ago, "If you do not expect the unexpected, you will not find it; for it is hard to be sought out, and difficult."
Robert N. Charette is a Contributing Editor to IEEE Spectrum and an acknowledged international authority on information technology and systems risk management. A self-described “risk ecologist,” he is interested in the intersections of business, political, technological, and societal risks. Charette is an award-winning author of multiple books and numerous articles on the subjects of risk management, project and program management, innovation, and entrepreneurship. A Life Senior Member of the IEEE, Charette was a recipient of the IEEE Computer Society’s Golden Core Award in 2008.