On the morning of 12 April at 0740, a problem within New South Wales (NSW) RailCorp's Sydenham Advanced Train Running Information Control System (ATRICS) signal box complex caused the loss of control of points, routes and controlled signals in all areas under the complex's control. In short, all train signals turned red on six CityRail lines into and out of Sydney's Central Business District (CBD). As a result, more than 100,000 passengers were stranded for several hours, either on trains or waiting for them.
The Sydenham signal box, after being turned off and then back on again, began to operate correctly at about 0915, but the train system wasn't completely back to normal operations until after 1600.
All in all, some 847 trains were delayed and another 240 trains were canceled during the day because of the single signal box problem. At the time, the commuting nightmare was blamed on a computer glitch.
Well, a report (PDF) that was released last week by NSW RailCorp provides more insight into the exact nature of the "glitch." Apparently, two faulty capacitors along with a poor software design were to blame.
The report states:
"The ATRICS system is a computerised system that utilises a Local Area Network (LAN) and a Wide Area Network (WAN) to communicate between the various devices that make up the system. The network is protected from other networks in order to prevent external cyber attacks such as hacking, virus etc. The Sydenham LAN is made up from a combination of switches, routers and firewalls in order to achieve the required availability and integrity. The LAN comprises four switches which are connected in a ring topology to provide device and network fault tolerance and to be adaptive to network changes."
"At 7:36:52 on the 12th April 2011, one of the network switches that forms part of the ATRICS LAN at Sydenham (Sydhm_sw1) detected that the adjacent switch was no longer communicating. At 7:37:10 the same network switch detected that the non-communicating switch had resumed communicating. This pattern repeated regularly for sw1 and for other switches connected to the network. This pattern indicates that there was not a complete failure but that one of the network switches was cycling from a failed to an operational state. As a result the Sydenham LAN became caught in a cycle where it was continually trying to reconfigure itself to address the changing state of the network."
RailCorp determined that "... the trigger for the event was the failure of two electrolytic capacitors in the Sydenham LAN switch sydhm_sw2. Due to the nature of the capacitor failure, the switch partially failed which placed the whole LAN into a partially failed state."
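The pattern the report describes, a switch repeatedly dropping out and coming back, is often called "flapping." As a rough illustration of how such a pattern might be spotted from a log of link-state observations, here is a minimal Python sketch. It is not RailCorp's method or anything taken from the report: the `FlapDetector` class, its thresholds, and every timestamp other than 7:36:52 and 7:37:10 are assumptions made up for the example.

```python
from collections import deque
from datetime import datetime, timedelta


class FlapDetector:
    """Hypothetical helper: flags a link as flapping when it changes
    state too many times within a sliding time window."""

    def __init__(self, max_transitions=3, window=timedelta(minutes=5)):
        self.max_transitions = max_transitions  # assumed threshold
        self.window = window                    # assumed window length
        self._transitions = deque()             # times of recent state changes
        self._last_state = None                 # None until first observation

    def observe(self, when, is_up):
        """Record one observation; return True if the link looks like it is flapping."""
        # Count a transition only when the state differs from the previous one.
        if self._last_state is not None and is_up != self._last_state:
            self._transitions.append(when)
        self._last_state = is_up
        # Forget transitions that have fallen out of the window.
        while self._transitions and when - self._transitions[0] > self.window:
            self._transitions.popleft()
        return len(self._transitions) >= self.max_transitions


def demo():
    """Replay the two timestamps quoted in the report plus two invented ones."""
    detector = FlapDetector()
    day = datetime(2011, 4, 12)
    observations = [
        (day.replace(hour=7, minute=36, second=0), True),    # invented: link healthy
        (day.replace(hour=7, minute=36, second=52), False),  # from the report
        (day.replace(hour=7, minute=37, second=10), True),   # from the report
        (day.replace(hour=7, minute=37, second=40), False),  # invented: drops again
    ]
    return [detector.observe(when, is_up) for when, is_up in observations]


if __name__ == "__main__":
    print(demo())
```

A detector along these lines only raises a flag; what the network should then do (for example, holding the unstable switch out of the ring rather than reconfiguring on every change) is a separate design decision that the report excerpts here do not address.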
Full recovery was only possible by turning the system off and restarting it.
The report adds that "[while] the root cause of the incident was the failure of the Sydenham LAN ... the major contributors to the duration of the outage were the inability of the ATRICS software to manage the scenario created by the failure of the LAN, and the slowness of the Sydenham LAN, which are still under investigation."
RailCorp went on to say that the incident exposed a design weakness in the Sydenham LAN and the ATRICS software, and that mitigating or eliminating the risk should be a priority.
The switch that failed had been in service for more than eight years without this problem showing up.
Robert N. Charette is a Contributing Editor to IEEE Spectrum and an acknowledged international authority on information technology and systems risk management. A self-described “risk ecologist,” he is interested in the intersections of business, political, technological, and societal risks. Charette is an award-winning author of multiple books and numerous articles on the subjects of risk management, project and program management, innovation, and entrepreneurship. A Life Senior Member of the IEEE, Charette was a recipient of the IEEE Computer Society’s Golden Core Award in 2008.