At 0740 on the morning of 12 April, a problem in one of the signal boxes at the New South Wales (NSW) RailCorp's Sydenham Advanced Train Running Information Control System (ATRICS) signal box complex caused a loss of control over points, routes, and controlled signals in every area the complex manages. In short, all train signals turned red on six CityRail lines into and out of Sydney's Central Business District (CBD). As a result, more than 100,000 passengers were stranded for several hours, either on trains or waiting for them.

The Sydenham signal box, after being turned off and then back on again, began to operate correctly at about 0915, but the train system wasn't completely back to normal operations until after 1600.

All in all, some 847 trains were delayed and another 240 trains were canceled during the day because of the single signal box problem. At the time, the commuting nightmare was blamed on a computer glitch.

Well, a report (PDF) that was released last week by NSW RailCorp provides more insight into the exact nature of the "glitch." Apparently, two faulty capacitors along with a poor software design were to blame.

The report states that:

"The ATRICS system is a computerised system that utilises a Local Area Network (LAN) and a Wide Area Network (WAN) to communicate between the various devices that make up the system. The network is protected from other networks in order to prevent external cyber attacks such as hacking, virus etc. The Sydenham LAN is made up from a combination of switches, routers and firewalls in order to achieve the required availability and integrity. The LAN comprises four switches which are connected in a ring topology to provide device and network fault tolerance and to be adaptive to network changes."

"At 7:36:52 on the 12th April 2011, one of the network switches that forms part of the ATRICS LAN at Sydenham (Sydhm_sw1) detected that the adjacent switch was no longer communicating. At 7:37:10 the same network switch detected that the new communicating switch had resumed communicating. This pattern repeated regularly for sw1 and for other switches connected to the network. This pattern indicates that there was not a complete failure but that one of the network switches was cycling from a failed to an operational state. As a result the Sydenham LAN became caught in a cycle where it was continually trying to reconfigure itself to address the changing state of the network."
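The pattern the report describes, a neighbor repeatedly dropping out and coming back, is usually called link flapping. As a rough illustration only, and not RailCorp's or the switch vendor's actual logic, here is a minimal Python sketch of how monitoring software might flag it: count state changes for one neighbor inside a sliding time window. The names `LinkEvent`, `transition_times`, `is_flapping`, and `neighbor_a`, the two-minute window, and the threshold of four are all assumptions made for this example. Only the first two timestamps come from the report; the later two are invented to stand in for "this pattern repeated regularly."

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LinkEvent:
    """One observation of a neighbor's state (hypothetical record format)."""
    when: datetime
    neighbor: str
    up: bool


def transition_times(events, neighbor):
    """Return the times at which `neighbor` changed state.

    The first observation counts as a change from "unknown";
    repeated reports of the same state are ignored.
    """
    times = []
    last_state = None
    relevant = sorted((e for e in events if e.neighbor == neighbor),
                      key=lambda e: e.when)
    for event in relevant:
        if event.up != last_state:
            times.append(event.when)
            last_state = event.up
    return times


def is_flapping(events, neighbor,
                window=timedelta(minutes=2), threshold=4):
    """True if `neighbor` changed state at least `threshold` times
    within any span of length `window` (both values are assumptions)."""
    times = transition_times(events, neighbor)
    start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it spans at most `window`.
        while times[end] - times[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False


# First two timestamps are from the report; the last two are invented.
first_loss = datetime(2011, 4, 12, 7, 36, 52)
events = [
    LinkEvent(first_loss, "neighbor_a", up=False),
    LinkEvent(first_loss + timedelta(seconds=18), "neighbor_a", up=True),
    LinkEvent(first_loss + timedelta(seconds=60), "neighbor_a", up=False),
    LinkEvent(first_loss + timedelta(seconds=78), "neighbor_a", up=True),
]

print(is_flapping(events, "neighbor_a"))
print(is_flapping(events[:2], "neighbor_a"))
```

With these sample events, the full list crosses the threshold while the first loss-and-recovery pair on its own does not. A real network would more likely hold a flapping port down for a while rather than rejoin it to the ring on every recovery, but whether the Sydenham equipment supported anything of the kind is not stated in the report.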

RailCorp determined that "... the trigger for the event was the failure of two electrolytic capacitors in the Sydenham LAN switch sydhm_sw2. Due to the nature of the capacitor failure, the switch partially failed which placed the whole LAN into a partially failed state."

Full recovery was only possible by turning the system off and restarting it.

In addition,

"[While] the root cause of the incident was the failure of the Sydenham LAN however; the major contributors to the duration of the outage were the inability of the ATRICS software to manage the scenario created by the failure of the LAN, and the slowness of the Sydenham LAN, which are still under investigation."

RailCorp went on to say that the incident exposed a design weakness in the Sydenham LAN and the ATRICS software, and that it should be a priority to mitigate/eliminate the risk.

The switch that failed had been used for over 8 years without this problem showing up.
