Monday evening, at about 7:30 local time, the entire Bay Area Rapid Transit (BART) system came to a halt for several hours due to a failure of two computers that controllers need to monitor the rail system. According to a story in the San Jose Mercury News yesterday, the failure meant that 28 BART trains had to be sent to the nearest station, where passengers were off-loaded.
A story on KTVU Channel 2 in San Francisco reported that the problem has been traced to a failed network router that for some unknown reason did not communicate its status to another router that should have taken over for it. The failure of the expected smooth cutover kept accurate train status information from reaching BART's Operations Control Center.
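The reports do not describe how BART's routers are supposed to hand off to each other, so the following is only a minimal sketch of one common approach, not BART's design. It assumes the backup listens for periodic heartbeats and promotes itself when they stop, rather than waiting for the failed unit to announce its own failure. The class name `StandbyRouter`, the method names, and the 3-second timeout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StandbyRouter:
    """Hypothetical standby that promotes itself when the primary goes quiet."""

    timeout_s: float               # assumed silence threshold, not a BART value
    last_heartbeat_s: float = 0.0  # time the primary was last heard from
    active: bool = False           # True once the standby has taken over

    def on_heartbeat(self, now_s: float) -> None:
        # Record the time of the most recent heartbeat from the primary.
        self.last_heartbeat_s = now_s

    def check(self, now_s: float) -> bool:
        # Take over if the primary has been silent longer than the timeout.
        if not self.active and now_s - self.last_heartbeat_s > self.timeout_s:
            self.active = True
        return self.active


standby = StandbyRouter(timeout_s=3.0)
standby.on_heartbeat(now_s=10.0)
print(standby.check(now_s=12.0))  # primary heard from 2 s ago: stays on standby
print(standby.check(now_s=14.5))  # primary silent for 4.5 s: standby takes over
```

In a scheme like this the takeover depends on silence being detected, not on the failing device reporting its own status, which is the step that reportedly did not happen on Monday.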
The monitoring system was rebooted at 9:50 local time, and BART was able to return to normal operations by 11:15 Monday night.
KTVU reported that BART officials do not know why the network router failed or why it did not communicate its inoperable status as required. The officials say it may be weeks before they understand the root cause of the problem. In the meantime, KTVU reported, a BART "staff member will be on duty to monitor the data intake during all of BART's operating hours until the cause has been pinpointed."
According to a Chronicle story, a BART representative said, "We pride ourselves on our 95 percent on-time service. Yesterday was miserable, completely and utterly embarrassing. I want to apologize profusely to our customers. This was not BART."
The Chronicle story also said that computer technicians attempted to reboot the router that failed along with the one it is supposed to communicate with, "...using the usual process of restarting both simultaneously, but their efforts failed repeatedly. Finally, they decided to take one router out of service and were able to reset the other."
There have been a couple of other computer-related outages in rail systems lately. In late June, a computer problem caused chaos on the Greater Manchester Metrolink light rail network in the UK, while another caused a series of power problems for Amtrak trains in the New York City region. A signal system design flaw has also been partially blamed for last month's bullet train crash in China.
Robert N. Charette is a Contributing Editor to IEEE Spectrum and an acknowledged international authority on information technology and systems risk management. A self-described “risk ecologist,” he is interested in the intersections of business, political, technological, and societal risks. Charette is an award-winning author of multiple books and numerous articles on the subjects of risk management, project and program management, innovation, and entrepreneurship. A Life Senior Member of the IEEE, Charette was a recipient of the IEEE Computer Society’s Golden Core Award in 2008.