Root Cause of BART Computer Glitch May Take Weeks to Find

Monday evening, at about 7:30 local time, the entire Bay Area Rapid Transit (BART) system came to a halt for several hours due to a failure of two computers that controllers need to monitor the rail system. According to a story in the San Jose Mercury News yesterday, the failure meant that 28 BART trains had to be sent to the nearest station, where passengers were off-loaded.

A story on KTVU Channel 2 in San Francisco reported that the problem has been traced to a failed network router that for some unknown reason did not communicate its status to another router that should have taken over for it. The failure of the expected smooth cutover kept accurate train status information from reaching BART's Operations Control Center.
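The failure mode described above, a standby that waits for the failing device to report its own status, is often avoided by having the standby watch for periodic heartbeats and take over when they stop. The following is a minimal sketch of that general idea, not a description of BART's actual equipment or configuration; the `StandbyRouter` class, its field names, and the three-second timeout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StandbyRouter:
    """Hypothetical standby that watches a primary's heartbeats."""
    timeout_s: float          # silence longer than this triggers takeover
    last_heartbeat_s: float   # time of the most recent heartbeat seen
    active: bool = False      # True once the standby has taken over

    def on_heartbeat(self, now_s: float) -> None:
        # The primary is alive; record when we last heard from it.
        self.last_heartbeat_s = now_s

    def check(self, now_s: float) -> bool:
        # Take over if the primary has been silent too long, whether or
        # not it ever announced that it was failing.
        if not self.active and now_s - self.last_heartbeat_s > self.timeout_s:
            self.active = True
        return self.active


# Illustrative timeline (times in seconds; values are made up).
standby = StandbyRouter(timeout_s=3.0, last_heartbeat_s=0.0)
standby.on_heartbeat(1.0)
healthy = standby.check(2.0)     # 1 s of silence: primary still trusted
taken_over = standby.check(5.5)  # 4.5 s of silence: standby takes over
```

In this sketch the takeover decision depends only on the absence of heartbeats, so a primary that fails silently is still detected.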

The monitoring system was rebooted at 9:50 local time, and BART was able to return to normal operations by 11:15 Monday night.

KTVU reported that BART officials know neither why the network router failed nor why it failed to communicate its inoperable status as required. The officials say it may be weeks before they understand the root cause of the problem. Until then, KTVU reports, a BART "staff member will be on duty to monitor the data intake during all of BART's operating hours until the cause has been pinpointed."

A BART spokesperson apologized profusely to the thousands of stranded passengers for the outage. A San Francisco Chronicle story quoted him as saying:

"We pride ourselves on our 95 percent on-time service. Yesterday was miserable—completely and utterly embarrassing. I want to apologize profusely to our customers. This was not BART."

The Chronicle story also said that computer technicians attempted to reboot the router that failed along with the one it is supposed to communicate with, "...using the usual process of restarting both simultaneously, but their efforts failed repeatedly. Finally, they decided to take one router out of service and were able to reset the other."

BART has been famous since its inception for its "ghost trains" that still occasionally appear and cause system delays.

There have been a couple of other computer-related outages in rail systems lately. In late June, a computer problem caused chaos on the Greater Manchester Metrolink light rail network in the UK, while another caused a series of power problems for Amtrak trains in the New York City region. A signal system design flaw has also been partly blamed for last month's bullet train crash in China.

Risk Factor

IEEE Spectrum's risk analysis blog, featuring daily news, updates and analysis on computing and IT projects, software and systems failures, successes and innovations, security threats, and more.

Contributor
Willie D. Jones
 
