This edition of IT Hiccups of the Week revisits the 911 emergency call system outages that affected all of Washington State and parts of Oregon just before midnight, 9 April 2014. As I wrote at the time, CenturyLink—a telecom provider from Louisiana that is contracted by Washington State and the three affected counties in Oregon to provide 911 communication services—blamed the outages, which lasted several hours each, on a “technical error by a third party vendor.”
CenturyLink gave few details in the aftermath of the outages other than to say that the Washington State and Oregon outages were merely an “uncanny” coincidence, and to send out the standard “sorry for the inconvenience” press release apology. The company estimated that approximately 4,500 emergency calls to 911 call centers went unanswered during the course of the Washington State outage. No details were available on how many 911 calls failed during the two-hour Oregon outage, which affected some 16,000 phone customers.
Well, 10 days ago, the U.S. Federal Communications Commission released its investigative report into the emergency system outages. It cast a much different light on the Washington State “sunny day” outage (i.e., not caused by bad weather or a natural disaster) that CenturyLink initially tried to play down. FCC Chairman Tom Wheeler even went so far as to call the report’s findings “terrifying.”
As it turns out, while the 911 system outages that hit Oregon and Washington State were indeed coincidental, they were also connected in a strange sort of way that caused a lot of confusion at the time, as we will shortly see. More importantly, the 911 outage that affected Washington State on that April night didn’t just affect that state, but also emergency calls being made in California, Florida, Minnesota, North Carolina, Pennsylvania and South Carolina. In total, some 6,600 emergency calls made over the course of six hours across the seven states went unanswered.
As the FCC report notes, because of the multi-state emergency system outage, “Over 11 million Americans … or about three and half percent of the population of the United States, were at risk of not being able to reach emergency help through 911.” Since the outage happened very late at night into the early morning and there was no severe weather in the affected regions, the emergency call volume was very low; luckily, no one died because of their inability to reach 911.
The cause of the outage, the FCC says, was a preventable “software coding error” in a 911 Emergency Call Management Center (ECMC) automated system in Englewood, Colorado, operated by Intrado, a subsidiary of West Corporation. Intrado, the FCC report states, “is a provider of 911 and emergency communications infrastructure, systems, and services to communications service providers and to state and local public safety agencies throughout the United States… Intrado provides some level of 911 function for over 3,000 of the nation’s approximately 6,000 PSAPs.”
As succinctly explained in an article in the Washington Post, “Intrado owns and operates a routing service, taking in 911 calls and directing them to the most appropriate public safety answering point, or PSAP, in industry parlance. Ordinarily, Intrado's automated system assigns a unique identifying code to each incoming call before passing it on—a method of keeping track of phone calls as they move through the system.”
“But on April 9, the software responsible for assigning the codes maxed out at a pre-set limit [at 11:54 p.m. PDT]; the counter literally stopped counting at 40 million calls. As a result, the routing system stopped accepting new calls, leading to a bottleneck and a series of cascading failures elsewhere in the 911 infrastructure,” the Post article went on to state.
All told, 81 PSAPs across the seven states were unable to receive calls; dialers to 911 heard only “fast busy” signals.
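To make the failure mode concrete, here is a minimal Python sketch of a call-ID counter with a pre-set ceiling. It is purely illustrative: neither the FCC report nor the Post article, as quoted here, describes Intrado's code, so the class name, the early-warning threshold, and the alarm wording are all my own assumptions. The point is only that a hard limit can be made to announce itself, both before and when it is reached, instead of quietly refusing new calls.

```python
class CallIdExhausted(Exception):
    """Raised when every ID below the configured limit has been handed out."""


class CallIdAllocator:
    """Hypothetical call-ID counter with a hard ceiling (not Intrado's code)."""

    def __init__(self, limit, warn_percent=90):
        self.limit = limit
        # Integer arithmetic avoids floating-point surprises at the threshold.
        self.warn_at = limit * warn_percent // 100
        self.next_id = 0
        self.alarms = []  # list of (severity, message) tuples

    def allocate(self):
        if self.next_id >= self.limit:
            # Hitting the ceiling stops call handling, so say so loudly.
            self.alarms.append(
                ("critical", "call ID limit reached; new calls are being rejected")
            )
            raise CallIdExhausted(f"limit of {self.limit} IDs reached")
        call_id = self.next_id
        self.next_id += 1
        if self.next_id == self.warn_at:
            # Early warning while there is still time to act.
            self.alarms.append(
                ("warning", f"{self.next_id} of {self.limit} call IDs used")
            )
        return call_id


# Tiny limit so the behavior is easy to see; the real limit was 40 million.
allocator = CallIdAllocator(limit=10)
issued = [allocator.allocate() for _ in range(10)]
try:
    allocator.allocate()
except CallIdExhausted as exc:
    print("rejected:", exc)
print(allocator.alarms)
```

With a limit of 10, the ninth allocation logs a warning and the eleventh attempt raises an exception and logs a critical alarm. Whether a counter like this should wrap around, be raised, or be reset is a design decision the sketch deliberately leaves open.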
When the software hit its 40 million call limit, the FCC report says, the emergency call-routing system did not send out an operator alarm for over an hour. When it finally did, the system monitoring software classified the problem as “low level”; surprisingly, it did not immediately alert anyone that emergency calls were no longer being processed.
As a result, Intrado’s emergency call management center personnel did not realize the severity of the outage, nor did they get any insight into its cause, the FCC report goes on to state. In addition, the ECMC personnel were already distracted by alarms they were receiving about the Oregon outage, which also involved CenturyLink.
Worse still, says the FCC, the low-level alarm designation not only failed to get ECMC personnel’s attention, but it also prevented an automatic rerouting of 911 emergency calls to Intrado’s ECMC facility in Miami.
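The coupling between an alarm's label and the failover decision can be sketched in a few lines of Python. Again, this is an assumption-laden illustration rather than a description of Intrado's monitoring software: the severity names, the `Alarm` fields, and both policy functions are hypothetical. The first policy fails over only when the label is severe enough, which is the behavior the FCC describes; the second, offered only as one possible alternative and not as the FCC's recommendation or Intrado's fix, also checks whether calls are actually being processed.

```python
from dataclasses import dataclass

# Hypothetical severity scale; the FCC report only mentions a "low level" alarm.
SEVERITY_RANK = {"low": 0, "major": 1, "critical": 2}


@dataclass
class Alarm:
    source: str
    severity: str
    calls_being_processed: bool


def should_fail_over(alarm, threshold="major"):
    """Label-only policy: reroute only if the alarm's severity meets the threshold."""
    return SEVERITY_RANK[alarm.severity] >= SEVERITY_RANK[threshold]


def should_fail_over_with_health_check(alarm, threshold="major"):
    """Alternative policy: also reroute whenever calls have stopped flowing."""
    return (not alarm.calls_being_processed) or should_fail_over(alarm, threshold)


# A mislabeled alarm: calls have stopped, but the severity says "low".
alarm = Alarm(source="colorado-ecmc", severity="low", calls_being_processed=False)
print(should_fail_over(alarm))                    # False
print(should_fail_over_with_health_check(alarm))  # True
```

Under the label-only policy, a single misclassification silences both the humans and the automatic rerouting at once, which is why the designation mattered so much that night.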
It wasn’t until 2:00 a.m. PDT on 10 April that ECMC personnel became aware of the outage. That, it seems, happened only because CenturyLink called to alert them that its PSAPs in Washington State were complaining of an outage. After the emergency call management center personnel received the CenturyLink call, both they and CenturyLink thought the Washington State and Oregon outages were somehow closely interconnected. It took several hours for them to realize that they were entirely separate and unrelated events, the FCC report states. Apparently, it wasn’t until several other states’ PSAPs and 911 emergency system call providers started complaining of outages that call management center personnel and CenturyLink realized the true scope of the 911 call outage, and were finally able to zero in on the cause.
Once the root cause was discovered, the Colorado-based ECMC personnel initiated a manual failover of 911 call traffic to Intrado’s ECMC Miami site at 6:00 a.m. PDT. When problems plaguing the Colorado site were fixed later that morning, traffic was rerouted back.
The FCC report states that, “What is most troubling is that this is not an isolated incident or an act of nature. So-called ‘sunny day’ outages are on the rise. That’s because, as 911 has evolved into a system that is more technologically advanced, the interaction of new [Next Generation 911 (NG911)] and old [traditional circuit-switched time division multiplexing (TDM)] systems is introducing fragility into the communications system that is more important in times of dire need.”
IEEE Spectrum published an article in March of this year that explains the evolution of 911 in the U.S. (and Europe) and provides good insights into some of the difficulties of transitioning to NG911. The FCC’s report also goes into some detail on how the transition from traditional 911 service to NG911 can create subtle problems that are difficult to unravel when a problem does occur.
According to a story at Telecompetitor.com, Rear Admiral David Simpson, chief of the FCC’s Public Safety and Homeland Security Bureau, told the FCC during a hearing into the outage that there were three additional major “sunny day” outages in 2014, though none were ever reported before this year. All three—which I believe involved outages in Hawaii, Vermont and Indiana—involved NG911 implementations or time division multiplexing–to-IP transitions, Simpson said.
The FCC report indicates that Intrado has made changes to its call routing software and monitoring systems to prevent this situation from happening again, but it also said that 911 emergency service providers need to examine their system architecture designs. The hope is that they’ll better understand how and why their systems may fail, and what can be done to keep the agencies operating when they do. In addition, the communication of outages among all the emergency service providers and PSAPs needs to be improved; the April incident highlighted how miscommunications hampered finding the extent and cause of the outage.
Finally, the five FCC Commissioners unanimously agreed that such an outage was “simply unacceptable” and that future “lapses cannot be permitted.” While no one died this time, they note that next time everyone may not be so lucky.
Contributing Editor Robert N. Charette is an acknowledged international authority on information technology and systems risk management. A self-described “risk ecologist,” he is interested in the intersections of business, political, technological, and societal risks. Along with being editor for IEEE Spectrum’s Risk Factor blog, Charette is an award-winning author of multiple books and numerous articles on the subjects of risk management, project and program management, innovation, and entrepreneurship. A Life Senior Member of the IEEE, Charette was a recipient of the IEEE Computer Society’s Golden Core Award in 2008.