System Ingests AT&T Network Logs to Reveal Root Cause of Errors

By analyzing millions of error messages in AT&T’s network data, researchers developed an algorithm that could help carriers detect problems faster

Behind the easy connectivity that much of the world enjoys, commercial networks are hard at work establishing connections, authenticating users, and verifying services. When an error occurs, it can be hard for providers to pinpoint the root cause because an error message may be generated in a different spot within a network than the place where the actual error happened.

To home in on the source of such errors, researchers have analyzed error logs related to millions of messages exchanged through AT&T’s network. The group’s aim was to learn about latent events in particular. Latency errors may cause delays in call propagation and transmission, disconnection issues, and network bottlenecks. Each error event can produce a sequence of messages whose type and frequency could vary based on the latency between the various network elements, network load, and other events.

“We have come up with a set of algorithms that can group the raw error data into events described by important keywords,” says Siddhartha Satpathi, a PhD candidate in electrical engineering at the University of Illinois at Urbana-Champaign. “We are not identifying the cause of the events, we are simply separating the messages into groups, where each group consists of messages generated by a single event. Additionally, we identify the key messages which are associated with each event.” A network operator can then use these groupings to identify the root cause.

In a real network, Satpathi explains, errors that come from different geographical locations could be related to one another, and sometimes one physical error leads to thousands of error messages. He uses the example of Alice, an Illinois resident visiting California, who places a phone call to Bob in New York. Before connecting the call, the base station close to Alice in California needs to verify her credentials, which are stored at her home station in Illinois.

Once that’s done, the call is routed through the network from California to New York. If a router breaks down somewhere along that network, it would result in error reports from all the connected networks and locations (California, New York, and Illinois). This group of error messages in the error log is what the researchers called an “event.”

That’s where the new algorithm comes in. The size of the error logs makes it impossible for a human engineer to go through the messages and figure out which ones were caused by the same event.

“Our algorithm groups these messages into few important events,” says Satpathi. “It also outputs some frequently occurring messages in these discovered events. This grouping of messages make the message log human interpretable, and can help an engineer decipher the root cause of the error.” The group recently published its work on network message logs in the journal IEEE/ACM Transactions on Networking.

In their research, Satpathi’s team considered a dataset comprising 97 million messages, of 39,330 types, sent over 15 days. These included syslog texts (raw-text messages that software associated with a specific network element, such as a server, relay, or base station, sends to a logging server, and which include a timestamp and message text describing the error) and alarms (which indicate specific fault conditions in a network element). The researchers then applied to this data a two-stage algorithm called Change-point Detection–Latent Dirichlet Allocation (CD-LDA), which uses the existing LDA algorithm as a subroutine.
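To make the two-stage idea concrete, here is a minimal Python sketch of what a change-point stage could look like on a toy log. It is an illustration under stated assumptions, not the researchers’ implementation: the message-type names, the sliding-window comparison, the `window` and `threshold` values, and every function name are hypothetical. The second stage is only a stand-in, too. Where CD-LDA would run LDA over the resulting episodes, this sketch simply counts the most frequent message types in each episode.

```python
from collections import Counter


def tv_distance(left, right):
    """Total variation distance between the message-type mixes of two windows."""
    lc, rc = Counter(left), Counter(right)
    types = set(lc) | set(rc)
    return 0.5 * sum(abs(lc[t] / len(left) - rc[t] / len(right)) for t in types)


def find_change_points(types, window=3, threshold=0.8):
    """Return indices where the mix of message types shifts sharply.

    Assumption: a simple sliding-window comparison stands in for the
    change-point detector; the article does not describe the real one.
    """
    scores = {
        i: tv_distance(types[i - window:i], types[i:i + window])
        for i in range(window, len(types) - window + 1)
    }
    # Take the strongest candidates first and skip any that sit too close
    # to a change point that has already been accepted.
    candidates = sorted(
        (i for i, s in scores.items() if s >= threshold),
        key=lambda i: -scores[i],
    )
    chosen = []
    for i in candidates:
        if all(abs(i - c) >= window for c in chosen):
            chosen.append(i)
    return sorted(chosen)


def split_episodes(types, change_points):
    """Cut the time-ordered log into episodes at the detected change points."""
    bounds = [0, *change_points, len(types)]
    return [types[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical time-ordered message types; real logs hold tens of thousands of types.
log = (
    ["auth_timeout", "auth_retry"] * 3
    + ["link_down", "route_flap", "link_down", "route_flap", "link_down", "link_down"]
)

change_points = find_change_points(log)
episodes = split_episodes(log, change_points)

# Stand-in for stage two: frequency counts, NOT LDA.
for number, episode in enumerate(episodes, start=1):
    print(number, Counter(episode).most_common(2))
```

On this toy log the sketch cuts the sequence where authentication messages give way to link messages, so each episode holds messages that plausibly share one cause. A real pipeline would treat each episode as a document and hand the collection to an LDA implementation to recover events and their key messages.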

The six hours that it took to run LDA on this dataset could be reduced, Satpathi says, by using faster versions of the LDA algorithm. This makes the study “very scalable,” he adds, for detecting errors on a commercial network.