Most software bugs won’t kill you. A possibly lethal exception could be the error that leads a self-driving car’s AI to make the wrong decision at the wrong time. That is why researchers developed a bug-hunting method that can systematically expose bad decision-making by the deep learning algorithms deployed in online services and autonomous vehicles.
The new DeepXplore method uses at least three neural networks, the basic architecture of deep learning algorithms, to act as “cross-referencing oracles” in checking each other’s accuracy. Researchers at Columbia University and Lehigh University designed DeepXplore to solve an optimization problem that balances two objectives: maximizing the number of neurons activated within neural networks, and triggering as many conflicting decisions as possible among different neural networks. By assuming that the majority of neural networks will generally make the right decision, DeepXplore automatically retrains the neural network that made the lone dissenting decision to follow the example of the majority in a given scenario.
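The majority-rules idea can be illustrated with a few lines of Python. This is only a minimal sketch, not the researchers' code: the function name `find_lone_dissenter` and the string labels are hypothetical, and the retraining step itself is left out.

```python
from collections import Counter


def find_lone_dissenter(predictions):
    """Return (majority_label, dissenter_index) when exactly one model disagrees.

    `predictions` holds one label per model for the same input.
    Returns None when all models agree or when there is no single dissenter.
    (Hypothetical helper for illustration only.)
    """
    if len(predictions) < 3:
        raise ValueError("need at least three models to form a majority")
    counts = Counter(predictions)
    majority_label, majority_count = counts.most_common(1)[0]
    # Exactly one model must fall outside the majority.
    if majority_count != len(predictions) - 1:
        return None
    dissenter = next(i for i, p in enumerate(predictions) if p != majority_label)
    return majority_label, dissenter


print(find_lone_dissenter(["car", "car", "face"]))  # ('car', 2)
print(find_lone_dissenter(["car", "car", "car"]))   # None
```

In this toy setup, the input that exposed the disagreement would be paired with the majority label and handed to the dissenting network as a new training example.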
“This is a differential testing framework that can find thousands of errors in self-driving systems and in similar neural network systems,” says Yinzhi Cao, assistant professor of computer science at Lehigh University in Bethlehem, Pa.
Cao and his colleagues on the DeepXplore team recently won best paper after presenting their research at the 2017 Symposium on Operating Systems Principles (SOSP) held in Shanghai, China, from 28-31 Oct. Their win may signal a growing recognition of the need for debugging tools such as DeepXplore in deep learning AI.
Typically, deep learning algorithms become better at certain tasks by filtering huge amounts of training data that humans have labeled with the correct answers. That has enabled such algorithms to achieve accuracies of well over 90 percent on certain test datasets that involve tasks such as identifying the correct human faces in Facebook photos or choosing the correct phrase in a Google translation between, say, Chinese and English. In these cases, it’s not the end of the world if a friend occasionally gets misidentified or if a certain esoteric phrase gets translated incorrectly.
This example from DeepXplore shows an error found in Nvidia's DAVE-2 self-driving car software: given a darker version of the same image, the software would steer the car into a guardrail. Images: Columbia University/Lehigh University/ACM
But the consequences of mistakes rise sharply once tech companies begin using deep learning algorithms in applications where a two-ton machine is moving at highway speeds. A wrong decision by a self-driving AI could lead to the car crashing into a guardrail, colliding with another vehicle or running down pedestrians and cyclists. Government regulators will want to know for sure that self-driving cars can meet a certain safety level, and random test datasets may not uncover all those rare “corner cases” that could lead an algorithm to make a catastrophic mistake.
“I think this push toward secure and reliable AI kind of fits in nicely with explainable AI,” says Suman Jana, an assistant professor of computer science at Columbia University in New York City. “Transparency, explanation and robustness all have to be improved a lot in machine learning systems before these systems can start working together with human beings or start running on roads.”
Jana and Cao come from a group of researchers who share backgrounds in software security and debugging. In their world, even software that is 99 percent error-free could still be vulnerable if malicious hackers can exploit that one lone bug in the system. That has made them far less tolerant of errors than many deep learning researchers who see mistakes as a natural part of the training process. It also made them fairly ideal candidates to figure out a new and more comprehensive approach for debugging deep learning.
Until now, debugging of the neural networks in self-driving cars has involved fairly tedious or random methods. One random testing approach involves human researchers manually creating test images and feeding those into the networks until they trigger a wrong decision. A second approach, called adversarial testing, can automatically create a sequence of test images by slightly tweaking one particular image until it trips up the neural network.
DeepXplore took a different approach by automatically creating test images most likely to cause three or more neural networks to make conflicting decisions. For example, DeepXplore might look for just the right amount of lighting in a given image that could lead two neural networks to identify a vehicle as a car while a third neural network identifies it as a face.
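The lighting example can be mimicked with a toy experiment. The sketch below is an illustration under loud assumptions: the three "models" are hypothetical brightness-threshold classifiers rather than real neural networks, and a plain scan over brightness shifts stands in for the optimization that DeepXplore actually solves.

```python
import numpy as np


def make_toy_model(threshold):
    """Hypothetical stand-in for a trained network: labels by mean brightness."""
    def model(image):
        return "car" if image.mean() >= threshold else "face"
    return model


def find_disagreement(models, image, deltas):
    """Return the first brightness shift on which the models' labels differ.

    Each shift in `deltas` is added to every pixel, and the result is
    clipped to the valid [0, 1] range. Returns None if no shift works.
    """
    for delta in deltas:
        shifted = np.clip(image + delta, 0.0, 1.0)
        labels = [m(shifted) for m in models]
        if len(set(labels)) > 1:
            return delta, labels
    return None


# Two models share a decision boundary; the third is slightly different.
models = [make_toy_model(0.30), make_toy_model(0.30), make_toy_model(0.45)]
image = np.full((4, 4), 0.5)  # uniform gray test image
print(find_disagreement(models, image, [0.0, -0.1, -0.2, -0.3]))
# (-0.1, ['car', 'car', 'face'])
```

The original image fools no one, but darkening it by 0.1 is enough to make the third model break from the other two, which is the kind of difference-inducing input the method is looking for.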
At the same time, DeepXplore also aimed to maximize neuron coverage in its testing by activating the maximum number of neurons and different neural network pathways. Such neuron coverage is based on a similar concept in traditional software testing called code coverage, Cao explains. This process was able to activate 100 percent of network neurons, or about 30 percent more on average than either the random or adversarial testing methods previously used in deep learning algorithms.
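By analogy with code coverage, neuron coverage can be thought of as the fraction of neurons that fire for at least one test input. The following is a minimal sketch of that idea, assuming the activations have already been recorded in an array; the function name, the array layout, and the activation threshold are illustrative choices, not details from the researchers.

```python
import numpy as np


def neuron_coverage(activations, threshold=0.0):
    """Fraction of neurons activated above `threshold` by at least one input.

    `activations` has shape (num_inputs, num_neurons): one row per test
    input, one column per neuron. (Hypothetical layout for illustration.)
    """
    activations = np.asarray(activations, dtype=float)
    if activations.ndim != 2:
        raise ValueError("expected a 2-D array of shape (num_inputs, num_neurons)")
    # A neuron counts as covered if any test input pushes it past the threshold.
    covered = (activations > threshold).any(axis=0)
    return float(covered.mean())


two_inputs = [[0.9, 0.0, 0.0, 0.0],
              [0.0, 0.2, 0.0, 0.0]]
print(neuron_coverage(two_inputs))  # 0.5

three_inputs = two_inputs + [[0.0, 0.0, 0.7, 0.1]]
print(neuron_coverage(three_inputs))  # 1.0
```

Adding a test input that lights up previously silent neurons raises the score, just as a new unit test that exercises an untouched branch raises code coverage.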
Testing with 15 state-of-the-art neural networks looking at five different public datasets showed how DeepXplore could find thousands of previously undiscovered errors in a wide variety of deep learning applications. The test datasets included scenarios for self-driving car AI, automatic object recognition in online images, and automatic detection of malware masquerading as ordinary software.
DeepXplore cannot yet guarantee that it has found every single possible bug in a system, but it’s far more comprehensive in testing large-scale neural networks than previous methods, Jana says. By comparison, a Stanford University team has taken an almost opposite approach by showing how to guarantee a small cluster of neurons is free of errors. Neither is a complete solution, but both represent promising and crucial steps toward debugging the future of deep learning.
Jeremy Hsu has been working as a science and technology journalist in New York City since 2008. He has written on subjects as diverse as supercomputing and wearable electronics for IEEE Spectrum. When he’s not trying to wrap his head around the latest quantum computing news for Spectrum, he also contributes to a variety of publications such as Scientific American, Discover, Popular Science, and others. He is a graduate of New York University’s Science, Health & Environmental Reporting Program.