Autonomous Security Bots Seek and Destroy Software Bugs in DARPA Cyber Grand Challenge

The team behind the victorious Cyber Reasoning System will receive a US $2 million prize

A cybersecurity illustration showing a keyboard with a graphic of a bug superimposed on top.
Photo: Steven Puetzer/Getty Images

The mission: to detect and patch as many software flaws as possible. The competitors: seven dueling supercomputers about the size of large vending machines, each emblazoned with a name like Jima or Crspy, and programmed by expert hacker teams to autonomously find and fix exploitable software bugs.

These seven “Cyber Reasoning Systems” took the stage on Thursday for DARPA’s Cyber Grand Challenge at the Paris Hotel and Conference Center in Las Vegas, Nev. They were competing for a $2 million grand prize in the world’s first fully autonomous “Capture the Flag” tournament. After eight hours of grueling bot-on-bot competition, DARPA declared a system named Mayhem, built by Pittsburgh, Pa.-based ForAllSecure, the unofficial winner. The Mayhem team was led by David Brumley. Xandra, produced by team TECHx from GrammaTech and the University of Virginia, placed second to earn a $1 million prize; and Mechanical Phish, built by Shellphish, a student-led team from Santa Barbara, Calif., took third place, worth $750,000.

DARPA is verifying the results and will announce the official standings on Friday. The triumphant bot will then compete against human hackers in a “Capture the Flag” tournament at the annual DEF CON security conference. Though no one expects one of these reasoning systems to win that contest, it could find and fix some types of bugs more quickly than the human teams can.

DARPA hopes the competition will pay off by bringing researchers closer to developing software repair bots that could constantly scan systems for flaws or bugs and patch them much faster and more effectively than human teams can. DARPA says quickly fixing such flaws across billions of lines of code is critically important. It could help to harden infrastructure such as power lines and water treatment plants against cyberattacks, and to protect privacy as more personal devices come online.

But no such system has ever been available on the market. Instead, teams of security specialists constantly scan code for potential problems. On average, it takes specialists 312 days to discover a software vulnerability, and often months or years to actually fix it, according to DARPA CGC host Hakeem Oluseyi.

“A final goal of all this is scalability,” says Michael Stevenson, mission manager for the Deep Red team from Raytheon. “If [the bots] discover something in one part of the network, these are the technologies that can quickly reach out and patch that vulnerability throughout that network.” The original DARPA Grand Challenge, held in 2004, similarly jumpstarted corporate and academic interest in autonomous cars.

This visualization shows network traffic flows for the bot Rubeus as it receives verification of software bugs from competitors. Image: DARPA

The teams were not told what types of defects their systems would encounter in the finale, so their bots had to reverse engineer DARPA’s challenge software, identify potential bugs, run tests to verify those bugs, and then apply patches that wouldn’t cause the software to run slowly or shut down altogether.
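To make that bug-and-patch cycle concrete, here is a minimal C sketch of the kind of memory-safety flaw the bots hunted and the sort of small, targeted fix they tried to apply. It is an invented illustration of the bug class, not code from DARPA’s challenge set, and the function names are made up for the example.

```c
#include <stdio.h>
#include <string.h>

/* Invented illustration, not DARPA challenge code: a classic
 * stack-buffer overflow triggered by unchecked input. */
void greet_unpatched(const char *name) {
    char buf[16];
    strcpy(buf, name);                 /* bug: no bounds check on input */
    printf("hello, %s\n", buf);
}

/* The kind of small, targeted patch a Cyber Reasoning System aims for:
 * reject oversized input without changing behavior on well-formed input
 * or adding noticeable overhead. */
void greet_patched(const char *name) {
    char buf[16];
    if (strlen(name) >= sizeof buf) {
        fputs("input too long\n", stderr);
        return;
    }
    strcpy(buf, name);
    printf("hello, %s\n", buf);
}

int main(int argc, char **argv) {
    if (argc > 1)
        greet_patched(argv[1]);        /* the unpatched variant would crash on long input */
    return 0;
}
```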

To test the limits of these Cyber Reasoning Systems, DARPA planted software bugs modeled on the flaws behind famous attacks such as the Morris worm and the Heartbleed bug. Scores were based on how quickly and effectively the bots deployed patches and verified competitors’ patches, and bots lost points if their patches slowed down the software. “If you fix the bug but it takes 10 hours to run something that should have taken 5 minutes, that's not really useful,” explains Corbin Souffrant, a Raytheon cyber engineer.
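Heartbleed, for instance, came down to trusting a length field supplied by the remote peer. The C sketch below shows that general pattern and a bounds-checked fix; it is an invented illustration of the vulnerability class, not one of DARPA’s planted bugs, and the function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Heartbleed-style flaw (simplified): the code trusts a length field
 * supplied by the peer and echoes back that many bytes, leaking
 * whatever happens to sit in memory past the real payload. */
size_t echo_unpatched(const unsigned char *payload, size_t claimed_len,
                      unsigned char *out) {
    memcpy(out, payload, claimed_len);   /* bug: claimed_len is never checked */
    return claimed_len;
}

/* Patched: never copy more than the bytes actually received. */
size_t echo_patched(const unsigned char *payload, size_t claimed_len,
                    size_t actual_len, unsigned char *out) {
    size_t n = claimed_len < actual_len ? claimed_len : actual_len;
    memcpy(out, payload, n);
    return n;
}

int main(void) {
    unsigned char memory[] = "ping....SECRET-DATA-NEARBY"; /* 8-byte payload, then "secrets" */
    unsigned char out[64] = {0};

    size_t leaked = echo_unpatched(memory, 26, out);   /* peer claims 26 bytes */
    printf("unpatched echoes %zu bytes: %.26s\n", leaked, out);

    size_t safe = echo_patched(memory, 26, 8, out);    /* only 8 bytes were really sent */
    printf("patched echoes %zu bytes\n", safe);
    return 0;
}
```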

Members of the Deep Red team described how their system accomplished this in five basic steps: First, their machine (named Rubeus) used a technique called fuzzing, bombarding the program with malformed inputs until it crashed. Then, it scanned the crash results to identify potential flaws in the program’s code. Next, it verified these flaws and looked for potential patches in a database of known bugs and appropriate fixes. It chose a patch from this repository and applied it, and then analyzed the results to see whether it helped. For each patch, the system used artificial intelligence to compare its solution with the results and learn how to handle similar bugs in the future.
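A bare-bones version of that first, fuzzing step might look something like the C sketch below. It assumes a POSIX system and a hypothetical target binary that reads a file named on its command line; Rubeus’s actual pipeline is far more sophisticated, so treat this only as a sketch of the core idea.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <target-binary>\n", argv[0]);
        return 1;
    }
    srand(1234);                                   /* fixed seed for repeatability */

    for (int trial = 0; trial < 1000; trial++) {
        /* 1. Generate a random input file. */
        char input[256];
        for (size_t i = 0; i < sizeof input; i++) input[i] = (char)(rand() & 0xff);
        FILE *f = fopen("fuzz_input.bin", "wb");
        if (!f) { perror("fopen"); return 1; }
        fwrite(input, 1, sizeof input, f);
        fclose(f);

        /* 2. Run the target program on that input. */
        pid_t pid = fork();
        if (pid == 0) {
            execl(argv[1], argv[1], "fuzz_input.bin", (char *)NULL);
            _exit(127);                            /* exec failed */
        }

        /* 3. A crash (e.g., SIGSEGV) marks a potential memory-safety
         *    flaw worth triaging, verifying, and eventually patching. */
        int status = 0;
        waitpid(pid, &status, 0);
        if (WIFSIGNALED(status))
            printf("trial %d: crash with signal %d -- candidate bug\n",
                   trial, WTERMSIG(status));
    }
    return 0;
}
```

Competition-grade systems pair blind fuzzing like this with more directed analyses, such as symbolic execution, to reach code paths that purely random inputs rarely hit.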

During the live competition, some bugs proved more difficult for the machines to handle than others. Several machines found and patched an SQL Slammer-like vulnerability within 5 minutes, garnering applause. But only two teams managed to repair an imitation crackaddr bug in Sendmail. And one bot, Xandra by the TECHx team, found a bug that the organizers hadn’t even intended to create.

Human or machine, it’s always nice to see vanquished competitors exhibit good sportsmanship in the face of a loss. As the night wound down, Mechanical Phish politely congratulated Mayhem on its first-place finish over the bots’ Twitter accounts.
