DeepMind Deploys Self-taught Agents To Beat Humans at Quake III

Without instructions, software agents learn how to crush human players at “Capture the Flag” in Quake III Arena

Illustration of agents playing Capture the Flag, showing a range of behaviors
Image: DeepMind

Chess and Go were originally developed to mimic warfare, but they do a bad job of it. War and most other competitions generally involve more than one opponent and more than one ally, and the play typically unfolds not on an orderly, flat matrix but in a variety of landscapes built up in three dimensions.

That’s why Alphabet’s DeepMind, having crushed chess and Go, has now tackled the far harder challenge posed by the three-dimensional, multiplayer, first-person video game. Writing today in Science, lead author Max Jaderberg and 17 DeepMind colleagues describe how a totally unsupervised program of self-learning allowed software to exceed human performance in playing “Quake III Arena.” The experiment involved a version of the game that requires each of two teams to capture as many of the other team’s flags as possible.

The teams begin at base camps set at opposite ends of a map, which is generated at random before each round. Players roam about, interacting with buildings, trees, hallways and other features on the map, as well as with allies and opponents. They try to use their laser-like weapons to “tag” members of the opposing team; a tagged player must drop any flag he might have been carrying on the spot and return to his team’s base.

DeepMind represents each player with a software agent that sees the same screen a human player would see. The agents have no way of knowing what other agents are seeing; again, this is a much closer approximation of real strategic contests than most board games provide. Each agent begins by making choices at random, but as evidence accumulates over successive iterations of the game, a process called reinforcement learning uses it to adjust the agent’s choices. As a result, the agent’s behavior converges on a purposeful pattern, called a “policy.”

Each agent develops its policy on its own, which means it can specialize a bit. However, there’s a limit: After every 1000 iterations of play the system compares policies and estimates how well the entire team would do if it were to mimic this or that agent. If one agent’s winning chances turn out to be less than 70 percent as high as another’s, the weaker agent copies the stronger one. Meanwhile, the reinforcement learning is itself tweaked by comparing it to other metrics. Such tweaking of the tweaker is known as meta-optimization.
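The copying rule can be made concrete with a short sketch. This is an illustration only, not DeepMind's code: the function name, the agent names, and the choice to compare every agent against the single strongest one are assumptions made here for simplicity. Only the 70 percent threshold comes from the description above.

```python
def select_copies(win_chance, ratio=0.7):
    """Decide which agents should copy the strongest agent's policy.

    win_chance maps a (hypothetical) agent name to its estimated chance
    of winning. An agent copies the strongest one when its own chance is
    less than `ratio` times as high. Comparing against the single best
    agent is a simplifying assumption of this sketch.
    """
    best = max(win_chance, key=win_chance.get)
    return {
        agent: best
        for agent, chance in win_chance.items()
        if agent != best and chance < ratio * win_chance[best]
    }


# Hypothetical estimates for three agents.
estimates = {"agent_a": 0.6, "agent_b": 0.5, "agent_c": 0.3}
copies = select_copies(estimates)
```

With these made-up numbers, agent_c falls below 70 percent of agent_a's estimate and would copy it, while agent_b is close enough to keep its own policy.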

Agents start out as blank slates, but they do have one feature built into their way of evaluating things. It’s called a multi–time scale recurrent neural network with external memory, and it keeps an eye not only on the score at the end of the game but also at earlier points. The researchers note that “Reward purely based on game outcome, such as win/draw/loss [is] very sparse and delayed, resulting in no learning. Hence, we obtain more frequent rewards by considering the game points stream.”
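The difference between the two reward schemes is easy to see in a toy example. The sketch below is hypothetical throughout: the event names and point values are invented for illustration and are not taken from the paper. It contrasts a reward that arrives only with the final result against one built from a stream of in-game points.

```python
# Hypothetical point values for in-game events (not from the paper).
POINT_VALUES = {"flag_pickup": 0.5, "flag_capture": 1.0, "tag_opponent": 0.2}


def outcome_rewards(n_steps, result):
    """Sparse scheme: zero reward until the last step, which carries the result."""
    return [0.0] * (n_steps - 1) + [result]


def stream_rewards(n_steps, events):
    """Denser scheme: reward arrives at every step where a scoring event occurs.

    events is a list of (step_index, event_name) pairs.
    """
    rewards = [0.0] * n_steps
    for step, name in events:
        rewards[step] += POINT_VALUES[name]
    return rewards


sparse = outcome_rewards(5, 1.0)
dense = stream_rewards(5, [(1, "flag_pickup"), (3, "flag_capture")])
```

In the sparse list the learner gets feedback only at the final step; in the dense list it gets a signal each time something scoring happens, which is the kind of more frequent reward the researchers describe.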

The program generally beats human players when starting from a randomly generated position. Even after the humans had practiced for a total of 12 hours, they still won just 25 percent of the games, drew 6 percent, and lost the remaining 69 percent.

However, when two professional game testers were given a particularly complex map that had not been used in training and were allowed to play games on that map against two software agents, the pros needed just 6 hours of training to come out on top. This result was not described in the Science paper but in a supplementary document made available to the press. The pros used their in-depth study of the map to identify the routes that the agents preferred and to work out how to avoid those routes.

So for the time being people can still beat software in a well-studied set-piece battle. Of course, real life rarely provides such opportunities. Robert E. Lee got to fight the Battle of Gettysburg just one time.


Will AI Steal Submarines’ Stealth?

Better detection will make the oceans transparent—and perhaps doom mutually assured destruction

A photo of a submarine in the water under a partly cloudy sky.

The Virginia-class fast attack submarine USS Virginia cruises through the Mediterranean in 2010. Back then, it could effectively disappear just by diving.

U.S. Navy

Submarines are valued primarily for their ability to hide. The assurance that submarines would likely survive the first missile strike in a nuclear war and thus be able to respond by launching missiles in a second strike is key to the strategy of deterrence known as mutually assured destruction. Any new technology that might render the oceans effectively transparent, making it trivial to spot lurking submarines, could thus undermine the peace of the world. For nearly a century, naval engineers have striven to develop ever-faster, ever-quieter submarines. But they have worked just as hard at advancing a wide array of radar, sonar, and other technologies designed to detect, target, and eliminate enemy submarines.

The balance seemed to turn with the emergence of nuclear-powered submarines in the early 1960s. In a 2015 study for the Center for Strategic and Budgetary Assessments, Bryan Clark, a naval specialist now at the Hudson Institute, noted that the ability of these boats to remain submerged for long periods of time made them “nearly impossible to find with radar and active sonar.” But even these stealthy submarines produce subtle, very-low-frequency noises that can be picked up from far away by networks of acoustic hydrophone arrays mounted to the seafloor.
