DeepMind Deploys Self-taught Agents To Beat Humans at Quake III

Chess and Go were originally developed to mimic warfare, but they do a bad job of it. War and most other competitions generally involve more than one opponent and more than one ally, and the play typically unfolds not on an orderly, flat matrix but in a variety of landscapes built up in three dimensions.

That’s why Alphabet’s DeepMind, having crushed chess and Go, has now tackled the far harder challenge posed by the three-dimensional, multiplayer, first-person video game. Writing today in Science, lead author Max Jaderberg and 17 DeepMind colleagues describe how a totally unsupervised program of self-learning allowed software to exceed human performance in playing “Quake III Arena.” The experiment involved a version of the game that requires each of two teams to capture as many of the other teams’ flags as possible.

The teams begin at base camps set at opposite ends of a map, which is generated at random before each round. Players roam about, interacting with buildings, trees, hallways and other features on the map, as well as with allies and opponents. They try to use their laser-like weapons to “tag” members of the opposing team; a tagged player must drop any flag he might have been carrying on the spot and return to his team’s base.

DeepMind represents each player with a software agent that sees the same screen a human player would see. The agents have no way of knowing what other agents are seeing; again, this is a much closer approximation of real strategic contests than most board games provide. Each agent begins by making choices at random, but as evidence trickles in over successive iterations of the game, it is used in a process called reinforcement learning. The result is to cause the agent’s behavior to converge on a purposeful behavior pattern, called a “policy.” 

Each agent develops its policy on its own, which means it can specialize a bit. However, there’s a limit: After every 1000 iterations of play the system compares policies and estimates how well the entire team would do if it were to mimic this or that agent. If one agent’s winning chances turn out to be less than 70 percent as high as another’s, the weaker agent copies the stronger one. Meanwhile, the reinforcement learning is itself tweaked by comparing it to other metrics. Such tweaking of the tweaker is known as meta-optimization.

Agents start out as blank slates, but they do have one feature built into their way of evaluating things. It’s called a multi–time scale recurrent neural network with external memory, and it keeps an eye not only on the score at the end of the game but also at earlier points. The researchers note that “Reward purely based on game outcome, such as win/draw/loss signal...is very sparse and delayed, resulting in no learning. Hence, we obtain more frequent rewards by considering the game points stream.”

The program generally beats human players when starting from a randomly generated position. Even after the humans had practiced for a total of 12 hours, they still were able to win just 25 percent of the games, drawing 6 percent of the time, and losing the rest.

However, when two professional game testers were given a particularly complex map that had not been used in training and were allowed to play games on that map against two software agents, the pros needed just 6 hours of training to come out on top. This result was not described in the Science paper but in a supplementary document made available to the press. The pros used their in-depth study of the map to identify the routes that the agents preferred and to work out how to avoid those routes.

So for the time being people can still beat software in a well-studied set-piece battle. Of course, real life rarely provides such opportunities. Robert E. Lee got to fight the Battle of Gettysburg just one time.