DeepMind's New AI Masters Games Without Even Being Taught the Rules

It's the next step toward self-directed learning about the real world. Cue the shark music.


The folks at DeepMind are pushing their methods one step further toward the dream of a machine that learns on its own, the way a child does.

The London-based company, a subsidiary of Alphabet, is officially publishing the research today, in Nature, although it tipped its hand back in November with a preprint on arXiv. Only now, though, are the implications becoming clear: DeepMind is already looking into real-world applications.

DeepMind won fame in 2016 for AlphaGo, a reinforcement-learning system that mastered the game of Go after training on millions of positions from expert human games. In 2018 the company followed up with AlphaZero, which trained itself to master Go, chess and shogi, all without recourse to master games or advice. Now comes MuZero, which doesn't even need to be shown the rules of the game.

The new system tries first one action, then another, learning what the rules allow, at the same time noticing the rewards that are proffered—in chess, by delivering checkmate; in Pac-Man, by swallowing a yellow dot. It then alters its methods until it hits on a way to win such rewards more readily—that is, it improves its play. Such learning by observation is ideal for any AI that faces problems that can't be specified easily. In the messy real world—apart from the abstract purity of games—such problems abound.
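The trial-and-error loop described above can be illustrated with a toy example. The sketch below is not MuZero: it uses plain tabular Q-learning, a much simpler technique, on a hypothetical five-cell corridor game invented for illustration. The agent never reads the rules in `step`; it sees only the outcome of each action and the reward that follows, and it gradually shifts toward the moves that pay off.

```python
import random

ACTIONS = ("left", "right")
GOAL = 4  # rightmost cell of a hypothetical 5-cell corridor


def step(state, action):
    """Hidden rules of the toy game. The agent only observes what comes back."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning: try actions, note rewards, adjust the estimates."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(GOAL) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                # Occasionally explore a random action.
                action = rng.choice(ACTIONS)
            else:
                # Otherwise exploit the best-looking action, breaking ties randomly.
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Move the estimate toward the reward plus the discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
    return q


def greedy_policy(q):
    """The action the agent currently prefers in each non-goal cell."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]


if __name__ == "__main__":
    print(greedy_policy(train()))
```

After training, the preferred action in every cell is to head toward the reward, even though the agent was never told where the reward was or how movement works.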

“We’re exploring the application of MuZero to video compression, something that could not have been done with AlphaZero,” says Thomas Hubert, one of the dozen co-authors of the Nature article. 

“It’s because it would be very expensive to do it with AlphaZero,” adds Julian Schrittwieser, another co-author. 

Other applications under discussion are in self-driving cars (which within Alphabet are handled by its subsidiary Waymo) and in protein design, the next step beyond protein folding (which sister program AlphaFold recently mastered). Here the goal might be to design a protein-based pharmaceutical that must act on something that is itself an actor, say a virus or a receptor on a cell’s surface.

By simultaneously learning the rules and improving its play, MuZero outdoes its DeepMind predecessors in the economical use of data. In the Atari game of Ms. Pac-Man, when MuZero was limited to considering six or seven simulations per move—“a number too small to cover all the available actions,” as DeepMind notes, in a statement—it still did quite well. 
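To see how a search can be useful with fewer simulations than there are actions, consider the following minimal sketch. It is a one-step simplification, not DeepMind's code: MuZero runs a full tree search over a learned model, whereas here the prior probabilities, the `simulate` function and the constant `c` are all hypothetical stand-ins chosen for illustration. The point is only that a prior over actions lets a small budget be spent on the promising moves rather than spread across all of them.

```python
import math


def budgeted_search(priors, simulate, budget, c=1.25):
    """Spend `budget` simulations across actions, guided by prior probabilities.

    priors   -- hypothetical prior probability for each action
    simulate -- hypothetical function returning an estimated value for an action
    Returns the visit count for each action.
    """
    visits = [0] * len(priors)
    total_value = [0.0] * len(priors)

    for _ in range(budget):
        done_so_far = sum(visits)

        def score(a):
            # Average value seen so far, plus a bonus that favors
            # high-prior actions that have been tried less often.
            mean = total_value[a] / visits[a] if visits[a] else 0.0
            bonus = c * priors[a] * math.sqrt(done_so_far + 1) / (1 + visits[a])
            return mean + bonus

        action = max(range(len(priors)), key=score)
        total_value[action] += simulate(action)
        visits[action] += 1

    return visits


if __name__ == "__main__":
    # Nine possible actions, but only six simulations to spend.
    priors = [0.30, 0.25, 0.15, 0.10, 0.05, 0.05, 0.04, 0.03, 0.03]
    values = [0.2, 0.8, 0.5, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(budgeted_search(priors, lambda a: values[a], budget=6))
```

With nine actions and a budget of six, at least three actions are never simulated at all; the search still has to commit to a move, so the quality of the prior matters a great deal.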

The system takes a fair amount of computing muscle to train, but once trained, it needs so little processing to make its decisions that the entire operation might be managed on a smartphone. “And even the training isn’t so much,” says Schrittwieser. “An Atari game would take 2-3 weeks to train on a single GPU.”

One reason for the lean operation is that MuZero models only those aspects of its environment—in a game or in the world—that matter in the decision-making process. “After all, knowing an umbrella will keep you dry is more useful to know than modeling the pattern of raindrops in the air,” DeepMind notes, in a statement.

Knowing what’s important is important. Chess lore relates a story in which a famous grandmaster is asked how many moves ahead he looks. “Only one,” intones the champion, “but it is always the best.” That is, of course, an exaggeration, yet it holds a kernel of truth: Strong chess players generally examine lines of analysis that span only a few dozen positions, but they know at a glance which ones are worth looking at.

Children can learn a general pattern after exposure to a very few instances—inferring Niagara from a drop of water, as it were. This astounding power of generalization has intrigued psychologists for generations; the linguist Noam Chomsky once argued that children had to be hard-wired with the basics of grammar because otherwise the “poverty of the stimulus” would have made it impossible for them to learn how to talk. Now, though, this idea is coming into question; maybe children really do glean much from very little.

Perhaps machines, too, are in the early stages of learning how to learn in that fashion. Cue the shark music!
