AlphaGo, a largely self-taught Go-playing AI, last night won the fifth and final game in a match held in Seoul, South Korea, against that country’s Lee Sedol. Lee is one of the greatest modern players of the ancient Chinese game. The final score was 4 games to 1.
Thus falls the last and computationally hardest game that programmers have taken as a test of machine intelligence. Chess, AI’s original touchstone, fell to the machines 19 years ago, but Go had been expected to last for many years to come.
The sweeping victory means far more than the US $1 million prize, which Google’s London-based acquisition, DeepMind, says it will give to charity. That’s because AlphaGo, for all its processing power, mainly owes its victory to a radical new way of using that power: via deep neural networks. These networks can train themselves with only a little intervention from human beings, and DeepMind’s researchers had already demonstrated that they can master a wide range of computer video games. The researchers hope that this generalizability can be carried over to the mastering of practical tasks in many other domains, including medicine and robotics.
Game programming began with chess, using methods first sketched out by Claude Shannon and Alan Turing in the 1940s. A machine calculates every possible continuation for each side, working its way as many moves ahead as it can and so generating a tree of analysis with millions of game positions. It then grades the positions by applying rules of thumb that even beginning chess players know, such as the differing values of the various pieces and the importance of controlling the center of the board. Finally, the algorithm traces its way from those end positions back to the current position to find the move that leads to the best outcome, assuming perfect play on both sides.
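The backing-up step is easier to see in miniature. Here is a minimal sketch in Python, assuming a toy tree in which each leaf is simply a number standing in for the graded end position; the names `minimax`, `best_move`, and `tree` are invented for this illustration and come from no real chess program.

```python
from typing import Union

# A toy game tree: a leaf is a number (the heuristic grade of an end position,
# seen from the first player's side); an inner node is a list of child nodes.
Tree = Union[float, list]

def minimax(node: Tree, maximizing: bool = True) -> float:
    """Back up leaf grades toward the root, assuming best play by both sides."""
    if not isinstance(node, list):
        return node  # end position: its grade is its value
    scores = [minimax(child, not maximizing) for child in node]
    return max(scores) if maximizing else min(scores)

def best_move(root: list) -> int:
    """Index of the root move whose backed-up grade is highest."""
    scores = [minimax(child, maximizing=False) for child in root]
    return scores.index(max(scores))

# Three candidate moves, each answered by one of two replies (hypothetical grades).
tree = [[3, 5], [2, 9], [0, 1]]
print(best_move(tree), minimax(tree))
```

The second move holds the single best leaf, a 9, but the opponent would never allow it, so the search settles on the first move, whose worst case is the least bad.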
With modern hardware, this “brute-force” method can produce a strong chess-playing program. Add a grab-bag of tricks to “prune” the analysis tree, throwing out bad lines so the program can explore promising lines more deeply, and you get world-champion-level play. That came in 1997, when IBM’s Deep Blue supercomputer defeated then-World Chess Champion Garry Kasparov. Today you can download a US $100 program that plays even better—on a laptop.
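One of the best-known pruning tricks is alpha-beta search, which abandons a line as soon as it is provably no better than one already found. The sketch below assumes the same kind of toy tree of numeric grades; the function name and the `seen` list, which records the end positions actually graded, are inventions for this example, and Deep Blue's own methods were far more elaborate.

```python
import math

def alphabeta(node, alpha, beta, maximizing, seen):
    """Minimax value of `node`, skipping branches that cannot change the result."""
    if not isinstance(node, list):
        seen.append(node)  # record each end position we actually grade
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, alpha, beta, False, seen))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the opponent would never allow this line
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, alpha, beta, True, seen))
        beta = min(beta, best)
        if alpha >= beta:
            break  # we already have a better option elsewhere
    return best

# Hypothetical grades: three moves, each with two possible replies.
tree = [[3, 5], [2, 9], [0, 1]]
seen = []
value = alphabeta(tree, -math.inf, math.inf, True, seen)
print(value, seen)
```

On this tiny tree the search grades only four of the six end positions and still returns the same answer as an exhaustive search. On a real chess tree, the savings are what let the program look deeper along the lines that matter.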
Though some researchers have argued for some time that brute-force searching can in principle conquer Go, the game has long resisted such efforts. Compared to chess, the Chinese game offers far more moves in a given position and far more moves in a typical game, creating an intractably huge tree of analysis. It also lacks reliable rules of thumb for the grading of positions.
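A back-of-the-envelope calculation shows the scale of the problem. The figures in this sketch are rough, commonly cited averages, not numbers from the match or from this article: about 35 legal moves per position over about 80 moves for chess, and about 250 legal moves over about 150 moves for Go.

```python
import math

def log10_tree_size(branching: float, depth: int) -> float:
    """Base-10 exponent of branching**depth, the rough count of lines of play."""
    return depth * math.log10(branching)

# Assumed averages (see the paragraph above); real games vary widely.
chess_exp = log10_tree_size(35, 80)
go_exp = log10_tree_size(250, 150)

print(f"chess: roughly 10^{chess_exp:.0f} lines of play")
print(f"Go:    roughly 10^{go_exp:.0f} lines of play")
print(f"gap:   about {go_exp - chess_exp:.0f} orders of magnitude")
```

Under these assumptions the gap comes to more than 200 orders of magnitude, which is why faster hardware alone was never going to close it.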
In recent years, many programmers have tried to get around this problem with Monte Carlo simulation, a statistical means of estimating the best move by playing out a large random sample of the games that might follow from a given position and seeing which move wins most often. That method is also used a bit in AlphaGo, together with the tree-generating methods of yore. But the key improvement is AlphaGo’s use of deep neural networks to recognize patterns.
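The Monte Carlo idea can be shown on a game far simpler than Go. The sketch below assumes a toy game in which players alternately remove one to three stones from a pile and whoever takes the last stone wins; each candidate move is scored by the share of purely random playouts that end in a win. The game, the function names, and the playout count are all choices made for this illustration.

```python
import random

def random_playout(stones, my_turn, rng):
    """Play random legal moves to the end; True if 'we' take the last stone."""
    while True:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return my_turn  # whoever just moved took the last stone and wins
        my_turn = not my_turn

def monte_carlo_move(stones, playouts=2000, seed=0):
    """Score each first move by its win rate over many random continuations."""
    rng = random.Random(seed)
    win_rate = {}
    for take in range(1, min(3, stones) + 1):
        rest = stones - take
        if rest == 0:
            win_rate[take] = 1.0  # taking the last stone wins outright
            continue
        # After our move the opponent is to play, so my_turn starts as False.
        wins = sum(random_playout(rest, False, rng) for _ in range(playouts))
        win_rate[take] = wins / playouts
    return max(win_rate, key=win_rate.get), win_rate

move, rates = monte_carlo_move(5)
print(move, rates)
```

With five stones on the table, taking one wins about two-thirds of the random playouts while the alternatives win about half, so the sampler picks it without any rule of thumb for grading positions. That happens to be the correct move here, though random sampling comes with no such guarantee in general.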
At a quiet moment, 42 minutes into the streaming of the match’s second game, on 10 March, one of the online commenters, Google’s Thore Graepel, described his first over-the-board encounter with an early form of AlphaGo a year ago, on his first day of work at DeepMind’s London office. “I thought, neural network, how difficult can it be? It cannot even do reading of positions, it just does pattern recognitions,” Graepel said. “I sat down in front of the board, a small crowd gathered round, and in a small time, my position started to deteriorate… I ended up losing, I tried again and lost again. At least at that point the office knew me, I had a good introduction!”
AlphaGo uses two neural networks: a policy network, trained on millions of positions from games between strong human players with the goal of imitating their play, and a value network that tries to assign a winning probability to each given position. That way, the machine can focus its efforts on the most promising continuations. Then comes the tree-searching part, which tries to look many moves ahead.
“One way to think of it is that the policy network provides a guide, suggesting to AlphaGo moves to consider; but AlphaGo can then go on beyond that and come up with a new conclusion that overwhelms the suggestion by the policy network,” explained David Silver, the leader of the AlphaGo team, in online commentary last night, just before the final game. “At every part of the search tree, it’s using the policy network to suggest moves and the value network to evaluate moves. The policy network alone was enough to beat Graepel, an accomplished amateur player, on his first day in the office.”
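Silver’s point, that the search can overrule the policy network’s first suggestion, can be sketched with a simple selection rule. The version below is an illustration only, not DeepMind’s code: each move is scored by its average evaluation so far plus a bonus that is proportional to the policy network’s prior and shrinks as the move is visited more often. The function name, the constant `c_explore`, and all of the numbers are hypothetical.

```python
import math

def select_move(prior, value_sum, visits, c_explore=1.0):
    """Pick the move with the best mean value plus a prior-weighted bonus."""
    total = sum(visits.values())

    def score(move):
        n = visits[move]
        mean_value = value_sum[move] / n if n else 0.0
        # The bonus favors moves the policy likes and fades with repeated visits.
        bonus = c_explore * prior[move] * math.sqrt(total + 1) / (1 + n)
        return mean_value + bonus

    return max(prior, key=score)

# Hypothetical policy-network priors for three candidate moves.
prior = {"A": 0.7, "B": 0.2, "C": 0.1}

# Before any search, only the prior speaks.
no_visits = {m: 0 for m in prior}
no_values = {m: 0.0 for m in prior}
first_choice = select_move(prior, no_values, no_visits)

# After some search, hypothetical value-network evaluations have piled up.
visits = {"A": 50, "B": 30, "C": 5}
value_sum = {"A": 20.0, "B": 21.0, "C": 1.0}
later_choice = select_move(prior, value_sum, visits)
print(first_choice, later_choice)
```

At the start the rule follows the policy network’s favorite, A. Once the accumulated evaluations show B doing better, the rule switches to B despite its lower prior.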
A strange consequence of AlphaGo’s division of labor is the way it plays once it thinks it has a clearly winning game. A human player would normally try to win by the largest possible margin, by capturing not just one extra point on the board, but 10 or 20 points, if possible. That way, the human would be likely to win even if he later makes a small mistake. But AlphaGo prefers to win by one point, at what it considers a high probability, over winning, say, by 20 points, at a rather lower probability.
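The difference between the two habits comes down to which quantity gets maximized. A minimal sketch with two made-up candidate moves, whose win probabilities and margins are hypothetical numbers chosen only to mirror the example above:

```python
# Hypothetical candidates: estimated chance of winning and margin if the win holds.
candidates = {
    "solid": {"win_prob": 0.95, "margin": 1},
    "ambitious": {"win_prob": 0.80, "margin": 20},
}

# A player who maximizes the chance of winning, as AlphaGo is described as doing.
by_win_prob = max(candidates, key=lambda m: candidates[m]["win_prob"])

# A player who goes for the biggest margin instead.
by_margin = max(candidates, key=lambda m: candidates[m]["margin"])

print(by_win_prob, by_margin)
```

The first rule picks the solid one-point win and the second picks the ambitious line, even though nothing about the position has changed.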
You might think that this tendency to go for the safe-but-slack move is what enabled Lee Sedol to win the fourth game, on Sunday. And indeed, commentators at the time noted that the machine seemed to have the upper hand when Lee pounced with an unexpected move, after which the machine played some weak moves. Lee had used up a lot of time on his clock and so had to scramble to make his following moves, but in the end he was able to sustain his advantage and finally win.
However, it wasn’t slackness but sheer surprise that caused the problem, members of the DeepMind team said last night, in commentary before the final game. “That crucial move that Lee found, move 78, was considered very unlikely to be played by a human; [the program] estimated a one in 10,000 chance,” said David Silver, the team leader. “It’s fair to say that AlphaGo was surprised by this move; it had to start replanning at that point.”
A human player faced with a strange-looking move would study it deeply, if there was enough time to do so—and AlphaGo had plenty of time. “But AlphaGo has a simple time-control strategy,” Silver noted. “Maybe that’s something we can work on in future.”
So, it seems, efforts to improve AlphaGo will continue.
“An exciting direction for future research is to consider whether a machine can learn completely by itself, without any human examples, to achieve this level of performance,” Silver said.
The final goal, of course, is to create an all-around learning machine, one that can learn to do a lot of things that people now get paid to do. Like, say, reporting and writing blog posts like this one.