Was AlphaGo's Move 37 Inevitable?
Jan 23, 2017 · 5 minute read

This question is interesting to me both because of the way this particular move was reported on at the time, and because it works as a starting point for me to understand the inner workings of AlphaGo. I’m talking, of course, about the AI that beat the human Go champion, Lee Sedol, last March. The famous “move 37” happened in the second game of the 5-game match, and was described by commentators, once they got over their initial shock, with words like “beautiful” and “creative.” It left Sedol utterly flummoxed, to the point where he had to spend 15 minutes contemplating his own next move.
The question, which can be interpreted in different ways, is about randomness and probability. I’ve read the paper in Nature where the researchers explain the architecture of the system and how it was trained, and there are several places where randomness enters AlphaGo, beyond whatever randomness operates at the deeper levels of training, e.g. stochastic gradient descent:
- Board positions from past human games used to train the policy network via supervised learning were randomly selected
- The output of this policy network is a probability distribution over actions that is sampled from during play (a sketch of this kind of sampling follows the list)
- The output of the faster rollout policy network is a probability distribution over actions that is sampled from during play
- Training of the second policy network, where Reinforcement Learning (RL) is used and the system plays against previous iterations of itself, uses randomly selected previous iterations of the policy network as opponents
- The output of the RL policy network is a probability distribution over actions that is sampled from during training of the value network. This value network outputs a simple scalar prediction of the outcome given a board position.
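To make the sampling in the second and third points concrete, here is a minimal sketch of what drawing a move from a policy network’s output might look like. The `policy_probs` array, the `legal_mask`, and the flat 19×19 indexing are stand-ins of my own, not AlphaGo’s actual interfaces:

```python
import numpy as np

def sample_move(policy_probs, legal_mask, rng=None):
    """Sample a move index from a policy network's output distribution.

    policy_probs: shape (361,) array of probabilities over the 19x19 board.
    legal_mask:   shape (361,) boolean array, True where a move is legal.
    """
    rng = rng or np.random.default_rng()
    probs = np.where(legal_mask, policy_probs, 0.0)  # mask out illegal moves
    probs = probs / probs.sum()                      # renormalise
    return rng.choice(probs.size, p=probs)           # stochastic, not argmax
```

Two calls with exactly the same board position can return different moves, which is the in-play randomness those points refer to.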
A naïve interpretation of the question might be as follows: if we train two different instances of AlphaGo, would they each inevitably make move 37, given the state of the board as it was? Here the answer is clearly No, because of the various sources of randomness that are introduced during training. Even if we hold fixed the first point above, so that the two instances are trained on the exact same set of human game data, the move is still not inevitable, due to the other sources of randomness during training. As explained below, the value of move 37 must have been learned during the self-play part of training, but it could have happened that such moves were never explored during self-play, so the high potential value of the move would not have been learned.
A more interesting version of the question is: given the AlphaGo instance exactly as it had been trained prior to the match with Lee Sedol, was it inevitable that it would make move 37? This holds fixed all of the randomness introduced during training, so we focus only on the randomness during play. My understanding is that the randomness introduced at this stage comes from the sampling of actions during Monte Carlo Tree Search (MCTS) simulations - both in sampling which moves to evaluate and in the evaluation of each leaf node in the tree, when the fast rollout policy plays out to the end of the game to produce an outcome estimate that feeds into the value of the move.
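As a rough illustration of that second source, here is a sketch of a rollout from a leaf node. The `state` object with its `is_terminal`, `play`, and `outcome` methods, and the `rollout_policy` callable, are hypothetical stand-ins for AlphaGo’s internals, not its real API:

```python
import numpy as np

def rollout_outcome(state, rollout_policy, rng, max_moves=400):
    """Play a game out from `state` using the fast rollout policy and
    return the final outcome (e.g. +1 for a win, -1 for a loss)."""
    for _ in range(max_moves):
        if state.is_terminal():
            break
        probs = rollout_policy(state)           # distribution over legal moves
        move = rng.choice(probs.size, p=probs)  # sampled at every step
        state = state.play(move)
    return state.outcome()
```

Because every step is sampled, repeated rollouts from the same leaf can end differently, so the evaluation fed back up the tree is itself noisy.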
Move 37 must have had a very high action value, because it had to overcome a very low probability of being played by a human. The action value comes partly from the value network and partly from the prediction of the fast rollout policy that plays to the end of the game. Since the fast rollout policy is trained on human board positions, the high value of this action must have been discovered during self-play, and so it was the value network derived from the RL policy network that assigned the high value. The same value network would do so again. Perhaps the fast rollout policy predicted a positive outcome; if so, an interesting question is whether there were other routes the rollout could have taken that would have produced a negative outcome prediction, and whether in that case the high value from the value network would have sufficed to overcome it.
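For reference, the Nature paper mixes the two evaluations of a leaf position $s_L$ with a mixing parameter $\lambda$ (reported there as 0.5):

$$ V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L $$

where $v_\theta(s_L)$ is the value network’s prediction and $z_L$ is the outcome of the fast rollout from $s_L$; the action value $Q(s,a)$ is then the average of these leaf evaluations over all simulations that passed through $(s,a)$.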
And wasn’t there also the chance that this move never would have been sampled during simulation? Although the prior probability assigned by the SL policy network becomes less and less important with the number of visits made to an action, I didn’t find anything in the paper that suggests that every possible legal move is evaluated at least once by the value network.
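The fading influence of the prior comes from the way actions are selected inside the tree. As I read the paper, the selection rule and exploration bonus have the form

$$ a_t = \arg\max_a \big( Q(s_t, a) + u(s_t, a) \big), \qquad u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)} $$

so a move with a tiny prior $P(s,a)$ starts with a small bonus, and only once it has been visited and earned a high $Q(s,a)$ does the prior stop mattering - which still leaves open whether a low-prior move gets visited at all.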
There was a lot of talk in the aftermath of the match about how surprising move 37 was because it was a move that no human player would ever have made and yet was deemed in hindsight to have been “masterful.” Is this really so surprising though? The system played millions of games against itself after having been trained on millions of human games. Given the complexity of Go, with the number of possible board positions far exceeding the number of atoms in the universe, I’d have found it more surprising if this self-play had only led to mastery of the types of moves that had already been played by humans.
At any rate, it seems clear that move 37 in game 2 was not inevitable - it could easily not have been chosen. I leave judgements as to its beauty to those more familiar with the game of Go. As to whether or not it deserves to be called “creative,” as long as we’re ok with reducing the notion of creativity to sampling from probability distributions, I’d have to say: Why not? :)