How AlphaGo’s Deep Neural Networks Achieve Human‑Level Go Mastery

This article breaks down AlphaGo’s breakthrough architecture—four specialized neural‑network modules, Monte‑Carlo Tree Search, and deep reinforcement learning—to explain how the system moved from imitation learning to self‑improvement and ultimately defeated top human Go players.

dbaplus Community

AlphaGo Neural Network Architecture

AlphaGo uses four deep neural networks:

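Rollout Policy Network – a lightweight, fast policy that generates move probabilities for rollout simulations. In the published AlphaGo system it is a linear softmax over small local pattern features rather than a deep network, so it trades accuracy for speed and gives only a rough evaluation.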

Supervised‑Learning (SL) Policy Network – a deep convolutional network (13 layers in the published system) trained on ~30 million positions from games played by 6‑ to 9‑dan human players on the KGS Go Server. Input consists of 48 feature planes representing the 19×19 board (black stones, white stones, empty points, liberties, ko, etc.). The network outputs a probability distribution over the 361 board points. It predicts the human expert's actual move about 57 % of the time.

Reinforcement‑Learning (RL) Policy Network – initialized from the SL network's weights and further improved by self‑play using policy‑gradient reinforcement learning. After this training the RL policy wins roughly 80 % of its games against the SL policy.

Value Network – a separate convolutional network, similar in architecture to the policy networks, that takes the same board representation and outputs a single number: the predicted chance that the current player eventually wins. Its mean‑square error on a held‑out validation set is 0.22–0.23, corresponding to a roughly 80 % correct global assessment.
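As a rough illustration of the input and output shapes described above, the sketch below encodes a board into binary planes and turns 361 scores into a move distribution. It is not AlphaGo's code: it builds only three of the 48 planes, the names `encode_planes` and `move_distribution` are hypothetical, and random scores stand in for a trained network's output.

```python
import numpy as np

BOARD = 19  # 19x19 board, 361 points

def encode_planes(board):
    """board: (19, 19) int array with 0 = empty, 1 = black, 2 = white.
    Returns 3 binary planes (black, white, empty); the real input has 48."""
    return np.stack([board == 1, board == 2, board == 0]).astype(np.float32)

def move_distribution(logits, board):
    """Softmax over the 361 points with occupied points masked out.
    Masking only occupied points is a simplification of true move legality."""
    mask = board.reshape(-1) == 0
    scores = np.where(mask, logits, -np.inf)
    scores = scores - scores[mask].max()   # stabilise the exponentials
    probs = np.exp(scores)                 # masked points become exactly 0
    return probs / probs.sum()

board = np.zeros((BOARD, BOARD), dtype=np.int64)
board[3, 3], board[15, 15] = 1, 2          # one black and one white stone
planes = encode_planes(board)              # shape (3, 19, 19)

rng = np.random.default_rng(0)
probs = move_distribution(rng.normal(size=BOARD * BOARD), board)
```

Each point of the board is set in exactly one of the three planes, and the two occupied points receive zero probability.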

Figure: AlphaGo brain diagram

Monte‑Carlo Tree Search (MCTS) Integration

AlphaGo combines the networks with a Monte‑Carlo Tree Search. Each iteration of the search consists of the following steps:

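Selection: Starting from the root position, descend the tree by repeatedly choosing the move with the highest PUCT score. The score adds an exploration bonus to the move's average value; the bonus is proportional to the move's prior probability from the SL (or RL) policy network and shrinks as the move's visit count grows.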

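Expansion: When a leaf node is reached, feed its board state to the policy network and create child nodes for the candidate moves, storing the network's probability for each move as that child's prior. The search concentrates on the most promising moves rather than all continuations, up to a depth L (typically 5–10 moves).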

Evaluation: For each newly expanded node, compute two values:

Value‑network estimate V(s) – a global win‑rate prediction.

Rollout‑policy simulation – run a fast playout using the rollout policy until the game terminates, then record the outcome z.

Combine them, e.g. Q = (1‑α)·V(s) + α·z, where α balances the two signals: α = 0 uses only the value network and α = 1 uses only the rollout.

Backup: Propagate the combined value Q back up the tree. Every move (s,a) on the path from the root to the leaf increments its visit count N(s,a) and adds Q to its accumulated value W(s,a). The ratio W(s,a)/N(s,a) is then the average over all simulations that passed through that move, which implements the Monte‑Carlo averaging principle.
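The selection, evaluation, and backup steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published implementation: the `Edge` class, the function names, and the constant `c_puct` are hypothetical, the exploration bonus is one common form of PUCT, and the sketch ignores the fact that the player to move alternates at each level of the tree.

```python
import math

class Edge:
    """Statistics for one move (s, a) in the search tree."""
    def __init__(self, prior):
        self.P = prior   # prior probability from the policy network
        self.N = 0       # visit count
        self.W = 0.0     # accumulated value of simulations through this move

    def mean_value(self):
        return self.W / self.N if self.N else 0.0

def puct_select(edges, c_puct=5.0):
    """Pick the move maximising mean value plus an exploration bonus."""
    total = sum(e.N for e in edges.values())
    def score(e):
        bonus = c_puct * e.P * math.sqrt(total) / (1 + e.N)
        return e.mean_value() + bonus
    return max(edges, key=lambda a: score(edges[a]))

def leaf_value(v, z, alpha=0.5):
    """Q = (1 - alpha) * V(s) + alpha * z, mixing value network and rollout."""
    return (1 - alpha) * v + alpha * z

def backup(path, q):
    """Update every edge on the path from the root to the leaf."""
    for edge in path:
        edge.N += 1
        edge.W += q

# Toy search over a single position with three candidate moves.
edges = {"a": Edge(0.6), "b": Edge(0.3), "c": Edge(0.1)}
for _ in range(100):
    move = puct_select(edges)
    v = 0.8 if move == "b" else 0.3      # stand-in for a real evaluation
    backup([edges[move]], leaf_value(v, z=v))
```

The bonus term lets a move with a high prior be tried early even before it has any value estimate, while repeated visits shift the decision toward the measured average W/N.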

The loop is executed thousands of times (often >10 000 simulations per move). The move with the highest visit count N is selected as AlphaGo’s next play.
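Choosing the final move from the root statistics is a one-line operation. The sketch below uses made-up visit counts and a hypothetical helper name, `choose_move`.

```python
def choose_move(visit_counts):
    """Return the move with the highest visit count N."""
    return max(visit_counts, key=visit_counts.get)

# Hypothetical root statistics after roughly 10 000 simulations.
visit_counts = {"D4": 6200, "Q16": 3100, "K10": 700}
best = choose_move(visit_counts)
```

One reason to rank by visit count rather than by average value is robustness: a move visited only a handful of times can show a high average by chance, whereas a large visit count means the search kept preferring that move as evidence accumulated.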

Figure: Monte Carlo Tree Search diagram

Training Procedure

Supervised learning phase – The SL policy network is trained with stochastic gradient descent on a dataset of 30 million positions extracted from games between strong human players (6‑ to 9‑dan on the KGS Go Server). The loss combines a cross‑entropy term for move prediction with L2 regularisation. After convergence the network predicts the expert's move 57 % of the time.
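The loss described above can be written out as follows. This is a minimal NumPy sketch, not the training code: `policy_loss`, `top1_accuracy`, and the regularisation strength `l2` are hypothetical names and values, and `weights` stands in for whatever parameters the network has.

```python
import numpy as np

def policy_loss(logits, expert_moves, weights, l2=1e-4):
    """Mean cross-entropy of the expert moves plus an L2 penalty.
    logits: (batch, 361) scores; expert_moves: (batch,) move indices."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    rows = np.arange(len(expert_moves))
    cross_entropy = -log_probs[rows, expert_moves].mean()
    return cross_entropy + l2 * np.sum(weights ** 2)

def top1_accuracy(logits, expert_moves):
    """Fraction of positions where the highest-scoring move is the expert's."""
    return float((logits.argmax(axis=1) == expert_moves).mean())

# An untrained network that scores every move equally.
logits = np.zeros((4, 361))
expert_moves = np.array([0, 72, 180, 360])
loss = policy_loss(logits, expert_moves, weights=np.zeros(10))
```

With uniform scores the cross-entropy equals ln 361, which is the starting point that training has to improve on.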

Reinforcement learning phase – Starting from the SL weights, the policy plays self‑play games against earlier versions of itself. After each game the outcome (win = 1, loss = 0) is used as a reward signal. Policy‑gradient updates (REINFORCE with baseline) adjust the network parameters to increase the probability of moves played in games that were won. Over many generations the RL policy comes to win ≈80 % of its games against the SL policy.
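To make the update rule concrete, here is a sketch of one REINFORCE step with a baseline. It is a toy under stated assumptions: the policy is a single softmax whose parameters are its logits (no board input, no neural network), and `reinforce_step`, `lr`, and the baseline value are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_step(theta, moves, outcome, baseline=0.0, lr=0.1):
    """One REINFORCE update for a toy softmax policy.
    moves: indices of the moves played in one game; outcome: the game's reward."""
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for a in moves:
        one_hot = np.zeros_like(theta)
        one_hot[a] = 1.0
        grad += one_hot - probs          # gradient of log pi(a) w.r.t. the logits
    return theta + lr * (outcome - baseline) * grad

theta = np.zeros(5)                      # five candidate moves, uniform policy
before = softmax(theta)[2]
theta = reinforce_step(theta, moves=[2], outcome=1.0)   # move 2 was played in a won game
after = softmax(theta)[2]
```

The factor `outcome - baseline` scales the step: outcomes above the baseline raise the probability of the moves that were played, outcomes below it lower them, and subtracting a baseline reduces the variance of the update.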

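Value network training – The value network is trained on positions from self‑play games, using the final game outcome as the target label. Because successive positions in one game are strongly correlated and share the same label, the published system sampled each training position from a separate game to avoid overfitting. The loss is mean‑square error between the predicted win probability and the actual result. Validation error stabilises around 0.22–0.23.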
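The regression loss is simple enough to state directly. In this sketch the predictions and outcomes are made-up numbers and `value_mse` is a hypothetical helper name.

```python
import numpy as np

def value_mse(predicted, outcomes):
    """Mean-square error between predicted win probabilities and game results."""
    predicted = np.asarray(predicted, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((predicted - outcomes) ** 2))

# Four positions, each labelled with the final result of its game (1 = win, 0 = loss).
error = value_mse([0.9, 0.2, 0.6, 0.4], [1, 0, 1, 0])
```

Confident and correct predictions (0.9 for a win) contribute little to the error, while hesitant or wrong ones (0.4 or 0.6) dominate it.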

Rollout policy – A much simpler and faster model than the policy networks (in the published system, a linear softmax over small local pattern features) trained to predict human expert moves. It is used only for fast simulations during MCTS and matches human expert moves only 24.2 % of the time.

Empirical Performance

Rollout policy matches expert moves 24.2 % of the time.

SL policy matches experts 57.0 % of the time.

RL policy defeats the SL policy in ~80 % of games.

Value network mean‑square error 0.22–0.23, yielding ≈80 % correct global win‑rate predictions.

