How AlphaGo’s Four‑Component Architecture Powers Master‑Level Go Play
This article breaks down AlphaGo’s four‑part system—policy network, fast rollout, value network, and Monte Carlo Tree Search—explaining their functions, training methods, and how they combine to achieve professional‑grade Go performance, while comparing them with the DarkForest implementation.
AlphaGo System Overview
AlphaGo consists of four main components: a policy network that predicts the next move, a fast rollout that sacrifices some move quality for speed, a value network that estimates win probabilities, and Monte Carlo Tree Search (MCTS) that integrates the three parts into a complete system.
Policy Network
The policy network takes the current board position as input and outputs a score for every possible move (361 points on a 19×19 board). DarkForest improves on this by training the network to predict the next three moves, achieving move-prediction quality comparable to reinforcement-learning (RL) networks; even so, the final system uses a supervised-learning (SL) network in search because it offers better move diversity.
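The network's raw per-point scores become a move distribution via a softmax over all 361 points. A minimal pure-Python sketch (the logits below are invented for illustration, not real network output):

```python
import math

def policy_softmax(logits):
    """Turn the network's raw scores for all 361 points of a 19x19
    board into a probability distribution over moves."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: a flat board where one point scores highest.
logits = [0.0] * 361
logits[60] = 3.0                         # pretend the network favors point 60
probs = policy_softmax(logits)
best_move = probs.index(max(probs))      # -> 60
```

Searching then samples from (or ranks) this distribution rather than always taking the single best move, which is where the SL network's diversity pays off.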
AlphaGo uses a relatively narrow network (192 convolution filters per layer) for speed; a wider 384-filter network would likely be stronger if GPU resources allowed.
Fast Rollout
Fast rollout runs at microsecond speed, about 1,000× faster than the policy network, keeping the CPU busy while waiting for the network’s move. It also provides board evaluation by simulating games to the end, trading off simulation quality for quantity to improve overall strength.
AlphaGo implements fast rollout using local pattern matching and logistic regression, achieving 24.2% move-prediction accuracy at 2 µs per move, compared with the policy network's 57% at roughly 3 ms.
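A toy version of the pattern-plus-logistic-regression idea, with entirely made-up pattern names and weights (AlphaGo's real features are handcrafted local patterns with learned coefficients):

```python
import math
import random

# Hypothetical pattern weights, standing in for coefficients a
# logistic regression would learn from expert games.
PATTERN_WEIGHTS = {"atari_escape": 2.1, "edge_hane": 0.8, "empty_corner": -0.5}

def move_score(features):
    """Logistic regression: sigmoid of the summed weights of the
    local patterns matched around a candidate move."""
    z = sum(PATTERN_WEIGHTS.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-z))

def rollout_pick(candidates, rng):
    """Sample one move in proportion to its score; each call is only
    table lookups plus a sigmoid, which is why rollouts can run in
    microseconds while the policy network needs milliseconds."""
    scores = [move_score(feats) for _, feats in candidates]
    r = rng.random() * sum(scores)
    for (move, _), s in zip(candidates, scores):
        r -= s
        if r <= 0:
            return move
    return candidates[-1][0]

moves = [("D4", ["empty_corner"]), ("Q16", ["atari_escape"]), ("K10", [])]
choice = rollout_pick(moves, random.Random(0))
```

The sampled (rather than greedy) choice is what lets thousands of cheap, noisy simulations average into a useful evaluation.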
Value Network
The value network estimates the win probability of the current position. It contributes roughly 480 Elo points, compared with the 800–1,000 points attributable to the policy network. Training it required 30 million self-play games, sampling only one position per game to avoid over-fitting.
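The one-position-per-game sampling rule is simple to state in code; a sketch (the game records here are placeholder strings, not real training data):

```python
import random

def sample_value_net_data(games, rng):
    """Take exactly one (position, outcome) pair from each self-play
    game. Positions within a single game are strongly correlated, so
    training on all of them would let the network memorize games
    instead of learning to evaluate positions."""
    return [(rng.choice(positions), outcome) for positions, outcome in games]

# Hypothetical game records: a list of positions plus the final result.
games = [([f"game{i}_move{j}" for j in range(200)], i % 2) for i in range(3)]
data = sample_value_net_data(games, random.Random(7))
```

With one sample per game, 30 million games yield 30 million nearly independent training positions.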
Surprisingly, AlphaGo does not use explicit local life‑and‑death analysis; the deep convolutional network learns to approximate these evaluations automatically.
Monte Carlo Tree Search (MCTS)
MCTS combines the three components, using a prior‑guided UCT that first expands moves favored by the policy network and later explores less‑promising moves as needed. DarkForest selects the top 3–5 policy moves for search, achieving similar performance.
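A minimal sketch of prior-guided selection in the PUCT style; the constant and node bookkeeping below are illustrative choices, not AlphaGo's exact values:

```python
import math

def select_child(children, parent_visits, c_puct=5.0):
    """Pick the child maximizing Q + U, where U is an exploration
    bonus proportional to the policy network's prior. High-prior
    moves are expanded first; low-prior moves get explored only once
    the favorites have accumulated visits and their bonus shrinks."""
    def score(ch):
        q = ch["value_sum"] / ch["visits"] if ch["visits"] else 0.0
        u = c_puct * ch["prior"] * math.sqrt(parent_visits) / (1 + ch["visits"])
        return q + u
    return max(children, key=score)

children = [
    {"move": "A", "prior": 0.6, "visits": 0, "value_sum": 0.0},
    {"move": "B", "prior": 0.1, "visits": 0, "value_sum": 0.0},
]
first = select_child(children, parent_visits=1)  # the high-prior move "A"
```

The `1 + visits` denominator is what eventually lets a low-prior but high-value move overtake an over-visited favorite.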
AlphaGo expands leaf nodes only after a visit count threshold (e.g., 40), conserving GPU resources and improving leaf evaluation accuracy.
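The visit-count gate for expansion can be sketched as follows (the threshold of 40 comes from the text above; the node layout is made up):

```python
EXPAND_THRESHOLD = 40  # from the article: expand a leaf only after ~40 visits

def should_expand(node):
    """Defer the GPU-expensive policy-network call until a leaf has
    proven interesting; averaging many rollout results first also
    gives a steadier evaluation than a single noisy simulation."""
    return node["visits"] >= EXPAND_THRESHOLD and not node["children"]

leaf = {"visits": 39, "children": []}
assert not should_expand(leaf)   # still below the threshold
leaf["visits"] += 1
assert should_expand(leaf)       # 40th visit triggers expansion
```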
Summary and Insights
The success of AlphaGo stems from the systematic integration of deep learning components and traditional search, not from a single breakthrough. Reinforcement learning mainly supplies high‑quality training data rather than directly improving play. The system still relies heavily on massive data and computational resources.