Building a Chess AI from Scratch: Combining AlphaZero and Transformers (Part 2)
This article walks through constructing a learnable chess AI by integrating AlphaZero‑style Monte Carlo Tree Search with a decoder‑only Transformer, detailing the game tree logic, model architecture, input and output encodings, self‑play training loop, and code implementation in PyTorch.
Introduction
This article continues from the previous post, which built a playable chess engine, and now designs a learnable AI. It references historic chess engines such as Deep Blue and Stockfish, as well as recent deep-learning approaches like AlphaZero.
Monte Carlo Tree Search (MCTS)
MCTS explores the game tree by recursively selecting the action with the highest Upper Confidence Bound (UCB). When the search reaches a terminal node (game over), it returns the result directly (white win, draw, or black win). When it reaches an unvisited node, a neural network predicts a value v and a policy vector p for that position. The UCB formula balances exploitation (the value accumulated from past visits) against exploration (the network's policy prior, weighted by a hyper-parameter c).
Leaf outcomes are encoded as a one-hot vector, exactly one entry of which is 1:
s_w = white win (0 or 1)
s_t = draw (0 or 1)
s_b = black win (0 or 1)
After a node is evaluated, the algorithm back-propagates the predicted value up the visited path, updating each ancestor's visit count and value statistics (the policy prior stays attached to the node itself).
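As a concrete reference, here is a minimal sketch of the selection rule in Python, assuming each child node stores a visit count n, a mean value q, and a network prior p (the attribute names are hypothetical; the formula is the PUCT variant of UCB used by AlphaZero):

import math

def select_child(node, c=1.0):
    # Pick the child maximizing Q + c * P * sqrt(N_parent) / (1 + N_child).
    total_visits = sum(child.n for child in node.children)
    def ucb(child):
        exploit = child.q                                               # mean value from past visits
        explore = c * child.p * math.sqrt(total_visits) / (1 + child.n) # prior-weighted exploration bonus
        return exploit + explore
    return max(node.children, key=ucb)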
Model Architecture
The model receives the encoded board state and outputs two heads: a value head v (probability distribution over game results) and a policy head p (probability distribution over legal moves). The core component is a decoder‑only Transformer that uses multi‑head self‑attention.
Attention
Standard attention computes queries Q, keys K, and values V from the same input embedding E. Multi‑head attention splits the input into several heads, processes them in parallel, and concatenates the results before a linear projection.
from einops import rearrange
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, input_size, layer_size=64, latent_size=None, heads=1, dropout=0.5, output=False):
        super(Attention, self).__init__()
        self.heads = heads
        inner = layer_size if latent_size is None else latent_size
        # Q, K and V must project to the same dimension for the dot product below.
        self.Q = nn.Linear(input_size, inner, bias=False)
        self.K = nn.Linear(input_size, inner, bias=False)
        self.V = nn.Linear(input_size, inner, bias=False)
        self.softmax = nn.Softmax(dim=-1)
        # Keep the flag under its own name; the original overwrote it with the layer below.
        self.use_output = output
        self.output = nn.Sequential(nn.Linear(inner, input_size, bias=False), nn.Dropout(dropout))

    def forward(self, x, context=None, mask=None):
        # Split each projection into heads: (batch, seq, heads * dim) -> (batch * heads, seq, dim)
        q = rearrange(self.Q(x), 'b y (h x) -> (b h) y x', h=self.heads)
        src = x if context is None else context  # self-attention by default, cross-attention if a context is given
        k, v = map(lambda t: rearrange(t, 'b y (h x) -> (b h) y x', h=self.heads), (self.K(src), self.V(src)))
        # Scaled dot-product scores over sequence positions: (batch * heads, seq_q, seq_k)
        z = torch.einsum('b i d, b j d -> b i j', q, k) / (q.size(-1) ** 0.5)
        if mask is not None:
            z = z.masked_fill(mask.unsqueeze(1).expand_as(z), -1e18)  # hide masked positions from the softmax
        z = self.softmax(z)
        # Weight the values by the attention distribution: (batch * heads, seq_q, dim)
        z = torch.einsum('b i j, b j d -> b i d', z, v)
        # Merge the heads back together: (batch * heads, seq, dim) -> (batch, seq, heads * dim)
        z = rearrange(z, '(b h) y x -> b y (h x)', h=self.heads)
        if self.use_output:
            z = self.output(z)
        return z
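A quick shape check of the module above; the sizes are illustrative (65 tokens = one side-to-move token plus 64 squares):

attn = Attention(input_size=64, layer_size=64, heads=4, output=True)
x = torch.randn(2, 65, 64)  # batch of 2 positions, 65 tokens, 64-dim embeddings
print(attn(x).shape)        # torch.Size([2, 65, 64])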
Decoder-Only Transformer
The implementation defines a Transformer class that stacks the Attention module, L2-normalizes its output (via nn.functional.normalize), and applies a linear projection.
class Transformer(nn.Module):
    def __init__(self, input_size, layer_size=64, heads=1, dropout=0.5):
        super(Transformer, self).__init__()
        # output=True makes the attention block project back to input_size,
        # so the linear layer below receives the shape it expects.
        self.self_attention = Attention(input_size, layer_size=layer_size, heads=heads, dropout=dropout, output=True)
        self.linear = nn.Sequential(nn.Linear(input_size, input_size, bias=False), nn.Dropout(dropout))

    def forward(self, x):
        z = self.self_attention(x)
        z = nn.functional.normalize(z)  # L2 normalization over the feature dimension
        return self.linear(z)

After the Transformer block, simple linear layers map the hidden representation to a 3-dimensional value head (white win, draw, black win) and a 4096-dimensional policy head (64 × 64 possible moves).
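The head projections themselves are not shown in the excerpt; here is a minimal sketch, assuming the Transformer output is mean-pooled over the token sequence before the two linear heads (the class name and the pooling choice are assumptions, not taken from the source):

class Heads(nn.Module):
    def __init__(self, hidden_size):
        super(Heads, self).__init__()
        self.value = nn.Linear(hidden_size, 3)      # white win / draw / black win
        self.policy = nn.Linear(hidden_size, 4096)  # 64 from-squares x 64 to-squares

    def forward(self, h):
        h = h.mean(dim=1)                           # pool over the token sequence (assumption)
        v = torch.softmax(self.value(h), dim=-1)
        p = torch.softmax(self.policy(h), dim=-1)
        return v, p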
Input Encoding
The board state is tokenized: each piece is represented by a unique token (e.g., "Pw" for white pawn, "Pb" for black pawn) and empty squares map to a dedicated 'PAD' token. The token sequence is prefixed with a side-to-move token (1 for white, 2 for black). Positional encoding is added to retain square order.
from copy import deepcopy  # encode_state copies the board before tokenizing it

def encode_state(self, game):
    temp_board = deepcopy(game.board)  # don't mutate the live game state
    for y, row in enumerate(temp_board):
        for x, piece in enumerate(row):
            if piece != 0:
                # e.g. a white pawn becomes "Pw", a black pawn becomes "Pb"
                temp_board[y][x] = f"{self.notation[abs(piece)]}{'w' if piece > 0 else 'b'}"
            else:
                temp_board[y][x] = 'PAD'  # empty square
    flat = [x for row in temp_board for x in row]
    # Look up each token's id in the token bank.
    result = [self.token_bank['token'].eq(t).idxmax() for t in flat]
    # Prefix the sequence with the side-to-move token: 1 = white, 2 = black.
    result.insert(0, 1 if game.p_move == 1 else 2)
    return torch.tensor([result])

Positional encoding follows the sinusoidal formulation from the original Transformer paper.
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Precompute the sinusoidal table once: even feature indices get sine, odd get cosine.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0).transpose(0, 1))

    def forward(self, x):
        # x is expected as (seq_len, batch, d_model); the table is added, not learned.
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

Output Encoding
The network produces two softmaxed vectors: v (the probabilities of white win, draw, and black win) and p (a probability distribution over the 4096 possible from-square/to-square pairs). During training, the target policy is one-hot encoded.
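Since the policy indexes moves by from-square and to-square, a move maps to its one-hot target with simple arithmetic. A minimal sketch; the row-major square numbering (a8 = 0, h1 = 63) is an assumption, not taken from the source:

def move_to_policy_target(from_sq, to_sq):
    # One-hot policy target for a move, given 0-63 square indices.
    target = torch.zeros(4096)
    target[from_sq * 64 + to_sq] = 1.0
    return target

# With this numbering, e2 -> e4 is squares 52 -> 36, so index 52 * 64 + 36 = 3364.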
Training via Self‑Play
Following AlphaZero, the AI learns solely from games it plays against itself. In each episode, two agents (active model and new model) alternate colors, perform MCTS‑guided moves, and record state‑action pairs. After a game ends, the collected data are labeled with the final game outcome and used to train both heads with binary cross‑entropy loss.
def play_game(game_name, epoch, train=False, white='ai', black='ai',
              active_model='model-active.pth.tar', new_model='model-new.pth.tar',
              search_amount=50, max_depth=5, best_of=5):
    # ... (omitted for brevity) ...
    return state, game_train_data, a_colour

Training steps (a minimal sketch of one update follows the list):
Sample a batch from the accumulated data.
Forward pass through TransformerModel to obtain v and p.
Compute v_loss and p_loss with BCELoss, sum them, back‑propagate, and update parameters with SGD.
Periodically replace the active model with the newly trained model if it wins > 51 % of matches in a best‑of series.
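A minimal sketch of one such update, assuming model is the TransformerModel and states, v_targets, and p_targets are batched tensors drawn from the self-play data (these names and the learning rate are assumptions):

import torch.optim as optim

criterion = nn.BCELoss()                            # both heads output probabilities
optimizer = optim.SGD(model.parameters(), lr=0.01)  # learning rate is an assumption

def train_step(states, v_targets, p_targets):
    v_pred, p_pred = model(states)         # forward pass through both heads
    v_loss = criterion(v_pred, v_targets)  # value head vs. final game outcome
    p_loss = criterion(p_pred, p_targets)  # policy head vs. one-hot move target
    loss = v_loss + p_loss                 # the two losses are simply summed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()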
Conclusion
The article demonstrates a complete pipeline for building a chess AI that learns from self‑play using Monte Carlo Tree Search and a decoder‑only Transformer, providing detailed code, mathematical formulas, and training procedures.
References:
https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)
https://en.wikipedia.org/wiki/Stockfish_(chess)
https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go
https://en.wikipedia.org/wiki/Monte_Carlo_tree_search
https://jalammar.github.io/illustrated-transformer/
https://arxiv.org/abs/1706.03762