Illustrated Transformer: Comprehensive Explanation and Code Implementation
This article provides a step‑by‑step illustrated guide to the Transformer architecture, covering its macro structure, detailed self‑attention mechanisms, multi‑head attention, positional encoding, residual connections, decoder operation, training process, loss functions, and includes complete PyTorch and custom Python code examples.
Preface
This translation of the illustrated Transformer article explains the model from input to output, adding original explanations and simple code for Self‑Attention and multi‑head attention matrix operations.
1. Macro Understanding of Transformer
The model treats the whole system as a black box that receives a source sentence and outputs a translated sentence. The architecture consists of an Encoder stack on the left and a Decoder stack on the right, each typically with six identical layers.
Each Encoder layer contains two sub‑layers: a Self‑Attention layer and a Feed‑Forward Neural Network (FFNN). The Decoder layers have an additional Encoder‑Decoder Attention sub‑layer.
2. Detailed Understanding of Transformer
2.1 Transformer Input
Words are first converted to embeddings (commonly 256 or 512 dimensions; the example uses 4‑dimensional vectors for simplicity). Sentences are padded or truncated to a fixed length.
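As a minimal sketch of this input step, the toy example below uses PyTorch's `nn.Embedding` with a hypothetical 10-word vocabulary and the article's simplified 4-dimensional embeddings; the vocabulary size, token ids, and padding length are all illustrative assumptions, not taken from the original:

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: a 10-token vocabulary embedded into 4 dimensions,
# matching the article's simplified 4-dimensional example.
torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)

# A "sentence" of token ids, padded with id 0 to a fixed length of 5.
tokens = torch.tensor([[3, 7, 1, 0, 0]])
vectors = embedding(tokens)
print(vectors.shape)  # torch.Size([1, 5, 4])
```

`padding_idx=0` keeps the padding positions as zero vectors, so they carry no content into the encoder.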
2.2 Encoder
The Encoder receives a list of word vectors, processes them through Self‑Attention, then through the FFNN, and passes the result to the next Encoder layer. Each position follows its own computational path.
3. Self‑Attention Overview
Self‑Attention allows each word to attend to all other words in the sentence, enabling the model to capture dependencies such as pronoun references.
4. Self‑Attention Details
4.1 Compute Query, Key, Value Vectors
For each input word vector, three new vectors are created by multiplying with learned weight matrices W_Q, W_K, and W_V. These vectors are typically lower‑dimensional than the original embedding.
4.2 Compute Attention Scores
The score between two words is the dot product of the first word's Query vector with the second word's Key vector. Each word's scores against all words are scaled, passed through Softmax to obtain attention weights, and those weights multiply the corresponding Value vectors, which are summed to produce that word's output.
5. Matrix Computation of Self‑Attention
All words are stacked into matrix X, then multiplied by weight matrices to obtain Q, K, V matrices. The attention computation is performed with matrix multiplications, enabling parallel computation for all positions.
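The matrix form described above can be sketched as follows; the function name and the small random matrices are illustrative, using the article's 4-dimensional embeddings and an assumed 3-dimensional Q/K/V space:

```python
import torch

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention in matrix form (illustrative sketch)."""
    Q = X @ W_q                      # queries: [seq_len, d_k]
    K = X @ W_k                      # keys:    [seq_len, d_k]
    V = X @ W_v                      # values:  [seq_len, d_v]
    d_k = K.shape[-1]
    scores = Q @ K.T / d_k ** 0.5    # [seq_len, seq_len] attention scores
    weights = torch.softmax(scores, dim=-1)
    return weights @ V               # weighted sum of values: [seq_len, d_v]

torch.manual_seed(0)
X = torch.randn(3, 4)                # 3 words, 4-dim embeddings (as in the article)
W_q, W_k, W_v = (torch.randn(4, 3) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # torch.Size([3, 3])
```

Because every row of `scores` is computed in one matrix multiplication, all positions are attended to in parallel, which is the key efficiency gain over recurrent models.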
6. Multi‑Head Attention
Multiple attention heads (e.g., 8) are created by projecting Q, K, V into separate sub‑spaces, computing attention in each head, and concatenating the results before a final linear projection.
7. Code Implementation of Attention
7.1 PyTorch Implementation
The built-in module has the signature:

```python
torch.nn.MultiheadAttention(embed_dim, num_heads, dropout=0.0, bias=True,
                            add_bias_kv=False, add_zero_attn=False,
                            kdim=None, vdim=None)
```

Key arguments include embed_dim (the dimension of Q/K/V), num_heads (which must divide embed_dim evenly), and optional attention masks passed at call time.
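A minimal usage sketch of this module for self-attention follows; the batch size, sequence length, and dimensions are illustrative choices, and `batch_first=True` (available in recent PyTorch versions) is used so inputs are `[batch, seq_len, embed_dim]`:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(2, 10, 512)       # [batch, seq_len, embed_dim]
out, attn_weights = mha(x, x, x)  # self-attention: query = key = value
print(out.shape)                  # torch.Size([2, 10, 512])
print(attn_weights.shape)         # torch.Size([2, 10, 10]), averaged over heads
```

Passing the same tensor as query, key, and value gives self-attention; in the decoder's encoder-decoder attention, key and value would instead come from the encoder output.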
7.2 Manual Implementation
```python
import torch
import torch.nn as nn

class MultiheadAttention(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout):
        super().__init__()
        assert hid_dim % n_heads == 0, "hid_dim must be divisible by n_heads"
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        # Linear projections for Q, K, V and the final output
        self.w_q = nn.Linear(hid_dim, hid_dim)
        self.w_k = nn.Linear(hid_dim, hid_dim)
        self.w_v = nn.Linear(hid_dim, hid_dim)
        self.fc = nn.Linear(hid_dim, hid_dim)
        self.do = nn.Dropout(dropout)
        # Scale by sqrt(d_k), the per-head dimension
        # (a plain float avoids device-mismatch issues with a CPU tensor)
        self.scale = (hid_dim // n_heads) ** 0.5

    def forward(self, query, key, value, mask=None):
        bsz = query.shape[0]
        # Project inputs: [batch, seq_len, hid_dim]
        Q = self.w_q(query)
        K = self.w_k(key)
        V = self.w_v(value)
        # Split into heads: [batch, n_heads, seq_len, head_dim]
        Q = Q.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        K = K.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        V = V.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
        # Scaled dot-product scores: [batch, n_heads, seq_len, seq_len]
        attention = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
        if mask is not None:
            attention = attention.masked_fill(mask == 0, -1e10)
        attention = self.do(torch.softmax(attention, dim=-1))
        # Weighted sum of values, then merge heads back to [batch, seq_len, hid_dim]
        x = torch.matmul(attention, V)
        x = x.permute(0, 2, 1, 3).contiguous()
        x = x.view(bsz, -1, self.hid_dim)
        return self.fc(x)
```

7.3 Key Code Snippet
```python
# Split Q, K, V into multiple heads: [batch, n_heads, seq_len, head_dim]
Q = Q.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
K = K.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
V = V.view(bsz, -1, self.n_heads, self.hid_dim // self.n_heads).permute(0, 2, 1, 3)
```

8. Positional Encoding
Since the model has no recurrence, sinusoidal positional encodings are added to word embeddings to provide order information. The encoding uses sine for even dimensions and cosine for odd dimensions, allowing extrapolation to longer sequences.
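The sine/cosine scheme described above can be sketched as a short function; the function name and the toy sizes are illustrative:

```python
import torch

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding: sine on even indices, cosine on odd."""
    position = torch.arange(max_len).unsqueeze(1).float()   # [max_len, 1]
    # Frequencies decay geometrically from 1 down to 1/10000 across dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-torch.log(torch.tensor(10000.0)) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=4)
print(pe.shape)  # torch.Size([50, 4])
# These encodings are added element-wise to the word embeddings
# before the first encoder layer.
```

Because each position is a fixed function of its index rather than a learned vector, the same formula can produce encodings for positions never seen in training.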
9. Residual Connections
Each sub‑layer (Self‑Attention and FFNN) is wrapped with a residual connection followed by layer normalization, facilitating gradient flow and stable training.
10. Decoder
The Decoder mirrors the Encoder but adds a masked Self‑Attention (preventing attention to future positions) and an Encoder‑Decoder Attention that attends to the Encoder outputs.
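The masking that prevents attention to future positions can be sketched with a lower-triangular matrix; the sequence length here is an arbitrary illustrative choice:

```python
import torch

seq_len = 5
# Lower-triangular mask: position i may attend only to positions <= i
mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

scores = torch.randn(seq_len, seq_len)
# Masked (future) positions get -inf, so softmax assigns them zero weight
scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
print(weights)
# Each row's weights beyond its own position are exactly zero,
# and each row still sums to 1 over the visible positions.
```

During training this lets the decoder process the whole target sequence in parallel while each position only "sees" the tokens before it, matching autoregressive generation at inference time.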
11. Final Linear and Softmax Layers
The Decoder output is projected to the vocabulary size via a linear layer, then a Softmax converts logits to probabilities for word selection.
12. Training Process
During training, the model’s output distributions are compared to ground‑truth tokens using a loss function (e.g., cross‑entropy). The network is optimized via back‑propagation to minimize this loss.
13. Loss Function
Cross‑entropy (or KL‑divergence) measures the difference between predicted and true probability distributions, guiding the model to produce accurate translations.
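A minimal sketch of this loss computation, assuming a hypothetical 6-word vocabulary and a 3-token target sequence (both made up for illustration):

```python
import torch
import torch.nn as nn

vocab_size = 6
# Hypothetical decoder output logits for a 3-token target sequence
torch.manual_seed(0)
logits = torch.randn(3, vocab_size)
target = torch.tensor([2, 5, 0])  # ground-truth token ids

# CrossEntropyLoss applies log-softmax internally, so it takes raw logits
loss = nn.CrossEntropyLoss()(logits, target)
print(loss.item())  # a non-negative scalar; training minimizes this via backprop
```

In practice the Transformer paper also applies label smoothing, which softens the one-hot targets slightly and tends to improve generalization.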
Further Reading
Attention Is All You Need (https://arxiv.org/abs/1706.03762)
Transformer: A Novel Neural Network Architecture for Language Understanding (https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html)
Tensor2Tensor announcement (https://ai.googleblog.com/2017/06/accelerating-deep-learning-research.html)
Łukasz Kaiser’s talk (https://www.youtube.com/watch?v=rBCqOTEfxvg)
Tensor2Tensor Jupyter notebook (https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb)
Tensor2Tensor GitHub repository (https://github.com/tensorflow/tensor2tensor)