Build a Mini LLM from Scratch: Step‑by‑Step Guide to Tokenizer, Attention, and Transformer
This article walks through constructing a small large‑language model from the ground up: model architecture, tokenization methods and BPE vocabulary building, embeddings and positional encoding, attention and multi‑head attention, transformer blocks, the training pipeline, inference, and sampling strategies, all with runnable Python code.
1. Model Overview
The implementation follows the GPT‑2 architecture: an embedding layer, positional embeddings, a stack of transformer blocks, and a final linear head that projects back to the vocabulary.
HelloWorld Example
A minimal class QdogBaby demonstrates a rule‑based replacement, illustrating why a real LLM must predict token probabilities instead of performing fixed string substitutions.
class QdogBaby:
    def chat(self, text):
        if text.endswith('吗?'):
            return text.replace('吗?', '!')
        return text

model = QdogBaby()
print(model.chat('会说话吗?'))
print(model.chat('是人工智能吗?'))

2. Tokenizer
Tokenization converts raw text into discrete units (tokens) that the model can process. Three common strategies are:
Character‑level : splits every character (e.g., [T, o, d, a, y, …]). Small vocabulary but long sequences.
Word‑level : splits on whitespace (e.g., [Today, is, sunday, .]). Shorter sequences but large vocabularies and OOV issues.
Subword‑level : balances the two by keeping frequent words whole and breaking rare words into meaningful fragments (e.g., [To, day, is, sun, day, .]). This is the dominant choice for modern LLMs.
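The three granularities can be sketched on the article's example sentence; the subword split is hand‑picked for illustration, since a real tokenizer learns its merges from data:

```python
text = "Today is sunday."

char_tokens = list(text)                       # character-level: one token per character
word_tokens = text.replace('.', ' .').split()  # word-level: split on whitespace
# Subword-level (illustrative splits, as a learned BPE vocabulary might produce):
subword_tokens = ['To', 'day', 'is', 'sun', 'day', '.']

print(len(char_tokens), len(word_tokens), len(subword_tokens))
```

The sequence lengths (16, 4, 6) show the trade‑off: character‑level yields the longest sequence, word‑level the shortest but with the largest vocabulary, and subword‑level sits in between.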
BPE Algorithm
Byte‑Pair Encoding iteratively merges the most frequent adjacent symbol pairs until a target vocabulary size is reached.
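One merge step of this procedure can be sketched in code. The toy corpus (word frequencies over symbol tuples) and the helper names are illustrative, not a full BPE implementation:

```python
from collections import Counter

def most_frequent_pair(corpus):
    """corpus: dict mapping a tuple of symbols to its word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    new_sym = ''.join(pair)
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(new_sym)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

corpus = {('a', 'p', 'p', 'l', 'e', '</w>'): 5, ('a', 'p', 'e', '</w>'): 3}
pair = most_frequent_pair(corpus)   # ('a', 'p') occurs 8 times
merged = merge_pair(corpus, pair)
print(pair, merged)
```

Repeating this pick‑and‑merge loop until the vocabulary reaches the target size produces the merge table a BPE tokenizer uses at encoding time.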
# Initial vocabulary
V = {a, b, e, g, l, n, p, r, s, </w>}
# First merge (a,p) → ap
V = {ap, a, b, e, g, l, n, p, r, s, </w>}

3. Encoding
After tokenization each token receives a unique integer ID from a lookup table. For example, the sentence "today is sunday" becomes [1024, 2046, 1025, 2047, 2046], a numeric sequence ready for the model.
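The lookup itself is a plain dictionary access; the vocabulary below is hypothetical, with IDs chosen to mirror the example in the text:

```python
# Hypothetical token -> ID table (a real tokenizer's vocabulary is learned)
vocab = {'to': 1024, 'day': 2046, 'is': 1025, 'sun': 2047}

tokens = ['to', 'day', 'is', 'sun', 'day']
input_ids = [vocab[t] for t in tokens]
print(input_ids)  # [1024, 2046, 1025, 2047, 2046]
```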
4. Embedding & Positional Encoding
Token IDs are lifted into high‑dimensional vectors via an embedding matrix (e.g., vocab size 6400, dimension 512). Positional embeddings are added so the model can distinguish different word orders.
import torch
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./tokenizer")
input_ids = tokenizer.encode("我喜欢小企鹅", return_tensors='pt')
tok_emb = torch.nn.Embedding(num_embeddings=tokenizer.vocab_size, embedding_dim=512)
pos_emb = torch.nn.Embedding(num_embeddings=512, embedding_dim=512)
output = tok_emb(input_ids) + pos_emb(torch.arange(input_ids.shape[1]))
print(output)

5. Attention Mechanism
Attention computes similarity between every pair of token vectors using scaled dot‑product:
import math
import torch

# inputs: token vectors of shape [seq_len, dim]
seq_len, dim = inputs.shape

# Naïve implementation (slow): three explicit loops
attn_scores = torch.empty(seq_len, seq_len)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        dot = 0.0
        for k in range(len(x_i)):
            dot += x_i[k] * x_j[k]
        attn_scores[i, j] = dot

# Efficient matrix version
attn_scores = inputs @ inputs.T
attn_weights = torch.softmax(attn_scores / math.sqrt(dim), dim=-1)
context = attn_weights @ inputs

Masking prevents tokens from attending to future positions during generation, and dropout can be applied to the attention weights to reduce over‑fitting.
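The causal mask can be added to the matrix version in two lines: build a lower‑triangular matrix and set every masked score to -inf before the softmax, so those positions receive exactly zero weight. The shapes below are illustrative:

```python
import math
import torch

torch.manual_seed(0)
seq_len, dim = 4, 8
inputs = torch.randn(seq_len, dim)

scores = inputs @ inputs.T / math.sqrt(dim)
# Causal mask: position i may attend only to positions <= i
mask = torch.tril(torch.ones(seq_len, seq_len))
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)

print(weights)  # upper triangle is zero; each row still sums to 1
```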
Multi‑Head Attention
Eight parallel heads capture different relational patterns. Each head linearly projects the input into queries, keys, and values, performs scaled dot‑product attention, concatenates the results, and applies a final linear projection.
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, emb_dim, num_heads, dropout, context_size):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = emb_dim // num_heads
        self.W_query = nn.Linear(emb_dim, emb_dim)
        self.W_key = nn.Linear(emb_dim, emb_dim)
        self.W_value = nn.Linear(emb_dim, emb_dim)
        self.out = nn.Linear(emb_dim, emb_dim)
        self.dropout = nn.Dropout(dropout)
        # Lower-triangular causal mask, stored as a non-trainable buffer
        self.register_buffer("mask", torch.tril(torch.ones(context_size, context_size)))

    def forward(self, x):
        B, T, D = x.shape
        # Project, then reshape to [B, num_heads, T, head_dim]
        Q = self.W_query(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        K = self.W_key(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        V = self.W_value(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.head_dim)
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float('-inf'))
        attn = torch.softmax(scores, dim=-1)
        attn = self.dropout(attn)
        # Concatenate heads back to [B, T, D] and apply the output projection
        context = (attn @ V).transpose(1, 2).contiguous().view(B, T, D)
        return self.out(context)

6. Transformer Block
Each block consists of a pre‑norm layer, multi‑head attention, residual connection, another pre‑norm, and a feed‑forward network (linear → GELU → linear). Residual links preserve gradients across many layers.
class TransformerBlock(nn.Module):
    def __init__(self, emb_dim, num_heads, dropout, context_size):
        super().__init__()
        self.attn = MultiHeadAttention(emb_dim, num_heads, dropout, context_size)
        self.ln1 = LayerNorm(emb_dim)
        self.ff = FeedForward(emb_dim, dropout)
        self.ln2 = LayerNorm(emb_dim)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # pre-norm attention + residual
        x = x + self.ff(self.ln2(x))    # pre-norm feed-forward + residual
        return x

Feed‑Forward Network
class FeedForward(nn.Module):
    def __init__(self, emb_dim, dropout):
        super().__init__()
        self.linear1 = nn.Linear(emb_dim, emb_dim * 4)
        self.gelu = nn.GELU()
        self.linear2 = nn.Linear(emb_dim * 4, emb_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # expand -> GELU -> project back, with dropout on the output
        return self.dropout(self.linear2(self.gelu(self.linear1(x))))

Layer Normalization
class LayerNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        var = x.var(-1, keepdim=True, unbiased=False)
        norm = (x - mean) / torch.sqrt(var + self.eps)
        return norm * self.scale + self.shift

7. Full Model Definition
class QdogBabyLearnConfig:
    def __init__(self):
        self.model_name = "qdogbabylearn"
        self.version = "1.0.0"
        self.num_hidden_layers = 16
        self.num_heads = 8
        self.emb_dim = 512
        self.dropout = 0.0
        self.context_size = 512
        self.vocab_size = 6400

class QdogBabyLearnLLM(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.emb_dim)
        self.pos_emb = nn.Embedding(cfg.context_size, cfg.emb_dim)
        self.dropout = nn.Dropout(cfg.dropout)
        self.blocks = nn.Sequential(*[
            TransformerBlock(cfg.emb_dim, cfg.num_heads, cfg.dropout, cfg.context_size)
            for _ in range(cfg.num_hidden_layers)
        ])
        self.norm = LayerNorm(cfg.emb_dim)
        self.out = nn.Linear(cfg.emb_dim, cfg.vocab_size)

    def forward(self, x):
        x = self.tok_emb(x) + self.pos_emb(torch.arange(x.shape[1], device=x.device))
        x = self.dropout(x)
        x = self.blocks(x)
        x = self.norm(x)
        return self.out(x)

8. Training & Inference
During inference the model receives token IDs, produces logits of shape [batch, seq_len, vocab_size], and the argmax token at each position is decoded back to text. Because the model is initially untrained the output is nonsensical.
# Inference example
inputs = tokenizer.encode("QQ浏览器广告后台开发", return_tensors='pt')
logits = model(inputs)
pred = torch.argmax(logits, dim=-1)
print(tokenizer.decode(pred[0]))

Training constructs input‑target pairs by shifting the token sequence by one position, computes cross‑entropy loss, and back‑propagates.
# Simple training step
data = torch.tensor([51,51,586,240,6262,1179,5046,799,2507,3158,1335])
input_ids = data[:-1].unsqueeze(0) # shape [1, seq_len]
target_ids = data[1:].unsqueeze(0)
logits = model(input_ids)
loss = nn.CrossEntropyLoss()(logits.view(-1, cfg.vocab_size), target_ids.view(-1))
loss.backward()

Supervised fine‑tuning (SFT) adds special tokens such as <|im_start|> to format dialogues and computes loss only on the assistant's response.
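Restricting the loss to the response is commonly done by setting the prompt positions of the target sequence to CrossEntropyLoss's `ignore_index`. A minimal sketch, with made‑up token IDs and an assumed prompt/response boundary:

```python
import torch
import torch.nn as nn

IGNORE = -100  # positions with this target ID contribute nothing to the loss

# Suppose positions 0-4 hold the formatted prompt (including special tokens
# such as <|im_start|>) and positions 5-8 hold the assistant's response.
target_ids = torch.tensor([[12, 87, 3, 55, 9, 401, 77, 23, 2]])
targets = target_ids.clone()
targets[:, :5] = IGNORE  # mask out the prompt tokens

logits = torch.randn(1, 9, 6400)  # stand-in for model output [batch, seq, vocab]
loss = nn.CrossEntropyLoss(ignore_index=IGNORE)(
    logits.view(-1, 6400), targets.view(-1))
print(loss.item())
```

Gradient updates then teach the model to produce the response given the prompt, rather than to reproduce the prompt itself.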
9. Sampling Strategies
To generate diverse text, logits can be scaled by a temperature before softmax (lower temperature → sharper distribution, higher temperature → flatter distribution). Top‑k sampling restricts selection to the k most probable tokens.
logits = torch.tensor([0.1145, 0.1245, 0.5130, 0.1887, 0.0694])
print(torch.softmax(logits, dim=-1))        # temperature 1 (default)
print(torch.softmax(logits / 0.5, dim=-1))  # temperature 0.5: sharper distribution
print(torch.softmax(logits / 0.1, dim=-1))  # temperature 0.1: sharper still, near-greedy
print(torch.topk(logits, k=4))              # restrict to the 4 highest-scoring tokens

10. Demonstration
An untrained model behaves like a pure continuation machine, while an SFT‑trained model can answer questions in a conversational style.
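Continuation itself is just an autoregressive loop: feed the sequence in, sample one new token with temperature and top‑k, append it, and repeat. A minimal sketch, assuming a model with the forward interface defined above (token IDs in, logits of shape [batch, seq_len, vocab_size] out):

```python
import torch

@torch.no_grad()
def generate(model, input_ids, max_new_tokens,
             temperature=1.0, top_k=50, context_size=512):
    for _ in range(max_new_tokens):
        # Only the last context_size tokens fit into the positional embedding
        logits = model(input_ids[:, -context_size:])[:, -1, :]  # last position
        logits = logits / temperature
        topv, topi = torch.topk(logits, k=top_k)     # keep the k best tokens
        probs = torch.softmax(topv, dim=-1)
        next_id = topi.gather(-1, torch.multinomial(probs, num_samples=1))
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids
```

With an untrained model this loop emits random tokens; after pretraining it continues the prompt, and after SFT it answers in the dialogue format it was tuned on.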
11. References
https://learning.oreilly.com/library/view/build-a-large/9781633437166/
https://cs336.stanford.edu/
https://jalammar.github.io/illustrated-transformer/
https://jalammar.github.io/illustrated-gpt2/
https://arxiv.org/abs/1706.03762
https://github.com/jingyaogong/minimind
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.