Build a Mini LLM from Scratch: Step‑by‑Step Guide to Tokenizer, Attention, and Transformer

This article walks through constructing a small large‑language model from the ground up, covering model architecture, tokenization methods, BPE vocabulary building, embedding, positional encoding, attention mechanisms, multi‑head attention, transformer blocks, training pipelines, inference, and sampling strategies, all with runnable Python code.

Tencent Technical Engineering

1. Model Overview

The implementation follows the GPT‑2 architecture: an embedding layer, positional embeddings, a stack of transformer blocks, and a final linear head that projects back to the vocabulary.

HelloWorld Example

A minimal class QdogBaby demonstrates a rule‑based replacement, illustrating why a real LLM must predict token probabilities instead of performing fixed string substitutions.

class QdogBaby:
    def chat(self, text):
        # Fixed rule: turn a question ending in '吗?' (the Chinese question
        # particle, roughly "...?") into an exclamation
        if text.endswith('吗?'):
            return text.replace('吗?', '!')
        return text

model = QdogBaby()
print(model.chat('会说话吗?'))      # "Can you talk?"      -> '会说话!'
print(model.chat('是人工智能吗?'))  # "Are you an AI?"     -> '是人工智能!'

2. Tokenizer

Tokenization converts raw text into discrete units (tokens) that the model can process. Three common strategies, sketched in code after the list, are:

Character‑level: splits into individual characters (e.g., [T, o, d, a, y, …]). Small vocabulary but long sequences.

Word‑level: splits on whitespace (e.g., [Today, is, sunday, .]). Shorter sequences but a large vocabulary and out‑of‑vocabulary (OOV) issues.

Subword‑level: balances the two by keeping frequent words whole and breaking rare words into meaningful fragments (e.g., [To, day, is, sun, day, .]). This is the dominant choice for modern LLMs.
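A rough sketch of the three strategies (the subword split below is hand‑written for illustration; real subword vocabularies are learned from data, as the BPE section shows next):

text = "Today is sunday."

# Character-level: every character (including spaces) becomes a token
char_tokens = list(text)            # ['T', 'o', 'd', 'a', 'y', ' ', 'i', ...]

# Word-level: split on whitespace
word_tokens = text.split()          # ['Today', 'is', 'sunday.']

# Subword-level (illustrative only; real splits come from learned merges)
subword_tokens = ['To', 'day', 'is', 'sun', 'day', '.']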

BPE Algorithm

Byte‑Pair Encoding iteratively merges the most frequent adjacent symbol pairs until a target vocabulary size is reached.

# Initial vocabulary
V = {a, b, e, g, l, n, p, r, s, </w>}
# First merge (a,p) → ap
V = {ap, a, b, e, g, l, n, p, r, s, </w>}
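The merge loop itself fits in a few lines. Below is a minimal sketch of classic BPE training; the toy word frequencies and the merge budget are made up for illustration:

import collections
import re

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across all words, weighted by word frequency
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every standalone occurrence of the pair with the merged symbol
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker
vocab = {'a p p l e </w>': 5, 'a p p s </w>': 3}
for _ in range(5):                      # the merge budget sets the final vocab size
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)    # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best, '->', ''.join(best))    # first merge here: ('a', 'p') -> ap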

3. Encoding

After tokenization, each token receives a unique integer ID from a lookup table. For example, "today is sunday" tokenizes into the subwords [to, day, is, sun, day], which become [1024, 2046, 1025, 2047, 2046]; both occurrences of "day" map to the same ID, 2046. This numeric sequence is what the model actually consumes.
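A lookup table is simply a token‑to‑ID dictionary. A minimal sketch using the IDs from the example above (the table entries are illustrative):

# Hypothetical token-to-ID table matching the example IDs above
token_to_id = {'to': 1024, 'is': 1025, 'day': 2046, 'sun': 2047}
tokens = ['to', 'day', 'is', 'sun', 'day']
ids = [token_to_id[t] for t in tokens]
print(ids)   # [1024, 2046, 1025, 2047, 2046]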

4. Embedding & Positional Encoding

Token IDs are lifted into high‑dimensional vectors via an embedding matrix (e.g., vocab size 6400, dimension 512). Positional embeddings are added so the model can distinguish different word orders.

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./tokenizer")
input_ids = tokenizer.encode("我喜欢小企鹅", return_tensors='pt')  # "I like little penguins"

# Token embedding: one learned 512-dim vector per vocabulary entry
tok_emb = torch.nn.Embedding(num_embeddings=tokenizer.vocab_size, embedding_dim=512)
# Positional embedding: one learned vector per position (up to 512 positions)
pos_emb = torch.nn.Embedding(num_embeddings=512, embedding_dim=512)

# Sum token and positional embeddings elementwise
output = tok_emb(input_ids) + pos_emb(torch.arange(input_ids.shape[1]))
print(output)

5. Attention Mechanism

Attention computes similarity between every pair of token vectors using scaled dot‑product:

import math
import torch

# Toy input: 6 token vectors of dimension 4 (sizes are arbitrary)
seq_len, dim = 6, 4
inputs = torch.randn(seq_len, dim)

# Naive implementation (slow): explicit loops over every pair of tokens
attn_scores = torch.empty(seq_len, seq_len)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        dot = 0.0
        for k in range(len(x_i)):
            dot += x_i[k] * x_j[k]
        attn_scores[i, j] = dot

# Efficient matrix version: one matrix multiplication
attn_scores = inputs @ inputs.T
attn_weights = torch.softmax(attn_scores / math.sqrt(dim), dim=-1)
context = attn_weights @ inputs

Masking prevents tokens from attending to future positions during generation, and dropout can be applied to the attention weights to reduce over‑fitting.
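A small standalone demonstration of the causal mask (the matrix size is arbitrary): masked scores become -inf before softmax and therefore receive exactly zero attention weight:

import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)
mask = torch.tril(torch.ones(seq_len, seq_len))          # lower triangle: 1 = allowed
masked = scores.masked_fill(mask == 0, float('-inf'))    # block future positions
weights = torch.softmax(masked, dim=-1)
print(weights)   # the upper triangle is exactly 0: no attention to the future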

Multi‑Head Attention

Eight parallel heads capture different relational patterns. Each head linearly projects the input into queries, keys, and values, performs scaled dot‑product attention, concatenates the results, and applies a final linear projection.

import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, emb_dim, num_heads, dropout, context_size):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = emb_dim // num_heads
        self.W_query = nn.Linear(emb_dim, emb_dim)
        self.W_key   = nn.Linear(emb_dim, emb_dim)
        self.W_value = nn.Linear(emb_dim, emb_dim)
        self.out     = nn.Linear(emb_dim, emb_dim)
        self.dropout = nn.Dropout(dropout)
        # Lower-triangular causal mask: position i may attend only to j <= i
        self.register_buffer("mask", torch.tril(torch.ones(context_size, context_size)))

    def forward(self, x):
        B, T, D = x.shape
        # Project to Q/K/V and split into heads: [B, T, D] -> [B, heads, T, head_dim]
        Q = self.W_query(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        K = self.W_key(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        V = self.W_value(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention with the causal mask
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.head_dim)
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float('-inf'))
        attn = torch.softmax(scores, dim=-1)
        attn = self.dropout(attn)
        # Merge heads back: [B, heads, T, head_dim] -> [B, T, D]
        context = (attn @ V).transpose(1, 2).contiguous().view(B, T, D)
        return self.out(context)
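A quick shape check (batch size and sequence length are arbitrary):

mha = MultiHeadAttention(emb_dim=512, num_heads=8, dropout=0.1, context_size=512)
x = torch.randn(2, 16, 512)      # [batch, seq_len, emb_dim]
print(mha(x).shape)              # torch.Size([2, 16, 512]): shape is preserved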

6. Transformer Block

Each block consists of a pre‑norm layer, multi‑head attention, residual connection, another pre‑norm, and a feed‑forward network (linear → GELU → linear). Residual links preserve gradients across many layers.

class TransformerBlock(nn.Module):
    def __init__(self, emb_dim, num_heads, dropout, context_size):
        super().__init__()
        self.attn = MultiHeadAttention(emb_dim, num_heads, dropout, context_size)
        self.ln1  = LayerNorm(emb_dim)
        self.ff   = FeedForward(emb_dim, dropout)
        self.ln2  = LayerNorm(emb_dim)

    def forward(self, x):
        # Pre-norm: normalize before each sublayer, then add the residual
        x = x + self.attn(self.ln1(x))
        x = x + self.ff(self.ln2(x))
        return x

Feed‑Forward Network

class FeedForward(nn.Module):
    def __init__(self, emb_dim, dropout):
        super().__init__()
        self.linear1 = nn.Linear(emb_dim, emb_dim * 4)   # expand to 4x width
        self.gelu    = nn.GELU()
        self.linear2 = nn.Linear(emb_dim * 4, emb_dim)   # project back
        self.dropout = nn.Dropout(dropout)               # previously unused; now applied

    def forward(self, x):
        return self.dropout(self.linear2(self.gelu(self.linear1(x))))

Layer Normalization

class LayerNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        # Normalize each token vector to zero mean and unit variance,
        # then apply a learned scale and shift
        mean = x.mean(-1, keepdim=True)
        var  = x.var(-1, keepdim=True, unbiased=False)
        norm = (x - mean) / torch.sqrt(var + self.eps)
        return norm * self.scale + self.shift
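A quick sanity check (tensor sizes are arbitrary): each token vector comes out with roughly zero mean and unit variance:

x = torch.randn(2, 5, 512)
ln = LayerNorm(512)
y = ln(x)
print(y.mean(-1).abs().max())              # close to 0
print(y.var(-1, unbiased=False).mean())    # close to 1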

7. Full Model Definition

class QdogBabyLearnConfig:
    def __init__(self):
        self.model_name = "qdogbabylearn"
        self.version = "1.0.0"
        self.num_hidden_layers = 16
        self.num_heads = 8
        self.emb_dim = 512
        self.dropout = 0.0
        self.context_size = 512
        self.vocab_size = 6400

class QdogBabyLearnLLM(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.emb_dim)
        self.pos_emb = nn.Embedding(cfg.context_size, cfg.emb_dim)
        self.dropout = nn.Dropout(cfg.dropout)
        self.blocks = nn.Sequential(*[
            TransformerBlock(cfg.emb_dim, cfg.num_heads, cfg.dropout, cfg.context_size)
            for _ in range(cfg.num_hidden_layers)
        ])
        self.norm = LayerNorm(cfg.emb_dim)
        self.out = nn.Linear(cfg.emb_dim, cfg.vocab_size)  # project back to the vocabulary

    def forward(self, x):
        # Token + positional embeddings, then the transformer stack
        x = self.tok_emb(x) + self.pos_emb(torch.arange(x.shape[1], device=x.device))
        x = self.dropout(x)
        x = self.blocks(x)
        x = self.norm(x)
        return self.out(x)  # logits of shape [batch, seq_len, vocab_size]
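Instantiating the model gives a quick parameter count (the exact number depends on the config above):

cfg = QdogBabyLearnConfig()
model = QdogBabyLearnLLM(cfg)
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")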

8. Training & Inference

During inference the model receives token IDs, produces logits of shape [batch, seq_len, vocab_size], and the argmax token at each position is decoded back to text. Because the model is initially untrained, the output is nonsensical.

# Inference example ("QQ浏览器广告后台开发" means "QQ Browser ad backend development")
inputs = tokenizer.encode("QQ浏览器广告后台开发", return_tensors='pt')
with torch.no_grad():
    logits = model(inputs)               # [batch, seq_len, vocab_size]
pred = torch.argmax(logits, dim=-1)      # most likely token at each position
print(tokenizer.decode(pred[0]))
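Note that actual text generation is autoregressive: only the last position's prediction is appended to the sequence, and the model is called again. A minimal greedy loop, sketched here with an arbitrary prompt and token budget:

ids = tokenizer.encode("QQ浏览器", return_tensors='pt')
with torch.no_grad():
    for _ in range(20):                                  # generate 20 new tokens
        logits = model(ids[:, -cfg.context_size:])       # crop to the context window
        next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)           # append and feed back in
print(tokenizer.decode(ids[0]))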

Training constructs input‑target pairs by shifting the token sequence by one position, computes cross‑entropy loss, and back‑propagates.

# Simple training step: inputs are the sequence, targets the same sequence
# shifted left by one (the optimizer choice and learning rate are illustrative)
data = torch.tensor([51, 51, 586, 240, 6262, 1179, 5046, 799, 2507, 3158, 1335])
input_ids = data[:-1].unsqueeze(0)    # shape [1, seq_len]
target_ids = data[1:].unsqueeze(0)    # shape [1, seq_len], shifted by one

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
logits = model(input_ids)             # [1, seq_len, vocab_size]
loss = nn.CrossEntropyLoss()(logits.view(-1, cfg.vocab_size), target_ids.view(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()

Supervised fine‑tuning (SFT) adds special tokens such as <|im_start|> to format dialogues and computes loss only on the assistant’s response.
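A sketch of how that masking is typically done (the token IDs and prompt length below are invented for illustration): labels outside the response are set to -100, which nn.CrossEntropyLoss ignores by default:

import torch
import torch.nn as nn

# Dialogue tokens: [prompt tokens][response tokens]; IDs here are invented
input_ids = torch.tensor([[10, 11, 12, 13, 20, 21, 22, 23]])
labels = input_ids.clone()
labels[:, :4] = -100                     # mask out the 4 prompt tokens

logits = model(input_ids)                # model from section 7
loss = nn.CrossEntropyLoss()(            # ignore_index defaults to -100
    logits[:, :-1].reshape(-1, logits.size(-1)),   # predictions at positions 0..n-2
    labels[:, 1:].reshape(-1),                     # targets are the next tokens
)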

9. Sampling Strategies

To generate diverse text, logits can be scaled by a temperature before softmax (lower temperature → sharper distribution, higher temperature → flatter distribution). Top‑k sampling restricts selection to the k most probable tokens.

logits = torch.tensor([0.1145, 0.1245, 0.5130, 0.1887, 0.0694])
print(torch.softmax(logits, dim=-1))          # temperature 1.0 (default)
print(torch.softmax(logits / 0.5, dim=-1))    # temperature 0.5: sharper
print(torch.softmax(logits / 0.1, dim=-1))    # temperature 0.1: near-deterministic
print(torch.topk(logits, k=4))                # the 4 largest logits and their indices
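Combining temperature and top‑k in one helper, sampling from the filtered distribution with torch.multinomial (the function name and defaults are our own sketch, not a library API):

import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample one token ID from a 1-D logits vector."""
    if top_k is not None:
        kth = torch.topk(logits, k=top_k).values[-1]
        # Drop everything below the k-th largest logit
        logits = logits.masked_fill(logits < kth, float('-inf'))
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

logits = torch.tensor([0.1145, 0.1245, 0.5130, 0.1887, 0.0694])
print(sample_next_token(logits, temperature=0.8, top_k=3))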

10. Demonstration

A pretrained‑only (base) model behaves like a pure continuation machine, while an SFT‑trained model can answer questions in a conversational style.

11. References

https://learning.oreilly.com/library/view/build-a-large/9781633437166/

https://cs336.stanford.edu/

https://jalammar.github.io/illustrated-transformer/

https://jalammar.github.io/illustrated-gpt2/

https://arxiv.org/abs/1706.03762

https://github.com/jingyaogong/minimind
