Build a Minimal Large Language Model from Scratch with Python and PyTorch
This tutorial builds a simple bigram language model in pure Python, refactors it into a PyTorch implementation, and explains the core concepts along the way: tokenization, embedding layers, loss functions, gradient descent, training loops, and text generation. It prepares you for building a full GPT model.
Introduction
Large language models (LLMs) are popular, but many articles discuss the concepts only superficially. This guide starts from zero: it implements a minimal yet complete language model in Python, then refactors it with PyTorch, laying the groundwork that makes the mechanisms of self‑attention and transformers concrete.
Scope and Requirements
Only basic Python knowledge is needed.
The model will generate Chinese poetry (the dataset contains Song and Southern Tang poems).
Mathematical or machine‑learning theory is not explained in depth; the focus is on a working implementation.
Dataset
The dataset ci.txt contains lines of classical poems. A vocabulary is built from all unique characters, resulting in vocab_size = 6418. Each character is treated as a token.
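To make the character-as-token idea concrete, here is a minimal sketch on a short sample string. The sample string is an illustrative assumption and is not quoted from ci.txt; the real vocabulary is built the same way from the whole file.

```python
# Hypothetical sample text; the real vocabulary is built from ci.txt.
sample = "春江潮水连海平，海上明月共潮生"

chars = sorted(set(sample))                    # unique characters in a stable order
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> character

def encode(s):
    """Map a string to a list of token ids, one per character."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of token ids back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("海上明月")
print(len(chars), ids, decode(ids))
```

Encoding followed by decoding returns the original string, and any character missing from the vocabulary raises a KeyError, which is why the vocabulary is built from the full corpus.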
Simple Bigram Model in Python
The first implementation counts character‑to‑character transitions and samples the next character proportionally to observed frequencies.
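Before the full script, the counting idea can be sketched on a tiny corpus. The corpus string and the helper name next_char are hypothetical, and this sketch swaps in a dictionary of counters for the dense vocab_size × vocab_size table that the script uses, only to keep the printed counts readable.

```python
import random
from collections import Counter, defaultdict

random.seed(0)  # fixed seed so repeated runs sample the same way

# Hypothetical toy corpus, chosen only to keep the counts small.
text = "abab ac"

# counts[a][b] = number of times character b follows character a
counts = defaultdict(Counter)
for cur, nxt in zip(text, text[1:]):
    counts[cur][nxt] += 1

def next_char(cur):
    """Sample a successor of `cur` in proportion to its observed counts."""
    followers = counts[cur]
    # random.choices normalizes the weights itself, so raw counts are enough.
    # A character that never had a successor (here 'c') would raise an error.
    return random.choices(list(followers), weights=list(followers.values()), k=1)[0]

print(dict(counts["a"]))  # successors of 'a' with their counts
print(next_char("a"))
```

Here 'a' is followed by 'b' twice and by 'c' once, so 'b' is sampled about two thirds of the time. The script that follows applies the same counting to every character pair in ci.txt.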
import random

random.seed(42)  # remove for random results
prompt = "春江"
max_new_token = 100

# Load the corpus and build a character-level vocabulary.
with open('ci.txt', 'r', encoding='utf-8') as f:
    text = f.read()
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> token id
itos = {i: ch for i, ch in enumerate(chars)}  # token id -> character
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

# transition[a][b] counts how often token b follows token a.
transition = [[0 for _ in range(vocab_size)] for _ in range(vocab_size)]
for i in range(len(text) - 1):
    cur = encode(text[i])[0]
    nxt = encode(text[i + 1])[0]
    transition[cur][nxt] += 1

# Generate one token at a time, weighted by the counts in the last token's row.
generated_token = encode(prompt)
for i in range(max_new_token - 1):
    cur = generated_token[-1]
    logits = transition[cur]
    total = sum(logits)
    probs = [logit / total for logit in logits]
    next_id = random.choices(range(vocab_size), weights=probs, k=1)[0]
    generated_token.append(next_id)
print(decode(generated_token))

Refactoring to a Machine‑Learning Style
The same logic is wrapped into classes that resemble PyTorch modules, adding batch handling and separate forward and generate methods.
import random

class Tokenizer:
    def __init__(self, text: str):
        self.chars = sorted(list(set(text)))
        self.vocab_size = len(self.chars)
        self.stoi = {ch: i for i, ch in enumerate(self.chars)}
        self.itos = {i: ch for i, ch in enumerate(self.chars)}

    def encode(self, s: str):
        return [self.stoi[c] for c in s]

    def decode(self, l):
        return ''.join([self.itos[i] for i in l])

class BigramLanguageModel:
    def __init__(self, vocab_size: int):
        self.vocab_size = vocab_size
        # transition[a][b] counts how often token b follows token a;
        # it must be filled with counts from the corpus before generating.
        self.transition = [[0 for _ in range(vocab_size)] for _ in range(vocab_size)]

    def __call__(self, idx):
        return self.forward(idx)

    def forward(self, idx):
        # idx holds B sequences of T token ids; the result has shape (B, T, vocab_size).
        B = len(idx)
        T = len(idx[0])
        logits = [[[0.0 for _ in range(self.vocab_size)] for _ in range(T)] for _ in range(B)]
        for b in range(B):
            for t in range(T):
                cur = idx[b][t]
                logits[b][t] = self.transition[cur]
        return logits

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits_batch = self(idx)
            for b, seq_logits in enumerate(logits_batch):
                logits = seq_logits[-1]  # counts for the last token of sequence b
                total = sum(logits)
                # A row with no counts falls back to uniform sampling.
                probs = [logit / total for logit in logits] if total > 0 else None
                next_id = random.choices(range(self.vocab_size), weights=probs, k=1)[0]
                idx[b].append(next_id)
        return idx

5‑Minute PyTorch Tutorial
A brief PyTorch example demonstrates tensor creation, basic arithmetic, and a linear regression model trained with SGD.
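To show what one SGD update does, the same regression can be written out by hand in plain Python. This is a sketch under stated assumptions: it starts from w = 0 and b = 0 (PyTorch initializes the layer randomly), it uses every point in each step, and it reuses the learning rate and step count of the PyTorch example.

```python
# Fit y = w*x + b to the points (1, 2), (2, 4), (3, 6) with plain gradient descent.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w, b = 0.0, 0.0  # assumed starting point; PyTorch would initialize randomly
lr = 0.01

for step in range(5000):
    n = len(xs)
    # Mean squared error: L = (1/n) * sum((w*x + b - y)^2)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    # Analytic gradients of L with respect to w and b
    grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(2 * e for e in errors) / n
    # Gradient descent update: move each parameter against its gradient
    w -= lr * grad_w
    b -= lr * grad_b

loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(f"w={w:.2f}, b={b:.2f}, loss={loss:.6f}")
```

In the PyTorch version, loss.backward() computes the two gradients automatically and optimizer.step() applies the update, so the formulas never have to be derived by hand.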
import torch
from torch import nn

torch.manual_seed(42)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# data: points on the line y = 2x, placed on the same device as the model
x = torch.tensor([1.0, 2.0, 3.0]).to(device)
y = torch.tensor([2.0, 4.0, 6.0]).to(device)

# model: a single linear layer computing y = w*x + b
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

model = SimpleNet().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# training loop
for epoch in range(5000):
    y_pred = model(x.unsqueeze(1))  # (3,) -> (3, 1): nn.Linear expects a feature dimension
    loss = criterion(y_pred, y.unsqueeze(1))
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()        # compute gradients of the loss
    optimizer.step()       # update w and b
    if (epoch + 1) % 100 == 0:
        w = model.linear.weight.item()
        b = model.linear.bias.item()
        print(f'Epoch [{epoch+1}/5000], Loss: {loss.item():.4f}, w: {w:.2f}, b: {b:.2f}')

Building a Real Bigram Model with PyTorch
The final model, BabyGPT, consists of an nn.Embedding layer (token → dense vector) and an nn.Linear head that projects back to vocabulary size. Training uses cross‑entropy loss and the AdamW optimizer.
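The cross‑entropy loss can also be written out by hand. The sketch below, in plain Python, computes it for a single position; the helper names softmax and cross_entropy are illustrative, and PyTorch's F.cross_entropy by default averages the same quantity over all positions in the batch.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target])

vocab_size = 6418              # vocabulary size quoted in the Dataset section
uniform = [0.0] * vocab_size   # a model that prefers no token over any other
baseline = cross_entropy(uniform, target=0)

confident = [0.0] * vocab_size
confident[0] = 10.0            # a large score on the correct token
print(baseline, math.log(vocab_size), cross_entropy(confident, target=0))
```

A model with no preference scores ln(vocab_size), about 8.77 for 6418 tokens, so a freshly initialized model should start near that value and training should push the loss below it.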
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tokenizer:
    def __init__(self, text):
        self.chars = sorted(list(set(text)))
        self.vocab_size = len(self.chars)
        self.stoi = {c: i for i, c in enumerate(self.chars)}
        self.itos = {i: c for i, c in enumerate(self.chars)}

    def encode(self, s):
        return [self.stoi[c] for c in s]

    def decode(self, ids):
        return ''.join([self.itos[i] for i in ids])

class BabyGPT(nn.Module):
    def __init__(self, vocab_size, n_embd):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        tok_emb = self.token_embedding_table(idx)  # (B, T, n_embd)
        logits = self.lm_head(tok_emb)             # (B, T, vocab_size)
        if targets is None:
            return logits, None
        # cross_entropy expects (N, C) logits and (N,) targets, so flatten B and T.
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            logits = logits[:, -1, :]                  # keep the last position: (B, vocab_size)
            probs = F.softmax(logits, dim=-1)          # scores -> probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # sample one token per sequence
            idx = torch.cat((idx, idx_next), dim=1)    # append it: (B, T+1)
        return idx

Model Details
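The embedding layer stores a matrix of shape (vocab_size, n_embd), and the linear head adds a weight matrix of the same size plus a bias vector of length vocab_size. With vocab_size=6418 and n_embd=32, the total parameter count is 6418*32 + 6418*32 + 6418 = 417,170, which occupies about 1.7 MB (1.6 MiB) of memory as 32‑bit floats.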
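The arithmetic can be checked with a few lines of plain Python. The function name bigram_param_count is hypothetical, and the memory estimate assumes 4 bytes per parameter (float32).

```python
def bigram_param_count(vocab_size, n_embd):
    """Count trainable parameters of an embedding table plus a linear head with bias."""
    embedding = vocab_size * n_embd    # nn.Embedding weight: (vocab_size, n_embd)
    head_weight = n_embd * vocab_size  # nn.Linear weight: (vocab_size, n_embd)
    head_bias = vocab_size             # nn.Linear bias: (vocab_size,)
    return embedding + head_weight + head_bias

total = bigram_param_count(vocab_size=6418, n_embd=32)
size_mb = total * 4 / 1e6              # assuming 4 bytes per float32 parameter
print(total, round(size_mb, 2))
```

Because both large terms scale with vocab_size * n_embd, doubling n_embd roughly doubles the model size, while the bias term stays fixed.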
Training and Evaluation
A training loop samples batches with get_batch, computes loss, back‑propagates, and updates parameters. Periodically the script prints training and validation loss as well as processing speed (tokens per second). Checkpointing can be added with torch.save.
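A minimal sketch of such a loop is shown here on a toy corpus. The corpus, the class name TinyBigram, and the hyperparameter values are assumptions chosen so the example runs in seconds on a CPU. Batch sampling is inlined, and validation, throughput reporting, and checkpointing are left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy corpus; the article trains on ci.txt instead.
text = "abcabcabcabcabcabcabcabc" * 8
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

class TinyBigram(nn.Module):
    """Same layout as BabyGPT: an embedding table followed by a linear head."""
    def __init__(self, vocab_size, n_embd):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets):
        logits = self.head(self.emb(idx))  # (B, T, vocab_size)
        B, T, C = logits.shape
        return F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

batch_size, block_size, max_iters = 8, 4, 200  # illustrative values
model = TinyBigram(len(chars), n_embd=8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

losses = []
for step in range(max_iters):
    # Sample random windows; targets are the inputs shifted by one position.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    xb = torch.stack([data[i:i + block_size] for i in ix])
    yb = torch.stack([data[i + 1:i + block_size + 1] for i in ix])

    loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)  # clear old gradients
    loss.backward()                        # back-propagate
    optimizer.step()                       # update parameters
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The helpers that follow factor the batch sampling into get_batch and average the loss over several batches in estimate_loss, which gives a steadier signal than the single-batch loss recorded here.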
def get_batch(data, batch_size, block_size):
    # data is a 1-D tensor of token ids; y is x shifted one position to the right.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)

@torch.no_grad()
def estimate_loss(model, data, batch_size, block_size, eval_iters):
    # data is a dict holding the 'train' and 'val' token tensors.
    model.eval()
    out = {}
    for split in ['train', 'val']:
        losses = []
        for _ in range(eval_iters):
            xb, yb = get_batch(data[split], batch_size, block_size)
            _, loss = model(xb, yb)
            losses.append(loss.item())
        out[split] = sum(losses) / len(losses)  # mean loss over eval_iters batches
    model.train()
    return out

Next Steps
The next article will extend this foundation by adding a self‑attention mechanism, turning the bigram model into a full GPT architecture.
References
karpathy/nanoGPT – https://github.com/karpathy/nanoGPT
simpx/buildyourownllm – https://github.com/simpx/buildyourownllm
《深度学习入门 基于Python的理论与实现》 (Introduction to Deep Learning: Theory and Implementation with Python)
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
