Build a Minimal Large Language Model from Scratch with Python and PyTorch

This tutorial walks through creating a simple bigram language model in pure Python and refactoring it into a PyTorch implementation. Along the way it explains core concepts such as tokenization, embedding layers, loss functions, gradient descent, training loops, and text generation, preparing you to build a full GPT model.

Alibaba Cloud Developer

Introduction

Large language models (LLMs) are popular, but many articles discuss the concepts only superficially. This guide starts from zero, implements a minimal yet complete language model in plain Python, and then refactors it with PyTorch. The result is the foundation on which the follow‑up article builds self‑attention and the Transformer.

Scope and Requirements

Only basic Python knowledge is needed.

The model will generate Chinese poetry (the dataset contains Song and Southern Tang poems).

Mathematical or machine‑learning theory is not explained in depth; the focus is on a working implementation.

Dataset

The dataset ci.txt contains lines of classical poems. A vocabulary is built from all unique characters, resulting in vocab_size = 6418. Each character is treated as a token.
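
As a small illustration of character-level tokenization, the sketch below builds a vocabulary from a short sample string rather than from ci.txt. The sample text and variable names are assumptions made for the example.

```python
# Character-level tokenization on a tiny sample (stand-in for ci.txt).
sample = "春江花月夜，江月照人"

chars = sorted(set(sample))                    # unique characters, in a stable order
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> character

def encode(s):
    """Map a string to a list of token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of token ids back to a string."""
    return "".join(itos[i] for i in ids)

vocab_size = len(chars)
ids = encode("江月")
round_trip = decode(ids)
print(vocab_size, ids, round_trip)
```

Encoding followed by decoding returns the original string, which is the only property the rest of the tutorial relies on.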

Simple Bigram Model in Python

The first implementation counts character‑to‑character transitions and samples the next character proportionally to observed frequencies.

import random
random.seed(42)  # remove for random results
prompt = "春江"
max_new_token = 100
with open('ci.txt', 'r', encoding='utf-8') as f:
    text = f.read()
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for i,ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
transition = [[0 for _ in range(vocab_size)] for _ in range(vocab_size)]
for i in range(len(text)-1):
    cur = encode(text[i])[0]
    nxt = encode(text[i+1])[0]
    transition[cur][nxt] += 1
generated_token = encode(prompt)
for i in range(max_new_token-1):
    cur = generated_token[-1]
    logits = transition[cur]
    total = sum(logits)
    probs = [logit/total for logit in logits]
    next_id = random.choices(range(vocab_size), weights=probs, k=1)[0]
    generated_token.append(next_id)
print(decode(generated_token))
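
To make the counting step concrete, here is a minimal sketch on a toy string instead of ci.txt. It uses a dictionary of `Counter` objects rather than the dense `vocab_size × vocab_size` table above, an assumption made to keep the example short; the helper name `next_char_probs` is hypothetical.

```python
from collections import Counter, defaultdict

text = "abab ac"  # toy corpus, stand-in for ci.txt

# counts[cur][nxt] = how often character nxt follows character cur
counts = defaultdict(Counter)
for cur, nxt in zip(text, text[1:]):
    counts[cur][nxt] += 1

def next_char_probs(cur):
    """Normalize the counts for `cur` into a probability distribution."""
    total = sum(counts[cur].values())
    if total == 0:          # character never seen with a successor
        return {}
    return {ch: n / total for ch, n in counts[cur].items()}

probs_a = next_char_probs("a")
print(probs_a)
```

Here "a" is followed by "b" twice and by "c" once, so sampling after "a" picks "b" two times out of three. The empty result for a character with no observed successor marks the case where dividing by the row total would fail.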

Refactoring to a Machine‑Learning Style

The same logic is wrapped into classes that resemble PyTorch modules, adding batch handling and separate forward and generate methods.

import random

class Tokenizer:
    def __init__(self, text: str):
        self.chars = sorted(list(set(text)))
        self.vocab_size = len(self.chars)
        self.stoi = {ch:i for i,ch in enumerate(self.chars)}
        self.itos = {i:ch for i,ch in enumerate(self.chars)}
    def encode(self, s: str):
        return [self.stoi[c] for c in s]
    def decode(self, l):
        return ''.join([self.itos[i] for i in l])

class BigramLanguageModel:
    def __init__(self, vocab_size: int):
        self.vocab_size = vocab_size
        self.transition = [[0 for _ in range(vocab_size)] for _ in range(vocab_size)]
    def __call__(self, idx):
        return self.forward(idx)
    def forward(self, idx):
        B = len(idx)
        T = len(idx[0])
        logits = [[[0.0 for _ in range(self.vocab_size)] for _ in range(T)] for _ in range(B)]
        for b in range(B):
            for t in range(T):
                cur = idx[b][t]
                logits[b][t] = self.transition[cur]
        return logits
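    def generate(self, idx, max_new_tokens):
        # idx is a batch (list) of token-id lists; every sequence grows by one token per step
        for _ in range(max_new_tokens):
            logits_batch = self(idx)
            for b, logits in enumerate(logits_batch):
                last = logits[-1]  # transition counts for the last token of sequence b
                total = max(sum(last), 1)
                probs = [count/total for count in last]
                next_id = random.choices(range(self.vocab_size), weights=probs, k=1)[0]
                idx[b].append(next_id)
        return idx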
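
The classes above do not show how `transition` gets filled or how the model is called, so the following is a hedged, self-contained sketch of how the pieces might be wired together. It is a compact re-implementation on a toy string; the `CountBigram` class and its `fit` method are hypothetical names introduced here, not part of the original code.

```python
import random

class CountBigram:
    """Compact count-based bigram model over nested Python lists."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.transition = [[0] * vocab_size for _ in range(vocab_size)]

    def fit(self, ids):
        # Hypothetical training step: count each (current, next) pair.
        for cur, nxt in zip(ids, ids[1:]):
            self.transition[cur][nxt] += 1

    def forward(self, idx):
        # idx: B sequences of T ids -> B x T rows of vocab_size counts
        return [[self.transition[tok] for tok in seq] for seq in idx]

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            for b, rows in enumerate(self.forward(idx)):
                last = rows[-1]                  # counts after the last token
                if sum(last) == 0:               # no observed successor
                    next_id = random.randrange(self.vocab_size)
                else:
                    next_id = random.choices(range(self.vocab_size), weights=last, k=1)[0]
                idx[b].append(next_id)
        return idx

random.seed(0)
text = "abcabcabc"                               # toy corpus
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
model = CountBigram(len(chars))
model.fit([stoi[c] for c in text])

batch = [[stoi["a"]], [stoi["b"]]]               # B = 2 prompts, T = 1 each
out = model.generate(batch, max_new_tokens=4)
decoded = ["".join(chars[i] for i in seq) for seq in out]
print(decoded)
```

Because every character in this toy corpus has exactly one observed successor, generation is deterministic and simply continues the "abc" cycle from each prompt.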

5‑Minute PyTorch Tutorial

A brief PyTorch example demonstrates tensor creation, basic arithmetic, and a linear regression model trained with SGD.

import torch
from torch import nn
torch.manual_seed(42)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# data
x = torch.tensor([1.0, 2.0, 3.0], device=device)  # keep data on the same device as the model
y = torch.tensor([2.0, 4.0, 6.0], device=device)
# model
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)
    def forward(self, x):
        return self.linear(x)
model = SimpleNet().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
# training loop
for epoch in range(5000):
    y_pred = model(x.unsqueeze(1))
    loss = criterion(y_pred, y.unsqueeze(1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (epoch+1) % 100 == 0:
        w = model.linear.weight.item()
        b = model.linear.bias.item()
        print(f'Epoch [{epoch+1}/5000], Loss: {loss.item():.4f}, w: {w:.2f}, b: {b:.2f}')
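
The regression example above goes straight to training, so here is a minimal sketch of the tensor creation, arithmetic, and automatic differentiation it relies on. The values are arbitrary and chosen only for illustration.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])        # 1-D tensor from a Python list
b = torch.ones(3)                        # tensor filled with ones

total = a + b                            # element-wise addition
scaled = a * 2                           # a scalar is broadcast over the tensor
dot = torch.dot(a, b)                    # inner product, a 0-D tensor
column = a.unsqueeze(1)                  # shape (3,) -> (3, 1), the layout nn.Linear(1, 1) expects

# Autograd: the gradient of sum(w * a) with respect to w equals a
w = torch.zeros(3, requires_grad=True)
loss = (w * a).sum()
loss.backward()

print(total, scaled, dot.item(), column.shape, w.grad)
```

The `unsqueeze(1)` call is the same reshaping used in the training loop, and `loss.backward()` fills `w.grad` in the same way it fills the gradients of `model.linear` there.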

Building a Real Bigram Model with PyTorch

The final model, BabyGPT, consists of an nn.Embedding layer (token → dense vector) and an nn.Linear head that projects back to vocabulary size. Training uses cross‑entropy loss and the AdamW optimizer.

import torch, torch.nn as nn, torch.nn.functional as F
class Tokenizer:
    def __init__(self, text):
        self.chars = sorted(list(set(text)))
        self.vocab_size = len(self.chars)
        self.stoi = {c:i for i,c in enumerate(self.chars)}
        self.itos = {i:c for i,c in enumerate(self.chars)}
    def encode(self, s):
        return [self.stoi[c] for c in s]
    def decode(self, ids):
        return ''.join([self.itos[i] for i in ids])

class BabyGPT(nn.Module):
    def __init__(self, vocab_size, n_embd):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)
    def forward(self, idx, targets=None):
        tok_emb = self.token_embedding_table(idx)          # (B,T,n_embd)
        logits = self.lm_head(tok_emb)                     # (B,T,vocab_size)
        if targets is None:
            return logits, None
        B,T,C = logits.shape
        loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
        return logits, loss
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            logits = logits[:, -1, :]                     # (B, vocab_size)
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
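
The cross-entropy loss used in `forward` can be worked through by hand. The sketch below is a plain-Python illustration for a single position, not the PyTorch implementation; the function names are assumptions for the example.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct token id."""
    return -math.log(softmax(logits)[target])

# A confident, correct prediction gives a small loss ...
good = cross_entropy([5.0, 0.0, 0.0], target=0)
# ... while a confident, wrong one gives a large loss.
bad = cross_entropy([5.0, 0.0, 0.0], target=1)

# Uniform logits over the whole vocabulary: loss equals ln(vocab_size).
vocab_size = 6418
uniform = cross_entropy([0.0] * vocab_size, target=123)
print(good, bad, uniform)
```

For this vocabulary, ln(6418) is roughly 8.77, so a freshly initialized model should start in the neighborhood of that value. A first-step loss far from it is a reasonable hint to check the data pipeline or the loss computation.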

Model Details

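The embedding layer stores a matrix of shape (vocab_size, n_embd), and the linear head stores a weight matrix of the same size plus a bias of length vocab_size. With vocab_size=6418 and n_embd=32, the total parameter count is 6418*32 + 32*6418 + 6418 = 417,170, which occupies about 1.6 MiB of memory as 32‑bit floats.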
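
The parameter count can be checked with a few lines of arithmetic. This sketch assumes float32 storage (4 bytes per parameter) and the two layers of BabyGPT as defined above; the variable names are illustrative.

```python
vocab_size, n_embd = 6418, 32

embedding_params = vocab_size * n_embd      # nn.Embedding weight: (vocab_size, n_embd)
head_weight_params = n_embd * vocab_size    # nn.Linear weight: (vocab_size, n_embd)
head_bias_params = vocab_size               # nn.Linear bias: (vocab_size,)

total_params = embedding_params + head_weight_params + head_bias_params
size_mib = total_params * 4 / 1024**2       # 4 bytes per float32 parameter

print(total_params, round(size_mib, 2))
```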

Training and Evaluation

A training loop samples batches with get_batch, computes loss, back‑propagates, and updates parameters. Periodically the script prints training and validation loss as well as processing speed (tokens per second). Checkpointing can be added with torch.save.

def get_batch(data, batch_size, block_size):
    ix = torch.randint(len(data)-block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)

@torch.no_grad()
def estimate_loss(model, data, batch_size, block_size, eval_iters):
    # data is a dict with 'train' and 'val' entries, each a 1-D tensor of token ids
    model.eval()
    out = {}
    for split in ['train','val']:
        losses = []
        for _ in range(eval_iters):
            xb, yb = get_batch(data[split], batch_size, block_size)
            _, loss = model(xb, yb)
            losses.append(loss.item())
        out[split] = sum(losses)/len(losses)
    model.train()
    return out
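
The training loop itself is described but not listed, so the following is a hedged, self-contained sketch of what it might look like. It trains a compact embedding-plus-linear model on a toy string rather than ci.txt; the hyperparameters, the `TinyBigram` class, the 90/10 split, and the checkpoint filename are all assumptions for illustration.

```python
import os
import tempfile
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

text = "abcd" * 200                                   # toy corpus, stand-in for ci.txt
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)
n = int(0.9 * len(data))                              # assumed 90/10 train/val split
splits = {'train': data[:n], 'val': data[n:]}

class TinyBigram(nn.Module):
    """Embedding + linear head, the same kind of model as described above."""
    def __init__(self, vocab_size, n_embd):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_embd)
        self.head = nn.Linear(n_embd, vocab_size)
    def forward(self, idx, targets):
        logits = self.head(self.emb(idx))             # (B, T, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

def get_batch(data, batch_size, block_size):
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)

model = TinyBigram(len(chars), n_embd=8).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

losses = []
for step in range(300):
    xb, yb = get_batch(splits['train'], batch_size=16, block_size=8)
    _, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)             # clear old gradients
    loss.backward()                                   # back-propagate
    optimizer.step()                                  # update parameters
    losses.append(loss.item())
    if (step + 1) % 100 == 0:
        print(f"step {step + 1}: train loss {loss.item():.4f}")

# One validation batch, evaluated without tracking gradients.
model.eval()
with torch.no_grad():
    xv, yv = get_batch(splits['val'], batch_size=16, block_size=8)
    _, val_loss = model(xv, yv)
model.train()

# Save a checkpoint of the learned weights (the path is a placeholder).
ckpt_path = os.path.join(tempfile.gettempdir(), "tiny_bigram.pt")
torch.save(model.state_dict(), ckpt_path)
```

On the real dataset the same loop would use the ci.txt token tensor, a larger `n_embd`, and the `estimate_loss` helper above in place of the single validation batch.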

Next Steps

The next article will extend this foundation by adding a self‑attention mechanism, turning the bigram model into a full GPT architecture.

References

karpathy/nanoGPT – https://github.com/karpathy/nanoGPT

simpx/buildyourownllm – https://github.com/simpx/buildyourownllm

Introduction to Deep Learning: Theory and Implementation with Python (《深度学习入门 基于Python的理论与实现》)
