Build a Large Language Model from Scratch with PyTorch—No Libraries, No Shortcuts

This guide walks you through building, training, and fine‑tuning a Transformer‑based large language model entirely from scratch using PyTorch, covering tokenization, self‑attention, multi‑head attention, positional encoding, model architecture, data preparation, training loops, and fine‑tuning on custom lyrics.


Introduction

This is a concise, no-fluff guide that teaches you how to construct, train, and fine-tune a Transformer architecture from the ground up. Recent releases such as OpenAI's GPT-OSS models invite reflection on the progress made since the seminal 2017 paper Attention Is All You Need, which introduced the Transformer and quickly led to GPT-1 in 2018.

Over the past eight years, large language models (LLMs) have advanced dramatically, gaining multimodal abilities, sophisticated reasoning, and architectural improvements, yet they all still rely on the core Transformer framework. Because modern LLMs can be accessed via user‑friendly APIs, many developers overlook the elegance of the underlying design.

Tokenizer

Any input text is broken down by the LLM's tokenizer into smaller units called tokens, which can range from a single character to an entire word.

Example text: Hold my math!

Word-level tokenization: ["Hold", "my", "math", "!"]
Sub-word tokenization: ["Hold", "my", "ma", "th", "!"]
Character-level tokenization: ["H", "o", "l", "d", " ", "m", "y", " ", "m", "a", "t", "h", "!"]

Next Token Predictor

An LLM is fundamentally a next-token predictor. Given a sequence of input tokens, the model learns to predict a probability distribution over the next token.

In practice, the model processes a fixed number of tokens at each step and generates one token per iteration. Generation continues by sliding the input window forward until an end-of-sequence (EOS) token is produced or a length limit is reached.

Typical API usage might look like this:

messages = [
    {"role": "system", "content": "You are a creative storyteller."},
    {"role": "user", "content": "Write a creative story"},
]

After the library applies the chat template, the actual string sent to the model looks like this:

"
<|im_start|>system
You are a creative storyteller.<|im_end|>
<|im_start|>user
Write a creative story<|im_end|>
<|im_start|>assistant
"

The chat template turns the structured messages into a single token stream in the format the model was trained on; instruction tuning is what aligns the model's behaviour with user commands expressed in that format.

Attention Is All You Need

How does an LLM decide what to generate? An untrained network produces essentially random text; meaningful output requires it to leverage the full context, not just the last token.

Consider the input The cat chased the. If the model only looks at the final token "the", it might predict anything such as "banana", "man", or "moon". Using the full context ["The", "cat", "chased", "the"] leads the model to assign a higher probability to "mouse".

This principle applies to tasks like translation as well.

English: I eat a red apple.
French: Je mange une pomme rouge.

Aligning the two sentences shows that in French the adjective (rouge) follows the noun (pomme), whereas English puts the adjective first. Word-by-word translation is therefore insufficient; the model must understand positional relationships between words.

From RNNs to Attention

Early sequence models used recurrent neural networks (RNNs), which process words sequentially and pass a hidden state forward. While suitable for short dependencies, information from earlier tokens fades as the sequence grows.

[Figure: RNN]

Long Short‑Term Memory (LSTM) networks, introduced in 1997, add gated mechanisms (input, forget, output gates) to control information flow, preserving relevant data over longer spans.

[Figure: LSTM]

The Transformer, proposed in the 2017 paper, removes this sequential bottleneck: self-attention directly relates each token to every other token, computing weighted contributions that capture long-range dependencies.

Attention Mechanism Details

The goal is to measure how much each token in a sentence influences every other token, producing an attention score for each pair. Collecting all scores yields an attention matrix. To compute these scores, each token provides three vectors:

Query vector – obtained by multiplying the token embedding with a learnable query matrix Wq.

Key vector – obtained by multiplying the token embedding with a learnable key matrix Wk.

Value vector – obtained by multiplying the token embedding with a learnable value matrix Wv.

[Figure: attention mechanism illustration]

For a token at position 3 ("slept") in the sentence "The cat slept on the mat and it purred.", the first score is the dot product of its query vector q3 with the key vector of position 1 (k1), the second score uses k2, and so on. Each raw score is divided by the square root of the key dimension (√d_k), and the scaled scores are then passed through a softmax to obtain normalized attention weights; compactly, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. The token's final context vector is the weighted sum of the value vectors under these weights.

[Figure: scaled dot-product attention]

In causal language modeling, future positions are masked before applying softmax to ensure the probability of a token depends only on past tokens.

[Figure: causal attention mask]
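
As a toy illustration with random numbers (purely for shape intuition; the reusable module appears later in the tutorial):

import torch

T, d = 4, 8                                              # 4 tokens, key/query dimension 8
q, k, v = torch.randn(1, T, d), torch.randn(1, T, d), torch.randn(1, T, d)

scores = (q @ k.transpose(-2, -1)) / d ** 0.5            # (1, T, T) scaled dot products
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()   # True above the diagonal = future tokens
scores = scores.masked_fill(mask, float("-inf"))
weights = scores.softmax(dim=-1)                         # rows sum to 1; future positions get weight 0
context = weights @ v                                    # (1, T, d) weighted sum of value vectors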

Building the Transformer Architecture

The original Transformer consists of encoder and decoder modules; GPT-style models, including the one built in this tutorial, keep only the decoder stack. Both module types share the same core components: token embeddings, positional encodings, multi-head self-attention, and a feed-forward layer.

[Figure: Transformer encoder and decoder]

Conceptually, the architecture can be seen as a modular pipeline:

Input text is split into tokens.

Tokens are converted to embeddings and added to positional encodings.

Self‑attention lets each token attend to every other token.

Multi‑head attention runs several self‑attention heads in parallel, capturing information from different sub‑spaces.

A feed‑forward network transforms the aggregated information into richer features.

Stacking many such layers enables the model to learn increasingly sophisticated language representations.

Tokenization

The first step is to split the input text into tokens. You can implement a character‑level or word‑level tokenizer from scratch, but the tutorial uses the ready‑made tiktoken library for convenience. The essential steps are:

import tiktoken
import torch

def text_to_token_ids(text, tokenizer, device):
    encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0).to(device)  # add batch dim
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # remove batch dim
    return tokenizer.decode(flat.tolist())

For custom vocabularies you can implement a byte‑pair encoder.

Positional Encoding and Embedding

Because the Transformer processes all tokens in parallel, it needs positional encodings to retain order information. The original paper generates them with fixed sinusoidal functions; GPT-style models, including the one built here, learn a positional embedding table instead. Either way, the positional signal is added to the token embeddings.

[Figure: sinusoidal positional encoding]
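
For reference, the paper's fixed sinusoidal encoding can be generated as below (a sketch assuming an even dimension; the model in this tutorial instead learns its positional embeddings, as shown next):

import math
import torch

def sinusoidal_positional_encoding(context_length, dim):
    position = torch.arange(context_length).unsqueeze(1)                        # (T, 1)
    div_term = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))  # (dim/2,)
    pe = torch.zeros(context_length, dim)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe                                     # (T, dim), added to the token embeddings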

The embedding layer maps each token ID to a dense vector of size attention_dim; the embedding matrix has shape (vocab_size, attention_dim). The positional embedding table has shape (context_length, attention_dim), so each position vector has the same dimension as a token embedding, allowing element-wise addition.

self.embedding = torch.nn.Embedding(vocab_size, attention_dim)
self.positional_embedding = torch.nn.Embedding(context_length, attention_dim)

Forward pass:

embeddings = self.embedding(context)
context_len = context.shape[1]
position = torch.arange(context_len, device=context.device).unsqueeze(0)
position_embeddings = self.positional_embedding(position)

e = embeddings + position_embeddings

Self‑Attention: How Tokens Gossip

Define learnable query, key, and value projection matrices and compute the three vectors:

self.w_key = torch.nn.Linear(embed_dim, attention_dim, bias=bias)
self.w_query = torch.nn.Linear(embed_dim, attention_dim, bias=bias)
self.w_value = torch.nn.Linear(embed_dim, attention_dim, bias=bias)

k = self.w_key(x)   # (B, T, A)
q = self.w_query(x) # (B, T, A)
v = self.w_value(x) # (B, T, A)

Scaled dot‑product attention scores:

scores = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)  # (B, T, T)

Apply causal mask and softmax, then combine with values:

mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
scores = scores.masked_fill(mask, float('-1e10'))
attn = scores.softmax(dim=-1)  # (B, T, T)
final = attn @ v  # (B, T, A)

Optionally add dropout (e.g., 0.1) for regularisation.

[Figure: self-attention diagram]

Full module:

class SelfAttention(torch.nn.Module):
    def __init__(self, embed_dim, attention_dim, bias=False, dropout=0.1):
        super().__init__()
        self.w_key = torch.nn.Linear(embed_dim, attention_dim, bias=bias)
        self.w_query = torch.nn.Linear(embed_dim, attention_dim, bias=bias)
        self.w_value = torch.nn.Linear(embed_dim, attention_dim, bias=bias)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, x):
        B, T, _ = x.size()
        k = self.w_key(x)   # (B, T, A)
        q = self.w_query(x) # (B, T, A)
        v = self.w_value(x) # (B, T, A)
        scores = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)  # (B, T, T)
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(mask, float('-1e10'))
        attn = scores.softmax(dim=-1)  # (B, T, T)
        attn = self.dropout(attn)
        return attn @ v  # (B, T, A)
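
A quick shape check with made-up sizes (batch of 2, 8 tokens, 64-dimensional embeddings):

x = torch.randn(2, 8, 64)
attention = SelfAttention(embed_dim=64, attention_dim=64)
print(attention(x).shape)  # torch.Size([2, 8, 64])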
[Figure: self-attention output]

Multi‑Head Attention: Group Chat in the Model Brain

Multiple self‑attention heads run in parallel, each learning to focus on different aspects of the data.

class MultiHeadAttention(torch.nn.Module):
    def __init__(self, num_heads, embed_dim, attention_dim, dropout=0.1):
        super().__init__()
        self.head_size = attention_dim // num_heads
        self.heads = torch.nn.ModuleList()
        for _ in range(num_heads):
            self.heads.append(SelfAttention(embed_dim=embed_dim, attention_dim=self.head_size, dropout=dropout))

    def forward(self, x):
        head_outputs = []
        for head in self.heads:
            head_outputs.append(head(x))  # B x T x (A/num_heads)
        concatenated = torch.cat(head_outputs, dim=2)
        return concatenated
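
Many implementations follow the concatenation with an extra learned output projection that mixes information across heads; it is omitted here for simplicity. A quick shape check with illustrative sizes:

x = torch.randn(2, 8, 64)
mha = MultiHeadAttention(num_heads=4, embed_dim=64, attention_dim=64)
print(mha(x).shape)  # torch.Size([2, 8, 64]), i.e. 4 heads of size 16, concatenated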
[Figure: multi-head attention diagram]

The total attention dimension is split among the heads, and the concatenated output is fed to the next layer.

Feed‑Forward Network

After attention, each token embedding passes through a small feed‑forward network, typically two linear layers with a non‑linear activation (GELU) in between.

class FeedForward(torch.nn.Module):
    def __init__(self, attention_dim):
        super().__init__()
        self.up = torch.nn.Linear(attention_dim, attention_dim * 4)
        self.gelu = torch.nn.GELU()
        self.down = torch.nn.Linear(attention_dim * 4, attention_dim)
    def forward(self, x):
        return self.down(self.gelu(self.up(x)))
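
The 4x expansion of the hidden layer follows the convention used by GPT-2. A quick shape check:

ff = FeedForward(attention_dim=64)
print(ff(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])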
[Figure: feed-forward network]

Decoder with Residual Connections

The decoder stacks a masked multi-head attention layer followed by a feed-forward network. Each sub-layer is wrapped in a residual connection, with layer normalisation applied to the sub-layer's input (the pre-norm arrangement used in GPT-style models).

class Decoder(torch.nn.Module):
    def __init__(self, num_heads, embed_dim, attention_dim, dropout=0.1):
        super().__init__()
        self.masked_multihead = MultiHeadAttention(num_heads, embed_dim, attention_dim, dropout)
        self.feed_forward = FeedForward(attention_dim)
        self.n1 = torch.nn.LayerNorm(attention_dim)
        self.n2 = torch.nn.LayerNorm(attention_dim)

    def forward(self, x):
        # Pre-norm residual connection around the masked multi-head attention
        e = x + self.masked_multihead(self.n1(x))
        # Pre-norm residual connection around the feed-forward network
        e = e + self.feed_forward(self.n2(e))
        return e
[Figure: decoder block]

Assembling the Transformer Skeleton

The final model combines the embedding, positional embedding, a stack of decoder blocks, a final layer‑norm, and a linear projection to vocabulary size (the LM head).

class GPT(torch.nn.Module):
    def __init__(self, num_heads, vocab_size, embed_dim, attention_dim, num_blocks, context_length, dropout_rate):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, attention_dim)
        self.positional_embedding = torch.nn.Embedding(context_length, attention_dim)
        self.decoders = torch.nn.ModuleList([
            Decoder(num_heads, attention_dim, attention_dim, dropout_rate) for _ in range(num_blocks)
        ])
        self.exit_norm = torch.nn.LayerNorm(attention_dim)
        self.linear = torch.nn.Linear(attention_dim, vocab_size)

    def forward(self, context):
        embeddings = self.embedding(context)
        context_len = context.shape[1]
        position = torch.arange(context_len, device=context.device).unsqueeze(0)
        position_embeddings = self.positional_embedding(position)
        e = embeddings + position_embeddings
        for decoder in self.decoders:
            e = decoder(e)
        return self.linear(self.exit_norm(e))

Production GPT models stack dozens of decoder blocks (the GPT-2 variants use 12 to 48), which dramatically increases memory consumption and requires massive amounts of training data.

[Figure: stacked decoder blocks]
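
The generation snippet below also assumes a device, a tokenizer, and a set of hyperparameters that the article never pins down; one illustrative small-scale configuration (all values are assumptions, not taken from the original) might be:

import torch
import tiktoken

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = tiktoken.get_encoding("gpt2")

vocab_size = tokenizer.n_vocab   # 50257 for the GPT-2 BPE vocabulary
context_length = 128             # maximum number of tokens the model attends over
embed_dim = attention_dim = 256
num_heads = 4                    # attention_dim must be divisible by num_heads
num_blocks = 4
dropout_rate = 0.1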

After building the model, you can test generation with a simple top‑k sampling function.

def top_k_logits(logits, k):
    v, ix = torch.topk(logits, k)
    out = logits.clone()
    out[out < v[:, [-1]]] = float('-inf')
    return out

@torch.no_grad()  # sampling needs no gradients
def generate(model, max_new_tokens, context, context_length, temperature=1.0, top_k=None):
    for _ in range(max_new_tokens):
        if context.shape[1] > context_length:
            context = context[:, -context_length:]
        logits = model(context)  # [B, T, V]
        logits = logits[:, -1, :]  # [B, V]
        logits = logits / max(temperature, 1e-3)
        if top_k is not None:
            logits = top_k_logits(logits, top_k)
        if torch.isnan(logits).any() or torch.isinf(logits).any():
            raise ValueError("Logits contain NaN or Inf")
        probabilities = torch.nn.functional.softmax(logits, dim=-1)
        probabilities = torch.clamp(probabilities, min=1e-9, max=1.0)
        next_token = torch.multinomial(probabilities, 1)  # [B, 1]
        context = torch.cat((context, next_token), dim=1)
    return context

start_context = "I want something"
model = GPT(num_heads, vocab_size, embed_dim, attention_dim, num_blocks, context_length, dropout_rate).to(device)
model.eval()
token_ids = generate(
    model=model,
    context=text_to_token_ids(start_context, tokenizer, device),
    max_new_tokens=10,
    context_length=context_length
)
print("Output text:
", token_ids_to_text(token_ids, tokenizer))

Typical output may look like random‑looking text, e.g.:

Output text:
 I want something introduceウ coaches Kard Judaism trendsCommerce rotating infiltration approach

Model Pre‑training

Pre‑training teaches the model basic English grammar and semantics. A large corpus of English text is required; the tutorial uses the public IMDb dataset as an example.

Data Preparation

Load the dataset, keep only ASCII characters, and concatenate all reviews into a single string.

from datasets import load_dataset
import re

ds = load_dataset("stanfordnlp/imdb")

def keep_english_only(text):
    return re.sub(r"[^\x00-\x7F]+", "", text)

def combine_and_clean(text_list):
    cleaned_list = [keep_english_only(t) for t in text_list]
    combined = " ".join(cleaned_list)
    # Collapse runs of whitespace into single spaces so word boundaries survive
    combined = re.sub(r'\s+', ' ', combined).strip()
    return combined

train_text_data = combine_and_clean(ds['train']['text'])
test_text_data = combine_and_clean(ds['test']['text'])

Define a dataset class that creates overlapping input‑target pairs using a sliding window.

from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        token_ids = tokenizer.encode(txt)  # tiktoken's encode takes no add_special_tokens argument
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
    def __len__(self):
        return len(self.input_ids)
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

def create_encoded_dataloader(txt, tokenizer, batch_size=4, max_length=128, stride=128, shuffle=True, drop_last=True, num_workers=0):
    dataset = CustomDataset(txt, tokenizer, max_length, stride)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers, pin_memory=True)
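
The training loop later in the tutorial expects a train_loader and a val_loader, which are not built explicitly here; one plausible way to create them from the cleaned text, assuming a simple 90/10 split, is:

split_idx = int(0.9 * len(train_text_data))

train_loader = create_encoded_dataloader(
    train_text_data[:split_idx], tokenizer,
    batch_size=32, max_length=context_length, stride=context_length, shuffle=True,
)
val_loader = create_encoded_dataloader(
    train_text_data[split_idx:], tokenizer,
    batch_size=32, max_length=context_length, stride=context_length,
    shuffle=False, drop_last=False,
)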

Inspect dataset size:

total_characters = len(train_text_data)
total_tokens = len(tokenizer.encode(train_text_data))
print("Characters:", total_characters)
print("Tokens:", total_tokens)

If the token count is insufficient for the chosen context length, the script suggests lowering the context length or adjusting the training ratio.

Training

Initialize weights to ensure a stable starting point.

def initialize_weights(module):
    if isinstance(module, torch.nn.Linear):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, torch.nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    elif isinstance(module, torch.nn.LayerNorm):
        torch.nn.init.ones_(module.weight)
        torch.nn.init.zeros_(module.bias)

model.apply(initialize_weights)

Use cross‑entropy loss because predicting the next token is a multi‑class classification problem.

[Figure: cross-entropy loss illustration]

Example loss calculation for a short sequence:

Position 0 target "cat" → P=0.90 → L₀ = −log(0.9) ≈ 0.105

Position 1 target "sat" → P=0.10 → L₁ = −log(0.1) = 2.302

Position 2 target "on" → P=0.05 → L₂ = −log(0.05) = 2.996

Position 3 target "the" → P=0.75 → L₃ = −log(0.75) ≈ 0.288

Position 4 target "mat" → P=0.75 → L₄ = −log(0.75) ≈ 0.288

Average loss: L₍avg₎ = (0.105+2.302+2.996+0.288+0.288)/5 ≈ 1.20.
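
The training loop below relies on two helpers, calc_loss_batch and evaluate_model, that are not shown in the article; a minimal sketch consistent with the rest of the code could look like this:

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)  # (B, T, V)
    # Flatten batch and time dimensions; cross-entropy is computed over the vocabulary
    return torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())

def calc_loss_loader(data_loader, model, device, num_batches=None):
    losses = []
    for i, (inp, tgt) in enumerate(data_loader):
        if num_batches is not None and i >= num_batches:
            break
        losses.append(calc_loss_batch(inp, tgt, model, device).item())
    return sum(losses) / max(1, len(losses))

def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss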

Training loop highlights:

Gradient clipping to prevent exploding gradients.

Early stopping when validation loss does not improve.

AdamW optimizer for decoupled weight decay.

Learning‑rate scheduler CosineWithWarmup linearly ramps up the LR for a warm‑up phase, then follows a cosine decay to a minimum value.

import math

class CosineWithWarmup(torch.optim.lr_scheduler._LRScheduler):
    def __init__(self, optimizer, warmup_steps, total_steps, base_lr, min_lr, last_epoch=-1):
        self.warmup_steps = max(1, warmup_steps)
        self.total_steps = max(self.warmup_steps + 1, total_steps)
        self.base_lr = base_lr
        self.min_lr = min_lr
        super().__init__(optimizer, last_epoch)
    def get_lr(self):
        step = self.last_epoch + 1
        lrs = []
        for _ in self.base_lrs:
            if step <= self.warmup_steps:
                lr = self.base_lr * step / self.warmup_steps
            else:
                progress = (step - self.warmup_steps) / max(1, self.total_steps - self.warmup_steps)
                lr = self.min_lr + 0.5 * (self.base_lr - self.min_lr) * (1 + math.cos(math.pi * progress))
            lrs.append(lr)
        return lrs

Training settings example:

settings = {
    "learning_rate": 3e-4,
    "weight_decay": 0.1,
    "num_epochs": 300,
    "batch_size": 32,
    "warmup_steps": 1500,
    "max_lr": 3e-4,
    "min_lr": 3e-5,
    "eval_freq": 200,
    "eval_iter": 20,
    "gradient_clip": 1.0,
    "patience": 50,
    "min_improvement": 1e-4,
    "print_interval": 1,
    "generate_interval": 5,
}

Training loop (simplified):

import os

def train_model(model, train_loader, val_loader, device, settings, save_path="checkpoints/gpt.pt"):
    torch.manual_seed(123)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(123)
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=settings["learning_rate"], weight_decay=settings["weight_decay"], betas=(0.9, 0.95))
    total_steps = settings["num_epochs"] * len(train_loader)
    scheduler = CosineWithWarmup(optimizer, warmup_steps=settings["warmup_steps"], total_steps=total_steps, base_lr=settings["max_lr"], min_lr=settings["min_lr"])
    best_val_loss = float("inf")
    patience_counter = 0
    global_step = -1
    for epoch in range(settings["num_epochs"]):
        model.train()
        for inp, tgt in train_loader:
            loss = calc_loss_batch(inp, tgt, model, device)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), settings["gradient_clip"])
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
            scheduler.step()
            global_step += 1
            if global_step % settings["eval_freq"] == 0:
                train_loss, val_loss = evaluate_model(model, train_loader, val_loader, device, eval_iter=settings["eval_iter"])
                lr_now = optimizer.param_groups[0]["lr"]
                print(f"Ep {epoch+1} | step {global_step:06d} | lr {lr_now:.3e} | train {train_loss:.3f} | val {val_loss:.3f}")
                if val_loss + settings["min_improvement"] < best_val_loss:
                    best_val_loss = val_loss
                    patience_counter = 0
                    os.makedirs(os.path.dirname(save_path), exist_ok=True)
                    torch.save({"model_state": model.state_dict(), "optimizer_state": optimizer.state_dict(), "epoch": epoch, "global_step": global_step}, save_path)
                    print(f"[Checkpoint saved at step {global_step}]")
                else:
                    patience_counter += 1
                    if patience_counter >= settings["patience"]:
                        print("Early stopping triggered.")
                        return
    return
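
Once pre-training finishes, you can reload the best checkpoint and sample from the model (a short sketch; the checkpoint path matches the default above, while the prompt is illustrative):

checkpoint = torch.load("checkpoints/gpt.pt", map_location=device)
model.load_state_dict(checkpoint["model_state"])
model.eval()

token_ids = generate(
    model=model,
    context=text_to_token_ids("The movie was", tokenizer, device),
    max_new_tokens=30,
    context_length=context_length,
    temperature=0.8,
    top_k=40,
)
print(token_ids_to_text(token_ids, tokenizer))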

Fine‑tuning on a Small Coldplay Lyrics Dataset

After pre‑training, the model can generate basic English. To give it a Coldplay style, fine‑tune on a small Coldplay lyrics corpus using a lower learning rate and fewer epochs.

settings_ft = {
    "learning_rate": 1e-5,
    "weight_decay": 0.01,
    "num_epochs": 5,
    "batch_size": 4,
    "warmup_steps": 100,
    "max_lr": 1e-5,
    "min_lr": 1e-6,
    "eval_freq": 50,
    "eval_iter": 5,
    "gradient_clip": 0.5,
    "patience": 3,
    "min_improvement": 1e-4,
    "print_interval": 1,
    "generate_interval": 2,
}

This call assumes the full train_model from the original code, which also accepts the tokenizer, the context length, and a sample prompt for periodic generation, and which returns the loss histories; the simplified loop shown earlier omits those extras. The loaders train_dataloader_ft and val_dataloader_ft are built from the lyrics corpus with the same create_encoded_dataloader helper.

train_losses_ft, val_losses_ft, tokens_seen_ft = train_model(
    model,
    train_dataloader_ft,
    val_dataloader_ft,
    tokenizer,
    device,
    settings=settings_ft,
    context_length=context_length,
    save_path="checkpoints/gpt_finetuned_coldplay.pt",
    sample_prompt="Look at the star look how the"
)

Sample output after fine‑tuning:

lights go out and the stars begin to fall i hear your voice across the night
lights are running in circles chasing the echoes
you are the star that keeps me alive
Oh‑ooh‑oh‑ooh oh, oh
i will follow you, i will follow you

Conclusion

You have now built a Transformer‑based large language model from the ground up using PyTorch, covering every step from tokenization and attention mechanisms to model assembly, pre‑training on a public dataset, and fine‑tuning on a custom lyric corpus. The tutorial provides detailed code, mathematical explanations, and practical tips that are often omitted elsewhere.


Tags: LLM, Transformer, fine-tuning, tokenization, PyTorch, self-attention, GPT
Written by Data STUDIO, a publication of original data science articles centered on Python: machine learning, data analysis, visualization, MySQL, and project case studies.