Build a Full‑Scale LLM from Scratch in 61 Lines of Python

This step‑by‑step tutorial shows how to set up a GPU environment, prepare custom text data, train a tokenizer, configure and train a GPT‑2‑based large language model, test its generation, and run the entire pipeline using only 61 lines of Python code.

Tencent Cloud Developer

Jul 19, 2023

Overview

This guide demonstrates how to pre‑train a small GPT‑2‑style large language model (LLM) from scratch on a Chinese text corpus. The workflow covers environment setup, data acquisition, tokenizer training, model training, inference, and reproducible execution via Docker.

1. Environment Setup

Run on a GPU instance (e.g., NVIDIA T4 with 16 GB VRAM) using Python 3.11. Install the required packages listed in requirements.txt:

tokenizers==0.13.3
torch==2.0.1
transformers==4.30
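
Before training, it can help to confirm that the installed versions match the pins above. The following is a minimal sketch using only the standard library. The helper name check_pins is hypothetical, and the prefix comparison is an assumption made so that the pin transformers==4.30 also accepts patch releases such as 4.30.2 and so that local build suffixes such as +cu117 on torch are ignored.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from requirements.txt above.
PINS = {"tokenizers": "0.13.3", "torch": "2.0.1", "transformers": "4.30"}


def check_pins(pins=PINS):
    """Return {package: (expected, installed_or_None, ok)} for each pin."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        ok = False
        if installed is not None:
            # Drop a local build suffix such as "+cu117" before comparing.
            base = installed.split("+")[0]
            # Prefix match on a dotted boundary: "4.30" accepts "4.30.2" but not "4.300".
            ok = base == expected or base.startswith(expected + ".")
        report[name] = (expected, installed, ok)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in check_pins().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: expected {expected}, found {installed}, {status}")
```

A package reported as MISMATCH is not necessarily broken, but the code in this guide was written against the pinned versions.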

2. Data Preparation

Download the training text (the novel Romance of the Three Kingdoms):

https://raw.githubusercontent.com/xinzhanguo/hellollm/main/text/sanguoyanyi.txt

Save it as text/sanguoyanyi.txt.
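
The download can also be scripted. This is a small sketch with the standard library only. The function name fetch_corpus is made up for illustration, and it assumes the URL above is reachable from the training machine.

```python
import urllib.request
from pathlib import Path

# URL and target path taken from the text above.
CORPUS_URL = "https://raw.githubusercontent.com/xinzhanguo/hellollm/main/text/sanguoyanyi.txt"


def fetch_corpus(url=CORPUS_URL, dest="text/sanguoyanyi.txt"):
    """Download url to dest, creating parent directories, and return the path."""
    path = Path(dest)
    path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        path.write_bytes(response.read())
    return path
```

Calling fetch_corpus() with no arguments writes the novel to text/sanguoyanyi.txt, which is the path the later steps expect.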

3. Tokenizer Training

Build a Byte‑Level BPE tokenizer compatible with GPT‑2 and include the special tokens <s>, <pad>, </s>, <unk>, <mask>:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.normalizers import NFKC, Sequence
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from transformers import GPT2TokenizerFast

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.normalizer = Sequence([NFKC()])
tokenizer.pre_tokenizer = ByteLevel()
tokenizer.decoder = ByteLevelDecoder()

special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]
trainer = BpeTrainer(vocab_size=50000, show_progress=True,
                     initial_alphabet=ByteLevel.alphabet(),
                     special_tokens=special_tokens)
files = ["text/sanguoyanyi.txt"]
tokenizer.train(files, trainer)

gpt2_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)
gpt2_tokenizer.save_pretrained("./sanguo")

After execution, the sanguo directory contains vocab.json and merges.txt, along with tokenizer.json, tokenizer_config.json, and special_tokens_map.json.
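
For intuition about what tokenizer.train does, BPE training repeatedly finds the most frequent adjacent pair of symbols and merges it into a new symbol. The sketch below is a toy illustration in plain Python, not the tokenizers implementation: it works on characters rather than bytes, skips normalization and pre-tokenization, and all function names are invented for this example.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences; return the top pair or None."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of pair in seq with the joined symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_toy_bpe(lines, num_merges):
    """Learn up to num_merges merge rules from a list of strings."""
    sequences = [list(line) for line in lines]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences
```

The learned merge list plays the role of merges.txt, and the set of symbols that survive corresponds to vocab.json. The real trainer stops once the vocabulary reaches vocab_size or no pairs remain.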

4. Model Training

Load the tokenizer, configure a GPT‑2 model with matching vocabulary size, and train on the line‑by‑line dataset:

from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer
from transformers import LineByLineTextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

# Load tokenizer and add special tokens
tokenizer = GPT2Tokenizer.from_pretrained("./sanguo")
tokenizer.add_special_tokens({
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>"
})

# Model configuration; len(tokenizer) also counts any tokens added after training
config = GPT2Config(vocab_size=len(tokenizer),
                    bos_token_id=tokenizer.bos_token_id,
                    eos_token_id=tokenizer.eos_token_id)
model = GPT2LMHeadModel(config)

# Dataset (each line is a training example)
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./text/sanguoyanyi.txt",
    block_size=32  # reduce if GPU memory is limited
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=16,
    save_steps=2000,
    save_total_limit=2,
    logging_steps=500
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)
trainer.train()
model.save_pretrained("./sanguo")
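
The training arguments determine how long the run takes and how many checkpoints it writes. The sketch below estimates both from the corpus. It assumes a single GPU, no gradient accumulation, and that the dataset keeps one example per non-blank line. The function names are hypothetical.

```python
import math


def count_examples(path):
    """Count lines that are non-empty and not whitespace-only."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f.read().splitlines() if line and not line.isspace())


def estimate_steps(num_examples, batch_size=16, epochs=20, save_steps=2000):
    """Return (steps_per_epoch, total_steps, checkpoints_written)."""
    # The last batch of an epoch may be smaller, so round up.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    # One checkpoint is written every save_steps optimizer steps.
    return steps_per_epoch, total_steps, total_steps // save_steps


if __name__ == "__main__":
    import os

    corpus = "./text/sanguoyanyi.txt"
    if os.path.exists(corpus):
        print(estimate_steps(count_examples(corpus)))
```

With save_total_limit=2, only the two most recent checkpoints stay on disk regardless of how many are written during the run.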

Training produces config.json, generation_config.json, and pytorch_model.bin inside the sanguo folder.
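
The size of the saved weights follows from the parameter count. The configuration above sets only the vocabulary size and token IDs, so the architecture stays at the GPT2Config defaults (12 layers, hidden size 768, 1024 positions). The sketch below counts parameters for that architecture, assuming the output layer shares its weights with the token embedding. The function name is made up for illustration.

```python
def gpt2_param_count(vocab_size, n_positions=1024, n_embd=768, n_layer=12):
    """Count trainable parameters of a GPT-2 LM with tied input/output embeddings."""
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Each layer norm has a weight and a bias vector.
    layer_norm = 2 * n_embd
    # Attention: fused QKV projection, then the output projection (weights + biases).
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back (weights + biases).
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    per_layer = 2 * layer_norm + attention + mlp
    # The trailing layer_norm term is the final norm after the last block.
    return embeddings + n_layer * per_layer + layer_norm


if __name__ == "__main__":
    n = gpt2_param_count(vocab_size=50000)
    print(f"{n:,} parameters, about {n * 4 / 1e6:.0f} MB in float32")
```

With a 50,000-token vocabulary this comes to roughly 124 million parameters. The trained tokenizer may end up with fewer tokens if the corpus runs out of pairs to merge, in which case the embedding table and the total shrink accordingly.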

5. Inference

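Generate text with the Hugging Face pipeline API. Note that max_length=10 caps the total length in tokens, prompt included, so the continuation is short: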

from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="./sanguo")
set_seed(42)
print(generator("吕布", max_length=10))

Typical output continues the Three Kingdoms narrative, e.g., "吕布十二回张翼德 ...".
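
Seeding only affects output when generation draws random samples; greedy decoding gives the same result regardless of the seed. The toy sketch below shows why a fixed seed makes sampled output repeatable. It is plain Python, not what the pipeline runs internally: the logits are invented, the same distribution is reused at every step instead of being recomputed by a model, and the function names are hypothetical.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Sample one token from the k highest-scoring entries of a {token: logit} dict."""
    top = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    # Softmax weights over the surviving logits, shifted by the max for stability.
    peak = top[0][1]
    weights = [math.exp(score - peak) for _, score in top]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(logits, steps, seed, k=2):
    """Draw `steps` tokens using a private RNG seeded with `seed`."""
    rng = random.Random(seed)
    return [sample_top_k(logits, k, rng) for _ in range(steps)]


if __name__ == "__main__":
    toy_logits = {"曹操": 2.0, "刘备": 1.5, "关羽": 0.1}  # invented scores
    print(generate(toy_logits, steps=5, seed=42))
    print(generate(toy_logits, steps=5, seed=42))  # same seed, same tokens
```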

6. Full Repository and Execution

The complete script (sanguo.py) and data are hosted at:

https://github.com/xinzhanguo/hellollm/blob/main/sanguo.py

To run locally:

# Create a virtual environment
python3 -m venv ~/.env
source ~/.env/bin/activate

# Clone the repository
git clone git@github.com:xinzhanguo/hellollm.git
cd hellollm

# Install dependencies
pip install -r requirements.txt

# Execute the training script
python sanguo.py
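
After the script finishes, a quick check can confirm that the tokenizer and model files named in the earlier steps were written. This sketch assumes sanguo.py saves to ./sanguo as the snippets in this guide do. The helper name missing_artifacts is hypothetical.

```python
from pathlib import Path

# File names taken from the tokenizer and model training steps above.
EXPECTED_FILES = [
    "vocab.json",
    "merges.txt",
    "config.json",
    "generation_config.json",
    "pytorch_model.bin",
]


def missing_artifacts(model_dir="./sanguo", expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts()
    print("all artifacts present" if not missing else f"missing: {missing}")
```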

7. Docker Alternative

A Dockerfile is provided for a reproducible environment. Build and run the container (GPU optional):

# Build the image
docker build -t hellollm:beta .

# Run (CPU only)
docker run -it hellollm:beta sh -c "python sanguo.py"

# Run with GPU (if Docker is configured for GPU support)
# docker run -it --gpus all hellollm:beta sh -c "python sanguo.py"
