Build a Full‑Scale LLM from Scratch in 61 Lines of Python
This step‑by‑step tutorial shows how to set up a GPU environment, prepare custom text data, train a tokenizer, configure and train a GPT‑2‑based large language model, test its generation, and run the entire pipeline using only 61 lines of Python code.
Overview
This guide demonstrates how to pre‑train a small GPT‑2‑style large language model (LLM) from scratch on a Chinese text corpus. The workflow covers environment setup, data acquisition, tokenizer training, model training, inference, and reproducible execution via Docker.
1. Environment Setup
Run on a GPU instance (e.g., NVIDIA T4 with 16 GB VRAM) using Python 3.11. Install the required packages listed in requirements.txt:
tokenizers==0.13.3
torch==2.0.1
transformers==4.30
2. Data Preparation
Download the training text (the novel Romance of the Three Kingdoms):
https://raw.githubusercontent.com/xinzhanguo/hellollm/main/text/sanguoyanyi.txt
Save it as text/sanguoyanyi.txt.
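The download step can also be scripted. The following is a minimal sketch using only the standard library; the helper names download_corpus and corpus_stats are illustrative and not part of the original script, and the file is assumed to be UTF-8 encoded.

```python
import urllib.request
from pathlib import Path

# URL taken from the tutorial text above.
CORPUS_URL = "https://raw.githubusercontent.com/xinzhanguo/hellollm/main/text/sanguoyanyi.txt"


def download_corpus(url: str, dest: str) -> Path:
    """Fetch `url` and write its bytes to `dest`, creating parent directories."""
    target = Path(dest)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        target.write_bytes(response.read())
    return target


def corpus_stats(path: Path) -> dict:
    """Count non-blank lines and characters (assumes the file is UTF-8)."""
    text = path.read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]
    return {"lines": len(lines), "chars": len(text)}
```

Calling download_corpus(CORPUS_URL, "text/sanguoyanyi.txt") and then corpus_stats on the returned path gives a quick sanity check that the file arrived and is not empty before tokenizer training starts.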
3. Tokenizer Training
Build a Byte‑Level BPE tokenizer compatible with GPT‑2 and include the special tokens <s>, <pad>, </s>, <unk>, <mask>:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.normalizers import NFKC, Sequence
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from transformers import GPT2TokenizerFast
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.normalizer = Sequence([NFKC()])
tokenizer.pre_tokenizer = ByteLevel()
tokenizer.decoder = ByteLevelDecoder()
special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]
trainer = BpeTrainer(vocab_size=50000, show_progress=True,
                     initial_alphabet=ByteLevel.alphabet(),
                     special_tokens=special_tokens)
files = ["text/sanguoyanyi.txt"]
tokenizer.train(files, trainer)
gpt2_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)
gpt2_tokenizer.save_pretrained("./sanguo")
After execution, the sanguo directory contains merges.txt, vocab.json, and related files.
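To make the merges in merges.txt less opaque, here is a toy, pure-Python sketch of the idea behind byte-level BPE training: start from single UTF-8 bytes and repeatedly merge the most frequent adjacent pair. It is an illustration only and not how the tokenizers library is implemented. The real trainer also applies the ByteLevel byte-to-character mapping, the vocab_size limit, and the special tokens, all of which this sketch ignores. The function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most common adjacent token pair across all sequences, or None."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` in `seq` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_toy_bpe(lines, num_merges):
    """Learn up to `num_merges` merges, starting from single UTF-8 bytes."""
    sequences = [[bytes([b]) for b in line.encode("utf-8")] for line in lines]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences
```

Each Chinese character in this corpus occupies three bytes in UTF-8, so before any merges a two-character name such as 吕布 is six tokens. Frequent characters and names collapse into single tokens as merges accumulate, which is why training the tokenizer on the target corpus matters.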
4. Model Training
Load the tokenizer, configure a GPT‑2 model with matching vocabulary size, and train on the line‑by‑line dataset:
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer
from transformers import LineByLineTextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
# Load tokenizer and add special tokens
tokenizer = GPT2Tokenizer.from_pretrained("./sanguo")
tokenizer.add_special_tokens({
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>"
})
# Model configuration
config = GPT2Config(vocab_size=tokenizer.vocab_size,
                    bos_token_id=tokenizer.bos_token_id,
                    eos_token_id=tokenizer.eos_token_id)
model = GPT2LMHeadModel(config)
# Dataset (each line is a training example)
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./text/sanguoyanyi.txt",
    block_size=32  # maximum tokens per line; reduce if GPU memory is limited
)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=16,  # replaces the deprecated per_gpu_train_batch_size
    save_steps=2000,
    save_total_limit=2,
    logging_steps=500
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)
trainer.train()
model.save_pretrained("./sanguo")
Training produces config.json, generation_config.json, and pytorch_model.bin inside the sanguo folder.
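The dataset and collator used in this section do two simple things: each non-blank line becomes one example truncated to block_size tokens, and each batch is padded to a common length with labels copied from the inputs. The sketch below mimics that behaviour in plain Python so the data flow is easy to inspect. It is an approximation rather than the library code; toy_encode and the pad id 256 are hypothetical stand-ins for the trained tokenizer and its real <pad> id.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_examples(lines, encode, block_size):
    """Turn each non-blank line into one example, truncated to `block_size` token ids."""
    return [encode(line)[:block_size] for line in lines if line.strip()]


def collate_causal_lm(batch, pad_id):
    """Pad a batch to its longest example and derive causal-LM labels.

    Labels copy the inputs, with padded positions set to IGNORE_INDEX
    so they do not contribute to the loss.
    """
    width = max(len(example) for example in batch)
    input_ids, attention_mask, labels = [], [], []
    for example in batch:
        padding = width - len(example)
        input_ids.append(example + [pad_id] * padding)
        attention_mask.append([1] * len(example) + [0] * padding)
        labels.append(example + [IGNORE_INDEX] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


def toy_encode(text):
    """Hypothetical stand-in for the trained tokenizer: one id per UTF-8 byte."""
    return list(text.encode("utf-8"))


examples = build_examples(["吕布", "", "张飞来"], toy_encode, block_size=8)
batch = collate_causal_lm(examples, pad_id=256)
```

Note that the labels are not shifted here. With mlm=False the shift by one position for next-token prediction happens inside GPT2LMHeadModel when it computes the loss.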
5. Inference
Generate text with the Hugging Face pipeline API:
from transformers import pipeline, set_seed
generator = pipeline("text-generation", model="./sanguo")
set_seed(42)
print(generator("吕布", max_length=10))
Typical output continues the Three Kingdoms narrative, e.g., "吕布十二回 张翼德 ...".
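One detail worth knowing about the call above: in Hugging Face generation, max_length bounds the total sequence length, prompt tokens included, so max_length=10 leaves fewer than ten new tokens. The sketch below shows a bare greedy decoding loop with that same stopping rule. It is a simplified illustration, not the pipeline's implementation; whether the pipeline decodes greedily or samples depends on the model's generation settings, and toy_scores is a hypothetical stand-in for a trained model.

```python
def greedy_generate(prompt_ids, next_token_scores, max_length, eos_id=None):
    """Extend `prompt_ids` one token at a time with the highest-scoring next token.

    `max_length` bounds the total length, prompt included.
    """
    ids = list(prompt_ids)
    while len(ids) < max_length:
        scores = next_token_scores(ids)
        best = max(range(len(scores)), key=scores.__getitem__)
        ids.append(best)
        if best == eos_id:
            break
    return ids


def toy_scores(ids, vocab_size=5):
    """Hypothetical stand-in for a model: favours the id after the last token."""
    favoured = (ids[-1] + 1) % vocab_size
    return [1.0 if token == favoured else 0.0 for token in range(vocab_size)]


generated = greedy_generate([0], toy_scores, max_length=4)
```

If the prompt already reaches max_length, the loop returns it unchanged, which mirrors why a long prompt combined with a small max_length produces little or no continuation.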
6. Full Repository and Execution
The complete script (sanguo.py) and data are hosted at:
https://github.com/xinzhanguo/hellollm/blob/main/sanguo.py
To run locally:
# Create a virtual environment
python3 -m venv ~/.env
source ~/.env/bin/activate
# Clone the repository
git clone https://github.com/xinzhanguo/hellollm.git
cd hellollm
# Install dependencies
pip install -r requirements.txt
# Execute the training script
python sanguo.py
7. Docker Alternative
A Dockerfile is provided for a reproducible environment. Build and run the container (GPU optional):
# Build the image
docker build -t hellollm:beta .
# Run (CPU only)
docker run -it hellollm:beta sh -c "python sanguo.py"
# Run with GPU (if Docker is configured for GPU support)
# docker run -it --gpus all hellollm:beta sh -c "python sanguo.py"
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
