NanoChat Source Code Deep Dive: Karpathy’s Full‑Stack LLM Pipeline Explained
This article dissects NanoChat’s end‑to‑end LLM pipeline—from a lightweight 561M‑parameter transformer and custom Rust BPE tokenizer to Chinchilla‑scaled training, multi‑task fine‑tuning, optional RL on GSM8K, KV‑cache inference optimizations, and benchmark results that slightly surpass GPT‑2 Large.
Core Value of NanoChat
NanoChat combines a lightweight but modern LLM architecture with a production-style full-stack pipeline: data processing, an evaluation harness, and a web chat UI.
NanoChat's Ten‑Step Workflow
1. Environment setup – Python virtual env and Rust toolchain.
2. Tokenizer training – BPE on ~2 B characters, 65,536‑token vocab.
3. Data preparation – parallel download of 240 shards (~24 GB).
4. Base model pre‑training – 561 M‑parameter model trained on 8 H100 GPUs.
5. Base model evaluation – loss, sample generation, CORE metric.
6. Mid‑training – special dialogue tokens and formats.
7. Supervised fine‑tuning – domain adaptation.
8. Reinforcement learning – optional GSM8K math task.
9. Inference mode – CLI and Web UI chat interface.
10. Final report – markdown summary of metrics.
Key Files
scripts/base_train.py – distributed pre‑training with gradient accumulation and mixed precision.
nanochat/model.py – decoder‑only Transformer with configurable depth.
scripts/chat_eval.py – unified entry point for CORE, GSM8K, ARC, MMLU, etc.; generates evaluation reports.
rustbpe/ – high‑performance Rust BPE tokenizer compiled via Maturin.
nanochat/report.py – aggregates training/evaluation metrics into a markdown report.
Custom BPE Tokenizer (Rust)
Training a BPE tokenizer in pure Python was deemed too slow, so NanoChat ships a Rust BPE trainer: a 65,536‑token vocabulary trained on ~2 B characters, achieving a compression ratio of ~4.8 characters per token. Tokenizer training uses the Rust implementation, while inference falls back to OpenAI's tiktoken for efficiency.
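The core algorithm the Rust crate implements is classic byte-level BPE: repeatedly find the most frequent adjacent token pair and merge it into a new token. A minimal Python sketch of that loop (toy data and toy vocab size, not the Rust code itself):

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent token-id pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Train a toy vocab: start from raw bytes, merge until the target size.
text = b"low lower lowest"
ids = list(text)
vocab_size, next_id = 260, 256      # 256 byte tokens + 4 merges (toy numbers)
merges = {}
while next_id < vocab_size:
    pair = most_frequent_pair(ids)
    merges[pair] = next_id
    ids = merge(ids, pair, next_id)
    next_id += 1
```

The real trainer runs the same idea 65,280 times over ~2 B characters, which is exactly where a compiled language pays off.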
Transformer Model Architecture
Base version (d20) specifications:
20 layers, hidden size 1,280, 10 attention heads (128 dim each).
~561 M parameters, ~4×10¹⁹ FLOPs.
Derived from Llama, with bias terms removed and parameter‑free RMSNorm.
Uses the Muon optimizer for attention/FFN weight matrices and AdamW for the embedding matrices.
Learning rate scales as 1/√dim.
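The ~561 M figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes untied input/output embeddings and a standard 4× two-matrix MLP; the exact nanochat block layout may differ, but the totals line up:

```python
n_layer, d_model, vocab = 20, 1280, 65536

# Per-layer weights: Q, K, V, O projections plus a 4x-expansion MLP.
attn = 4 * d_model * d_model          # ~6.55M per layer
mlp = 2 * d_model * (4 * d_model)     # ~13.1M per layer
blocks = n_layer * (attn + mlp)       # ~393M

# Untied token embedding and output (lm_head) matrices.
embeds = 2 * vocab * d_model          # ~168M

total = blocks + embeds
print(f"{total / 1e6:.0f}M parameters")   # ≈ 561M
```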
Training Methodology
Chinchilla Scaling Law
Target token count = 20× parameters → ~11.2 B tokens. At a compression ratio of 4.8 characters per token, this equals ~54 B characters, covered by 240 shards of ~0.25 B characters each (~24 GB on disk). Each step processes 32 × 2048 tokens per GPU across 8 GPUs (~0.5 M tokens/step), yielding ~21,400 steps (~3 h total).
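The scaling arithmetic above is simple enough to reproduce in a few lines:

```python
params = 561e6
tokens = 20 * params                 # Chinchilla rule of thumb: ~20 tokens/param
chars = tokens * 4.8                 # tokenizer compression: 4.8 chars/token

# Throughput: 32 rows x 2048 tokens per GPU, times 8 GPUs.
tokens_per_step = 8 * 32 * 2048      # 524,288 ≈ 0.5M tokens/step
steps = tokens / tokens_per_step

print(f"{tokens / 1e9:.1f}B tokens, {chars / 1e9:.0f}B chars, ~{steps:,.0f} steps")
```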
Bits‑Per‑Byte (BPB) Metric
BPB = loss / (ln 2 × average bytes per token), providing a tokenizer‑independent quality measure. Training/validation BPB ≈ 0.81. CORE score = 0.22, slightly better than GPT‑2 Large (0.21) but below GPT‑2 XL.
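The conversion is a one-liner. The example below assumes ~4.8 bytes per token, i.e. mostly-ASCII text where characters and bytes roughly coincide:

```python
import math

def bits_per_byte(loss_nats: float, avg_bytes_per_token: float) -> float:
    """Convert mean cross-entropy loss (nats/token) to bits/byte:
    divide by ln 2 to get bits/token, then by bytes/token to get bits/byte."""
    return loss_nats / math.log(2) / avg_bytes_per_token

# A BPB of ~0.81 at 4.8 bytes/token corresponds to a raw loss of ~2.7 nats/token.
loss = 0.81 * math.log(2) * 4.8
bpb = bits_per_byte(loss, 4.8)
```

Because the denominator normalizes by bytes rather than tokens, models with different vocabularies can be compared directly.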
CORE Benchmark
Evaluates 22 datasets (HellaSwag, Jeopardy, BigBench QA, WikiData, ARC‑Easy/Challenge, COPA, CommonsenseQA, PIQA, LAMBADA, Winograd, BoolQ, etc.) using the DCLM‑recommended metric and periodic evaluation during pre‑training.
Data Processing
Pre‑training Data – FineWeb‑EDU
Source: HuggingFace FineWeb‑EDU (mostly English educational web text).
Total shards: 1,822; each ~0.25 B characters (~100 MB compressed Parquet).
Used 240 shards (~60 B characters, ~24 GB), stored at ~/.cache/nanochat/ in Parquet format.
Custom lightweight loader replaces heavy HuggingFace datasets library.
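The loader's job is mostly bookkeeping: enumerate local shard files and cycle over their documents indefinitely. A minimal sketch of that logic (shard naming and the `read_shard` callback are hypothetical; in practice the reader would wrap something like `pyarrow.parquet.read_table`):

```python
import os
from typing import Callable, Iterator, List

def shard_paths(cache_dir: str, n: int = 240) -> List[str]:
    """Paths of the first n shards in the local cache (hypothetical naming)."""
    return [os.path.join(cache_dir, f"shard_{i:05d}.parquet") for i in range(n)]

def iter_documents(shards: List[str],
                   read_shard: Callable[[str], List[str]]) -> Iterator[str]:
    """Cycle over shards forever, yielding one document (text string) at a time."""
    while True:
        for path in shards:
            for doc in read_shard(path):
                yield doc

# Toy usage with a stubbed reader (no real Parquet files needed).
fake = {"a.parquet": ["doc1", "doc2"], "b.parquet": ["doc3"]}
docs = iter_documents(["a.parquet", "b.parquet"], lambda p: fake[p])
first_four = [next(docs) for _ in range(4)]   # wraps around after doc3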
Mid‑training Data – SmolTalk
50 % high‑quality dialogues, 20 % GitHub README/documentation, 15 % code, 10 % GSM8K math, 5 % supplemental data.
Uses OpenAI Harmony format with special tokens <|user|>, <|assistant|>, <|system|> and multi‑turn structure.
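Rendering a conversation for training then reduces to flattening turns into one string with those delimiters. An illustrative sketch (the exact token spelling and any end-of-turn markers may differ in the real format):

```python
def render_dialogue(turns):
    """Flatten a multi-turn conversation into a single training string,
    prefixing each turn with its role token, e.g. <|user|> or <|assistant|>."""
    return "".join(f"<|{role}|>{text}" for role, text in turns)

convo = [
    ("system", "You are helpful."),
    ("user", "2+2?"),
    ("assistant", "4"),
]
rendered = render_dialogue(convo)
# "<|system|>You are helpful.<|user|>2+2?<|assistant|>4"
```

During mid-training the model learns these delimiters as single tokens, which is why they must already exist in the tokenizer vocabulary.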
SFT Data – Curated SmolTalk
Selects the highest‑quality samples, matches the length distribution seen at inference, and packs sequences the same way they are packed at test time, yielding modest but consistent performance gains.
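One simple way to pack variable-length examples into fixed-length training rows is a greedy bin-fill with padding; this is a sketch of the general technique, not necessarily nanochat's exact packing strategy:

```python
def pack_sequences(docs, max_len, pad_id=0):
    """Greedily pack tokenized examples into fixed-length rows of max_len,
    starting a new row when the next example would overflow, and padding
    the tail of each row with pad_id."""
    rows, cur = [], []
    for doc in docs:
        if cur and len(cur) + len(doc) > max_len:
            rows.append(cur + [pad_id] * (max_len - len(cur)))
            cur = []
        cur.extend(doc[:max_len])   # truncate any single over-long example
    if cur:
        rows.append(cur + [pad_id] * (max_len - len(cur)))
    return rows

rows = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=8)
# two rows of length 8: [1,2,3,4,5,pad,pad,pad] and [6,7,8,9,pad,...]
```

Matching the packing scheme between fine-tuning and evaluation removes a subtle train/test mismatch, which is the source of the "modest but consistent" gains.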
Three‑Stage Fine‑Tuning
Pre‑training : next‑token prediction on 54 B characters (~3 h), producing a base autocomplete model.
Mid‑training : adapts to dialogue format and special tokens (~8 min), outputting a chat‑capable model.
Supervised Fine‑Tuning (SFT) : domain‑adapted safety training on curated dialogue (~7 min), aligning token length to 2048 and matching test distribution, resulting in a production‑grade model.
Optional Reinforcement Learning
Uses a simplified GRPO algorithm (a PPO variant) on GSM8K math problems. The reward is answer correctness, and the training loop is sample → score → train. It improves Pass@1 and Pass@8, especially for the larger d30 model, but is currently limited to GSM8K and is not full RLHF.
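The sample → score → train loop with a group-relative baseline (the defining idea of GRPO) can be sketched as follows; `sample`, `is_correct`, and `train` are hypothetical stand-ins for the model's generation, answer checking, and gradient step:

```python
def grpo_step(prompt, sample, is_correct, train, group_size=8):
    """One simplified GRPO-style update for a single problem: sample a group
    of completions, use the group's mean reward as the baseline, and reinforce
    each completion in proportion to its relative advantage.
    (A sketch of the sample -> score -> train loop, not nanochat's exact code.)"""
    completions = [sample(prompt) for _ in range(group_size)]
    rewards = [1.0 if is_correct(c) else 0.0 for c in completions]
    baseline = sum(rewards) / len(rewards)
    for completion, reward in zip(completions, rewards):
        advantage = reward - baseline
        if advantage != 0.0:
            train(prompt, completion, advantage)  # weight log-prob grad by advantage
    return baseline   # group accuracy, useful for logging

# Toy usage with stubbed model functions.
import random
random.seed(0)
acc = grpo_step(
    "What is 3*4?",
    sample=lambda p: random.choice(["12", "7"]),
    is_correct=lambda c: c == "12",
    train=lambda p, c, a: None,
)
```

Using the group mean as the baseline avoids training a separate value network, which is what makes GRPO much simpler than full PPO.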
Inference Optimizations
KV‑cache reduces recomputation during generation.
Two‑stage inference: Prefill processes the full prompt; Decode generates token‑by‑token.
Python interpreter integration for intermediate calculations (e.g., GSM8K).
Web service built with FastAPI, front‑end in HTML + JavaScript, single‑command deployment.
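The prefill/decode split can be sketched independently of any framework. Here `model(ids, cache)` is a hypothetical interface that returns the next token id and an updated KV cache; the toy model at the bottom just illustrates the control flow:

```python
def generate(model, prompt_ids, max_new, stop_id=None):
    """Two-stage generation with a KV cache: one prefill pass over the whole
    prompt builds the cache, then each decode step feeds a single new token
    and reuses the cached keys/values instead of recomputing the prefix."""
    next_id, cache = model(prompt_ids, cache=None)      # prefill
    out = [next_id]
    for _ in range(max_new - 1):
        if next_id == stop_id:
            break
        next_id, cache = model([next_id], cache=cache)  # decode: one token
        out.append(next_id)
    return out

# Toy model: the "cache" is just the token history; predicts last token + 1.
def toy_model(ids, cache):
    cache = (cache or []) + list(ids)
    return cache[-1] + 1, cache

tokens = generate(toy_model, [1, 2, 3], max_new=4)   # -> [4, 5, 6, 7]
```

In a real Transformer the cache holds per-layer key/value tensors, so decode cost per token stays roughly constant instead of growing with sequence length.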
Performance Comparison
Base d20 model (561 M params) achieves CORE 0.22, marginally surpassing GPT‑2 Large (0.21). Planned d26 model aims for GPT‑2‑level performance (CORE ≈ 0.25‑0.26).
AI2ML AI to Machine Learning
Original articles on artificial intelligence and machine learning, deep optimization. Less is more, life is simple! Shi Chunqi
