NanoChat Source Code Deep Dive: Karpathy’s Full‑Stack LLM Pipeline Explained
This article dissects NanoChat’s end‑to‑end LLM pipeline—from a lightweight 561M‑parameter transformer and custom Rust BPE tokenizer to Chinchilla‑scaled training, multi‑task fine‑tuning, optional RL on GSM8K, KV‑cache inference optimizations, and benchmark results that slightly surpass GPT‑2 Large.
Core Value of NanoChat
NanoChat combines a lightweight but modern LLM architecture with a production-style full-stack pipeline: data processing, an evaluation harness, and a web chat UI.
NanoChat's Ten‑Step Workflow
1. Environment setup – Python virtual env and Rust toolchain.
2. Tokenizer training – BPE on ~2 B characters, 65,536‑token vocab.
3. Data preparation – parallel download of 240 shards (~24 GB).
4. Base model pre‑training – 561 M‑parameter model trained on 8 H100 GPUs.
5. Base model evaluation – loss, sample generation, CORE metric.
6. Mid‑training – special dialogue tokens and formats.
7. Supervised fine‑tuning – domain adaptation.
8. Reinforcement learning – optional GSM8K math task.
9. Inference mode – CLI and Web UI chat interface.
10. Final report – markdown summary of metrics.
Key Files
scripts/base_train.py – distributed pre‑training with gradient accumulation and mixed precision.
nanochat/model.py – decoder‑only Transformer with configurable depth.
scripts/chat_eval.py – unified entry point for CORE, GSM8K, ARC, MMLU, etc.; generates evaluation reports.
rustbpe/ – high‑performance Rust BPE tokenizer compiled via Maturin.
nanochat/report.py – aggregates training/evaluation metrics into a markdown report.
Custom BPE Tokenizer (Rust)
Training a BPE tokenizer in pure Python was deemed too slow, so NanoChat ships a Rust BPE trainer: a 65,536‑token vocabulary trained on ~2 B characters, achieving a compression ratio of ~4.8 characters per token. Tokenizer training uses the Rust implementation, while inference falls back to OpenAI's tiktoken for efficiency.
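The core algorithm the Rust crate implements is classic byte-level BPE: repeatedly find the most frequent adjacent token pair and merge it into a new token. A minimal Python sketch of that loop (toy data and toy vocab size, not the Rust code itself):

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent token-id pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Train a toy vocab: start from raw bytes, merge until the target size.
text = b"low lower lowest"
ids = list(text)
vocab_size, next_id = 260, 256      # 256 byte tokens + 4 merges (toy numbers)
merges = {}
while next_id < vocab_size:
    pair = most_frequent_pair(ids)
    merges[pair] = next_id
    ids = merge(ids, pair, next_id)
    next_id += 1
```

The real trainer runs the same idea 65,280 times over ~2 B characters, which is exactly where a compiled language pays off.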
Transformer Model Architecture
Base version (d20) specifications:
20 layers, hidden size 1,280, 10 attention heads (128 dim each).
~561 M parameters, ~4×10¹⁹ FLOPs.
Derived from Llama, with bias terms removed and parameter‑free RMSNorm.
Uses the Muon optimizer for attention/FFN weight matrices and AdamW for the embedding matrices.
Learning rate scales as 1/√dim.
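The ~561 M figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes untied input/output embeddings and a standard 4× two-matrix MLP; the exact nanochat block layout may differ, but the totals line up:

```python
n_layer, d_model, vocab = 20, 1280, 65536

# Per-layer weights: Q, K, V, O projections plus a 4x-expansion MLP.
attn = 4 * d_model * d_model          # ~6.55M per layer
mlp = 2 * d_model * (4 * d_model)     # ~13.1M per layer
blocks = n_layer * (attn + mlp)       # ~393M

# Untied token embedding and output (lm_head) matrices.
embeds = 2 * vocab * d_model          # ~168M

total = blocks + embeds
print(f"{total / 1e6:.0f}M parameters")   # ≈ 561M
```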
Training Methodology
Chinchilla Scaling Law
Target token count = 20× parameters → ~11.2 B tokens. At a compression ratio of 4.8 characters per token, this equals ~54 B characters, covered by 240 shards of ~0.25 B characters each (~24 GB on disk). Each step processes 32 × 2048 tokens per GPU across 8 GPUs (~0.5 M tokens/step), yielding ~21,400 steps (~3 h total).
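The scaling arithmetic above is simple enough to reproduce in a few lines:

```python
params = 561e6
tokens = 20 * params                 # Chinchilla rule of thumb: ~20 tokens/param
chars = tokens * 4.8                 # tokenizer compression: 4.8 chars/token

# Throughput: 32 rows x 2048 tokens per GPU, times 8 GPUs.
tokens_per_step = 8 * 32 * 2048      # 524,288 ≈ 0.5M tokens/step
steps = tokens / tokens_per_step

print(f"{tokens / 1e9:.1f}B tokens, {chars / 1e9:.0f}B chars, ~{steps:,.0f} steps")
```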
Bits‑Per‑Byte (BPB) Metric
BPB = loss / (ln 2 × average bytes per token), providing a tokenizer‑independent quality measure. Training/validation BPB ≈ 0.81. CORE score = 0.22, slightly better than GPT‑2 Large (0.21) but below GPT‑2 XL.
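The conversion is a one-liner. The example below assumes ~4.8 bytes per token, i.e. mostly-ASCII text where characters and bytes roughly coincide:

```python
import math

def bits_per_byte(loss_nats: float, avg_bytes_per_token: float) -> float:
    """Convert mean cross-entropy loss (nats/token) to bits/byte:
    divide by ln 2 to get bits/token, then by bytes/token to get bits/byte."""
    return loss_nats / math.log(2) / avg_bytes_per_token

# A BPB of ~0.81 at 4.8 bytes/token corresponds to a raw loss of ~2.7 nats/token.
loss = 0.81 * math.log(2) * 4.8
bpb = bits_per_byte(loss, 4.8)
```

Because the denominator normalizes by bytes rather than tokens, models with different vocabularies can be compared directly.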
CORE Benchmark
Evaluates 22 datasets (HellaSwag, Jeopardy, BigBench QA, WikiData, ARC‑Easy/Challenge, COPA, CommonsenseQA, PIQA, LAMBADA, Winograd, BoolQ, etc.) using the DCLM‑recommended metric and periodic evaluation during pre‑training.
Data Processing
Pre‑training Data – FineWeb‑EDU
Source: HuggingFace FineWeb‑EDU (mostly English educational web text).
Total shards: 1,822; each ~0.25 B characters (~100 MB compressed Parquet).
Used 240 shards (~60 B characters, ~24 GB), stored at ~/.cache/nanochat/ in Parquet format.
Custom lightweight loader replaces heavy HuggingFace datasets library.
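The loader's job is mostly bookkeeping: enumerate local shard files and cycle over their documents indefinitely. A minimal sketch of that logic (shard naming and the `read_shard` callback are hypothetical; in practice the reader would wrap something like `pyarrow.parquet.read_table`):

```python
import os
from typing import Callable, Iterator, List

def shard_paths(cache_dir: str, n: int = 240) -> List[str]:
    """Paths of the first n shards in the local cache (hypothetical naming)."""
    return [os.path.join(cache_dir, f"shard_{i:05d}.parquet") for i in range(n)]

def iter_documents(shards: List[str],
                   read_shard: Callable[[str], List[str]]) -> Iterator[str]:
    """Cycle over shards forever, yielding one document (text string) at a time."""
    while True:
        for path in shards:
            for doc in read_shard(path):
                yield doc

# Toy usage with a stubbed reader (no real Parquet files needed).
fake = {"a.parquet": ["doc1", "doc2"], "b.parquet": ["doc3"]}
docs = iter_documents(["a.parquet", "b.parquet"], lambda p: fake[p])
first_four = [next(docs) for _ in range(4)]   # wraps around after doc3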
Mid‑training Data – SmolTalk
50 % high‑quality dialogues, 20 % GitHub README/documentation, 15 % code, 10 % GSM8K math, 5 % supplemental data.
Uses OpenAI Harmony format with special tokens <|user|>, <|assistant|>, <|system|> and multi‑turn structure.
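Rendering a conversation for training then reduces to flattening turns into one string with those delimiters. An illustrative sketch (the exact token spelling and any end-of-turn markers may differ in the real format):

```python
def render_dialogue(turns):
    """Flatten a multi-turn conversation into a single training string,
    prefixing each turn with its role token, e.g. <|user|> or <|assistant|>."""
    return "".join(f"<|{role}|>{text}" for role, text in turns)

convo = [
    ("system", "You are helpful."),
    ("user", "2+2?"),
    ("assistant", "4"),
]
rendered = render_dialogue(convo)
# "<|system|>You are helpful.<|user|>2+2?<|assistant|>4"
```

During mid-training the model learns these delimiters as single tokens, which is why they must already exist in the tokenizer vocabulary.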
SFT Data – Curated SmolTalk
Selects the highest‑quality samples, matches the length distribution seen at inference, and packs sequences the same way they are packed at test time, yielding modest but consistent performance gains.
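One simple way to pack variable-length examples into fixed-length training rows is a greedy bin-fill with padding; this is a sketch of the general technique, not necessarily nanochat's exact packing strategy:

```python
def pack_sequences(docs, max_len, pad_id=0):
    """Greedily pack tokenized examples into fixed-length rows of max_len,
    starting a new row when the next example would overflow, and padding
    the tail of each row with pad_id."""
    rows, cur = [], []
    for doc in docs:
        if cur and len(cur) + len(doc) > max_len:
            rows.append(cur + [pad_id] * (max_len - len(cur)))
            cur = []
        cur.extend(doc[:max_len])   # truncate any single over-long example
    if cur:
        rows.append(cur + [pad_id] * (max_len - len(cur)))
    return rows

rows = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=8)
# two rows of length 8: [1,2,3,4,5,pad,pad,pad] and [6,7,8,9,pad,...]
```

Matching the packing scheme between fine-tuning and evaluation removes a subtle train/test mismatch, which is the source of the "modest but consistent" gains.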
Three‑Stage Fine‑Tuning
Pre‑training : next‑token prediction on 54 B characters (~3 h), producing a base autocomplete model.
Mid‑training : adapts to dialogue format and special tokens (~8 min), outputting a chat‑capable model.
Supervised Fine‑Tuning (SFT) : domain‑adapted safety training on curated dialogue (~7 min), aligning token length to 2048 and matching test distribution, resulting in a production‑grade model.
Optional Reinforcement Learning
Uses a simplified GRPO algorithm (a PPO variant) on GSM8K math problems. The reward is answer correctness, and the training loop is sample → score → train. It improves Pass@1 and Pass@8, especially for the larger d30 model, but is currently limited to GSM8K and is not full RLHF.
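The sample → score → train loop with a group-relative baseline (the defining idea of GRPO) can be sketched as follows; `sample`, `is_correct`, and `train` are hypothetical stand-ins for the model's generation, answer checking, and gradient step:

```python
def grpo_step(prompt, sample, is_correct, train, group_size=8):
    """One simplified GRPO-style update for a single problem: sample a group
    of completions, use the group's mean reward as the baseline, and reinforce
    each completion in proportion to its relative advantage.
    (A sketch of the sample -> score -> train loop, not nanochat's exact code.)"""
    completions = [sample(prompt) for _ in range(group_size)]
    rewards = [1.0 if is_correct(c) else 0.0 for c in completions]
    baseline = sum(rewards) / len(rewards)
    for completion, reward in zip(completions, rewards):
        advantage = reward - baseline
        if advantage != 0.0:
            train(prompt, completion, advantage)  # weight log-prob grad by advantage
    return baseline   # group accuracy, useful for logging

# Toy usage with stubbed model functions.
import random
random.seed(0)
acc = grpo_step(
    "What is 3*4?",
    sample=lambda p: random.choice(["12", "7"]),
    is_correct=lambda c: c == "12",
    train=lambda p, c, a: None,
)
```

Using the group mean as the baseline avoids training a separate value network, which is what makes GRPO much simpler than full PPO.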
Inference Optimizations
KV‑cache reduces recomputation during generation.
Two‑stage inference: Prefill processes the full prompt; Decode generates token‑by‑token.
Python interpreter integration for intermediate calculations (e.g., GSM8K).
Web service built with FastAPI, front‑end in HTML + JavaScript, single‑command deployment.
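The prefill/decode split can be sketched independently of any framework. Here `model(ids, cache)` is a hypothetical interface that returns the next token id and an updated KV cache; the toy model at the bottom just illustrates the control flow:

```python
def generate(model, prompt_ids, max_new, stop_id=None):
    """Two-stage generation with a KV cache: one prefill pass over the whole
    prompt builds the cache, then each decode step feeds a single new token
    and reuses the cached keys/values instead of recomputing the prefix."""
    next_id, cache = model(prompt_ids, cache=None)      # prefill
    out = [next_id]
    for _ in range(max_new - 1):
        if next_id == stop_id:
            break
        next_id, cache = model([next_id], cache=cache)  # decode: one token
        out.append(next_id)
    return out

# Toy model: the "cache" is just the token history; predicts last token + 1.
def toy_model(ids, cache):
    cache = (cache or []) + list(ids)
    return cache[-1] + 1, cache

tokens = generate(toy_model, [1, 2, 3], max_new=4)   # -> [4, 5, 6, 7]
```

In a real Transformer the cache holds per-layer key/value tensors, so decode cost per token stays roughly constant instead of growing with sequence length.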
Performance Comparison
Base d20 model (561 M params) achieves CORE 0.22, marginally surpassing GPT‑2 Large (0.21). Planned d26 model aims for GPT‑2‑level performance (CORE ≈ 0.25‑0.26).
AI2ML AI to Machine Learning
Original articles on artificial intelligence and machine learning, deep optimization. Less is more, life is simple! Shi Chunqi
