nanochat Source Code Deep Dive: Data Prep, Model Design, Training & Evaluation
This article revisits nanochat's core components: preparation of the training datasets, the token-and-parameter scaling calculations, the attention and KV-cache design, the full training pipeline with gradient accumulation and mixed precision, a cost breakdown, inference optimizations, the evaluation suite, and known limitations with suggested improvements.
Training Data Preparation
The pipeline uses four data splits: base_train (FineWeb‑EDU, ~100 B tokens, sharded into Parquet files of roughly 250 M characters each), mid_train (SmolTalk, 460 K rows, plus MMLU 100 K and GSM8K 8 K rows), chat_sft (ARC‑Easy 2.3 K, ARC‑Challenge 1.1 K, GSM8K 8 K, and SmolTalk 10 K rows), and chat_rl (GSM8K, 8 K math problems for reinforcement learning). The target token count follows the Chinchilla scaling rule target_tokens = 20 × num_params. With num_params = 560,988,160 (~561 M), the target is ≈ 11.2 B tokens, or ≈ 54 B characters of raw text. At ~250 M characters per shard that is ~216 shards; the walkthrough downloads 240 (~24 GB) for headroom via python -m nanochat.dataset -n 240. Tokenization is performed on the fly during training, saving disk space.
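As a sanity check, the sizing arithmetic can be reproduced in a few lines. The ~4.8 characters‑per‑token ratio below is not stated explicitly in the source; it is the ratio implied by the 11.2 B‑token and 54 B‑character figures:

```python
import math

num_params = 560_988_160                  # ~561M parameters
target_tokens = 20 * num_params           # Chinchilla rule -> ~11.22B tokens
chars_per_token = 4.8                     # implied by the 11.2B-token / 54B-char figures
target_chars = target_tokens * chars_per_token
chars_per_shard = 250e6                   # ~250M characters per Parquet shard

# -> 216 shards at the bare minimum; 240 are downloaded for headroom
print(math.ceil(target_chars / chars_per_shard))
```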
Batch Size Calculation
Total batch size is derived as total_batch_size = grad_accum_steps × device_batch_size × max_seq_len × ddp_world_size. With max_seq_len = 2048, device_batch_size = 32, ddp_world_size = 8, and grad_accum_steps = 1, a single forward/backward already processes world_tokens_per_fwdbwd = 32 × 2048 × 8 = 524,288 tokens, so the total batch size equals that value times the number of accumulation steps.
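The same formula shows how accumulation compensates for fewer GPUs or smaller device batches. A minimal sketch, with variable names mirroring the article's:

```python
max_seq_len       = 2048
device_batch_size = 32
ddp_world_size    = 8

# tokens processed by one forward/backward across all ranks
world_tokens_per_fwdbwd = device_batch_size * max_seq_len * ddp_world_size  # 524,288

total_batch_size = 524_288                                      # desired tokens per optimizer step
grad_accum_steps = total_batch_size // world_tokens_per_fwdbwd  # -> 1 on 8 GPUs

# e.g. on a single GPU (ddp_world_size = 1) the same target would need 8 accumulation steps
```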
Model Architecture
The parameter count follows params ≈ 2·V·E + 12·N·E², where V is the vocab size, N the number of transformer layers, and E the embedding dimension (the 2·V·E term covers the untied token embedding and lm_head). The "aspect‑ratio" philosophy sets E = 64 × N, so depth alone determines model size. Heads are computed as num_heads = (E + 127) // 128, i.e., head_dim = 128. Grouped key/value heads follow num_kv_heads = num_heads × MQA_ratio; the default MQA_ratio = 1 is plain multi‑head attention, while a smaller ratio shrinks the key/value cache proportionally.
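A minimal sketch that reproduces these numbers; the untied‑embedding assumption is what makes the 560,988,160 figure come out exactly:

```python
def model_dims(depth: int, vocab_size: int = 65_536, head_dim: int = 128):
    """Derive model dimensions from depth, the single sizing knob."""
    n_embd = 64 * depth                                # aspect ratio: E = 64 * N
    num_heads = (n_embd + head_dim - 1) // head_dim    # ceil(E / 128)
    # untied token embedding + lm_head (2*V*E) plus ~12*E^2 per transformer layer
    num_params = 2 * vocab_size * n_embd + 12 * depth * n_embd ** 2
    return n_embd, num_heads, num_params

print(model_dims(20))  # -> (1280, 10, 560988160), matching the ~561M figure above
```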
Data Loading Strategies
base_train uses a custom dataloader.py that reads Parquet shards, applies the regex pre‑tokenizer and BPE encoding on the fly, and concatenates randomly ordered documents into blocks of 2048 tokens, with 4–8 workers for streaming. mid_train mixes tasks via mid_data_generator, chat_sft loads masked dialogues through sft_data_generator, and chat_rl employs a simplified GRPO algorithm that computes rewards and advantages from generated token sequences.
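The document‑concatenation idea is the core of the base_train loader. A simplified sketch, not nanochat's exact code, assuming a tokenizer object with an encode method:

```python
import random

def block_stream(documents, tokenizer, block_size=2048, bos_id=0):
    """Concatenate randomly ordered documents into fixed-size token blocks."""
    docs = list(documents)
    buffer = []
    while True:
        random.shuffle(docs)                        # new document order each epoch
        for doc in docs:
            buffer.extend([bos_id] + tokenizer.encode(doc))
            # emit full blocks; keep one extra token for the shifted targets
            while len(buffer) >= block_size + 1:
                chunk, buffer = buffer[:block_size + 1], buffer[block_size + 1:]
                yield chunk[:-1], chunk[1:]         # inputs, next-token targets
```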
Training Framework
The loss is computed with F.cross_entropy. The training loop combines gradient‑accumulation micro‑steps with a learning‑rate schedule. Supervised fine‑tuning masks the loss so that only assistant tokens contribute. Reinforcement learning uses a simplified GRPO: no KL penalty, on‑policy updates, and a token‑level advantage of reward minus the group mean. Mixed‑precision training is enabled via torch.amp.autocast. Cost analysis assumes 8 × H100 GPUs at $3 per GPU‑hour, so a 3.75‑hour run costs 8 × 3 × 3.75 ≈ $90.
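Schematically, the accumulation loop looks like the sketch below (model, loader, optimizer, num_steps, and grad_accum_steps are assumed to be defined elsewhere; bfloat16 is assumed as the autocast dtype, which avoids the need for a gradient scaler):

```python
import torch

autocast_ctx = torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16)

for step in range(num_steps):
    for micro_step in range(grad_accum_steps):
        x, y = next(loader)                   # (B, T) token ids and shifted targets
        with autocast_ctx:
            loss = model(x, y)                # mean cross-entropy over the micro-batch
        (loss / grad_accum_steps).backward()  # scale so accumulated gradients average
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```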
Inference Framework
The KV cache stores key/value tensors for all previously processed positions, so each step computes keys and values only for the new tokens. Both the forward pass and the token‑by‑token generator read from kv_cache, reducing compute at the expense of increased memory traffic. Fewer key/value heads (the MQA/GQA knob described above) cut that memory further.
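A simplified cache for one attention layer might look like this (illustrative shapes and names, not nanochat's exact class):

```python
import torch

class KVCache:
    def __init__(self, batch, n_kv_heads, max_len, head_dim, device, dtype):
        # preallocated (B, H_kv, T_max, D) buffers for keys and values
        self.k = torch.empty(batch, n_kv_heads, max_len, head_dim,
                             device=device, dtype=dtype)
        self.v = torch.empty_like(self.k)
        self.pos = 0  # number of tokens cached so far

    def insert(self, k_new, v_new):
        """Append keys/values for the new tokens; return the full cached prefix."""
        t = k_new.size(2)
        self.k[:, :, self.pos:self.pos + t] = k_new
        self.v[:, :, self.pos:self.pos + t] = v_new
        self.pos += t
        return self.k[:, :, :self.pos], self.v[:, :, :self.pos]
```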
Evaluation Suite
Benchmarks include ARC (science MCQs), MMLU (57 subjects), GSM8K (math word problems with tool use), HumanEval (Python code generation), and SmolTalk (general dialogue). Classification tasks feed the prompt and all options to the model, extract logits for option letters, and select the highest‑logit answer, enabling batch processing. Generation tasks produce full outputs, which are parsed and scored; they are inherently slower due to autoregressive decoding. Loss is measured in bits‑per‑byte (BPB) rather than perplexity to allow fair comparison across tokenizers.
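The bits‑per‑byte conversion is simple enough to state exactly. A sketch, assuming the cross‑entropy loss is summed in nats over the eval set:

```python
import math

def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """total_nats: summed cross-entropy (in nats) over the eval set;
    total_bytes: UTF-8 byte length of the corresponding raw text."""
    return total_nats / math.log(2) / total_bytes
```

Normalizing by raw bytes rather than tokens is what makes the metric comparable across tokenizers with different vocabulary sizes.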
Deployment & Runtime
Checkpoints are saved to and loaded from a consistent directory layout that organizes model files by training stage, and an OpenAI‑compatible FastAPI server wraps the model for inference.
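In OpenAI‑compatible terms, the serving surface reduces to a single chat‑completions route. A toy sketch (render_chat and generate are hypothetical placeholders, not nanochat functions; the real server differs in detail):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    messages: list[dict]      # [{"role": "user", "content": "..."}, ...]
    max_tokens: int = 256

@app.post("/v1/chat/completions")
def chat(req: ChatRequest):
    prompt_tokens = render_chat(req.messages)             # hypothetical: apply chat template
    completion = generate(prompt_tokens, req.max_tokens)  # hypothetical: autoregressive decode
    return {"choices": [{"message": {"role": "assistant", "content": completion}}]}
```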
Limitations & Recommendations
Fixed sequence length (2,048 tokens) wastes computation on short inputs – consider variable‑length batching.
Lack of instruction‑tuning – add intermediate stages with Alpaca/Dolly data.
No RLHF – implement a generic RL framework with preference learning.
Potential enhancements: integrate FlashAttention, adopt sliding‑window attention for longer contexts.
Summary
In under 8,000 lines of code, nanochat implements a complete pipeline: tokenizer → pre‑training → mid‑training → SFT → reinforcement learning → deployment. It achieves production‑grade speed and efficiency, distributed mixed‑precision training, clear documentation of each design decision, minimal dependencies (PyTorch plus a few high‑quality libraries), and full reproducibility via fixed seeds and comprehensive configuration logs.