llm.c: A 1000‑Line C Implementation for Training GPT‑2
Andrej Karpathy’s llm.c project demonstrates how a compact, pure‑C (and CUDA) codebase of roughly 1000 lines can train a GPT‑2 model. The write‑up covers data preparation, memory management, layer implementations, compilation, and practical tips for running and testing the model on CPUs and GPUs.
Andrej Karpathy, a founding member of OpenAI, announced a personal project called llm.c, a minimalist C/CUDA implementation that can train GPT‑2 in roughly 1000 lines of code, without relying on large frameworks such as PyTorch or Python.
The repository quickly rose to the top of Hacker News and has earned over 2600 GitHub stars. A notable design choice: the implementation allocates all required memory up front as a single one‑dimensional array, keeping memory usage constant throughout training and simplifying pointer arithmetic.
Key components implemented manually include layer‑norm (forward and backward), encoder, matrix multiplication, self‑attention, GELU, residual connections, softmax, and cross‑entropy loss. Karpathy notes that writing these low‑level routines is tedious and error‑prone because every pointer and tensor offset must be correct.
To get started, download the tiny Shakespeare dataset and tokenize it:
python prepro_tinyshakespeare.py

The script produces binary files containing int32 token IDs:
Saved 32768 tokens to data/tiny_shakespeare_val.bin
Saved 305260 tokens to data/tiny_shakespeare_train.bin

Because the pure‑C reference is slow on CPU/fp32, Karpathy uses pretrained GPT‑2 (124M) weights from OpenAI for initialization. The weights are saved as gpt2_124M.bin (raw model parameters) and gpt2_124M_debug_state.bin (which also stores inputs, targets, logits, and loss for debugging).
Compile the training binary:
make train_gpt2

Run it with an appropriate thread count (e.g., 8 threads on an 8‑core CPU):
OMP_NUM_THREADS=8 ./train_gpt2

The training loop performs Adam updates at a learning rate of 1e‑4, periodically printing the loss and generating sample text. The sampled output is raw token IDs, which can be decoded with a GPT‑2 tokenizer such as tiktoken:
import tiktoken
enc = tiktoken.get_encoding("gpt2")
print(enc.decode(list(map(int, "50256 16773 18162 ...".split()))))

Karpathy is porting the code to CUDA to improve speed, and plans to add support for modern architectures such as Llama 2, Gemma, and Mistral, along with mixed‑precision (fp16) training and RoPE extensions.
Unit tests are provided to verify that the C implementation matches the PyTorch reference. Build and run the tests with:
make test_gpt2
./test_gpt2

The project has sparked discussion about a possible “C language renaissance” in AI research, emphasizing simplicity, transparency, and minimal dependencies.