How to Build and Train Sub‑1B Language Models from Scratch: Resources & Tips
This guide compiles open-source repositories, research papers, and practical tricks for training small language models under 1 billion parameters, helping readers learn by reproducing models such as nanoGPT, TinyLlama, Phi-1.5, and more.
The premise: the best way to understand LLMs is to build one from the ground up, so the list below focuses on resources that remain feasible on limited hardware.
nanoGPT – Andrej Karpathy's minimal yet complete implementation of GPT-2, reproducing all four official sizes from 0.1 B (124 M) to 1.5 B parameters. Related tutorials on pretraining GPT-2 from scratch:
https://www.kaggle.com/code/pritishmishra/gpt-training-on-wikipedia-dataset-from-scratch
https://zhuanlan.zhihu.com/p/79714797
https://zhuanlan.zhihu.com/p/606339093
https://finisky.github.io/2020/05/01/pretrainchinesegpt/
https://zhuanlan.zhihu.com/p/656758138
https://github.com/minimalist-nlp/gpt2-text-generation
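The four GPT-2 sizes differ only in depth and width, so a quick way to sanity-check a configuration before training is to count parameters from the hyperparameters. The sketch below assumes the standard GPT-2 layout (learned positional embeddings, tied input/output embeddings, 4× MLP expansion); the helper name `gpt2_param_count` is illustrative, not taken from nanoGPT.

```python
# Parameter count for a GPT-2-style transformer with tied embeddings.
# Assumes learned positional embeddings and a 4x MLP expansion,
# as in the original GPT-2 / nanoGPT layout.

def gpt2_param_count(n_layer, n_embd, vocab_size=50257, block_size=1024):
    tok_emb = vocab_size * n_embd           # wte (shared with lm_head when tied)
    pos_emb = block_size * n_embd           # wpe
    per_layer = (
        2 * 2 * n_embd                      # two LayerNorms (weight + bias each)
        + n_embd * 3 * n_embd + 3 * n_embd  # fused qkv projection
        + n_embd * n_embd + n_embd          # attention output projection
        + n_embd * 4 * n_embd + 4 * n_embd  # MLP up-projection
        + 4 * n_embd * n_embd + n_embd      # MLP down-projection
    )
    final_ln = 2 * n_embd
    return tok_emb + pos_emb + n_layer * per_layer + final_ln

# The four GPT-2 sizes nanoGPT reproduces:
for name, n_layer, n_embd in [("gpt2", 12, 768), ("gpt2-medium", 24, 1024),
                              ("gpt2-large", 36, 1280), ("gpt2-xl", 48, 1600)]:
    print(f"{name}: {gpt2_param_count(n_layer, n_embd) / 1e6:.1f}M")
```

Running this recovers the familiar 124 M / 355 M / 774 M / 1558 M totals, which is a useful check before committing GPU-days to a run.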
TinyLlama – a 1.1 B miniature Llama trained on roughly 3 T tokens over 90 days on 16 × A100-40G GPUs; it adopts the same architecture and tokenizer as Llama 2, so it can serve as a drop-in replacement.
Pythia – EleutherAI's model suite spanning 14 M to 12 B parameters, released with intermediate checkpoints for academic research on training dynamics.
OLMo – AllenAI’s open‑source LLM with 1B and 7B variants, providing full training data, code, and checkpoints.
Qwen1.5 – Alibaba's LLM family, regarded as a top performer on Chinese tasks; its smallest variant is 0.5 B.
Phi-1.5 – Microsoft's 1.3 B model trained on high-quality, textbook-style data; its predecessor Phi-1 (1.3 B, plus a 350 M small variant) was trained in four days on eight A100 GPUs using roughly 7 B tokens (6 B filtered web plus 1 B synthetic). Phi-2 (2.7 B) followed without a formal paper.
OpenELM – Apple’s suite of models ranging from 0.27 B to 3 B, targeting mobile deployment.
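The Phi models' main lesson is that aggressive data curation lets a small model punch above its weight. Phi's actual pipeline relies on an LLM-based quality classifier, but most pipelines start with cheap heuristics; the `looks_clean` sketch below is illustrative only, and its thresholds are made-up examples, not values from any paper.

```python
# Illustrative heuristic pre-filter for pretraining text, in the spirit of
# quality-first data curation. Phi itself uses an LLM-based classifier on
# top of heuristics; the thresholds here are arbitrary examples.

def looks_clean(text: str) -> bool:
    words = text.split()
    if len(words) < 20:                  # too short to be real prose
        return False
    alpha = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha < 0.6:                      # mostly symbols/digits -> likely boilerplate
        return False
    avg_len = sum(len(w) for w in words) / len(words)
    if not (3 <= avg_len <= 12):         # gibberish or token soup
        return False
    if max(text.count(ch) for ch in set(text)) / len(text) > 0.2:
        return False                     # one character dominates (ASCII art, padding)
    return True
```

In practice such filters are tuned per corpus and followed by deduplication and model-based scoring; the point is that for sub-1B models, what you feed in matters as much as how long you train.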
Community projects and smaller‑scale experiments include:
https://github.com/charent/ChatLM-mini-Chinese – 0.2 B Chinese model based on T5.
https://github.com/jiahe7ay/MINI_LLM – 1.4 B Chinese model built on Qwen.
https://github.com/DLLXW/baby-llama2-chinese – Llama-2-based Chinese model; planned at 0.5 B but currently capped at 0.2 B.
https://github.com/OpenBMB/MiniCPM – 2.7 B model claimed to rival Mistral‑7B.
https://github.com/Chinese-Tiny-LLM/Chinese-Tiny-LLM – 2 B Chinese model still in training.
https://github.com/keeeeenw/MicroLlama – 0.3 B Llama variant, a further miniaturization of TinyLlama.
https://github.com/zhanshijinwat/Steel-LLM – Planned pre‑training project, not yet started.
Additional practical tips and resources for training small models:
Book "Build a LLM from Scratch" (13k ★ on GitHub, still in progress).
Awesome Chinese LLM – a curated list of Chinese-language models and datasets.
Paper "MobileLLM" – training tricks for compact models.
Article "Llama from Scratch" – analysis of key Llama components.
"Rethinking Optimization and Architecture for Tiny Language Models" – detailed review (https://zhuanlan.zhihu.com/p/681614203).
MNBVC – a massive open-source Chinese corpus for pretraining.
RedPajama – an open replication of Llama's pretraining dataset.
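Whichever corpus you pick (MNBVC, RedPajama, etc.), pretraining loops typically concatenate tokenized documents with an end-of-text separator and slice the stream into fixed-length blocks, so no compute is wasted on padding. A minimal stdlib sketch, where the token IDs and the `EOT` value are placeholders rather than any particular tokenizer's vocabulary:

```python
# Pack variable-length tokenized documents into fixed-length training blocks.
# Standard pretraining trick: concatenate docs with an end-of-text token,
# then cut the stream every block_size tokens; leftover tokens are dropped.
# EOT = 0 is a placeholder id, not any real tokenizer's value.

EOT = 0

def pack_documents(docs, block_size):
    stream = []
    for doc in docs:                 # docs: iterable of token-id lists
        stream.extend(doc)
        stream.append(EOT)           # separator so documents don't bleed together
    return [stream[i:i + block_size]
            for i in range(0, len(stream) - block_size + 1, block_size)]

# Example: three short "documents" packed into blocks of 8 tokens.
docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10, 11, 12]]
print(pack_documents(docs, 8))       # -> [[1, 2, 3, 0, 4, 5, 6, 7]]
```

Production pipelines do the same thing in a streaming fashion over memory-mapped token files (as nanoGPT does with `.bin` shards), but the slicing logic is identical.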