How to Build and Train Sub‑1B Language Models from Scratch: Resources & Tips
This guide compiles open-source repositories, research papers, and practical tricks for training small language models under 1 billion parameters, helping readers learn by reproducing models such as nanoGPT, TinyLlama, Phi-1.5, and more.
The premise: the best way to understand LLMs is to build one from the ground up, so the list below focuses on resources that remain feasible on limited hardware.
nanoGPT – Andrej Karpathy's minimal yet complete implementation of GPT-2, reproducing all four official sizes from 0.1 B (124 M) to 1.5 B parameters. Related tutorials on pretraining GPT-2 from scratch:
https://www.kaggle.com/code/pritishmishra/gpt-training-on-wikipedia-dataset-from-scratch
https://zhuanlan.zhihu.com/p/79714797
https://zhuanlan.zhihu.com/p/606339093
https://finisky.github.io/2020/05/01/pretrainchinesegpt/
https://zhuanlan.zhihu.com/p/656758138
https://github.com/minimalist-nlp/gpt2-text-generation
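The four GPT-2 sizes differ only in depth and width, so a quick way to sanity-check a configuration before training is to count parameters from the hyperparameters. The sketch below assumes the standard GPT-2 layout (learned positional embeddings, tied input/output embeddings, 4× MLP expansion); the helper name `gpt2_param_count` is illustrative, not taken from nanoGPT.

```python
# Parameter count for a GPT-2-style transformer with tied embeddings.
# Assumes learned positional embeddings and a 4x MLP expansion,
# as in the original GPT-2 / nanoGPT layout.

def gpt2_param_count(n_layer, n_embd, vocab_size=50257, block_size=1024):
    tok_emb = vocab_size * n_embd           # wte (shared with lm_head when tied)
    pos_emb = block_size * n_embd           # wpe
    per_layer = (
        2 * 2 * n_embd                      # two LayerNorms (weight + bias each)
        + n_embd * 3 * n_embd + 3 * n_embd  # fused qkv projection
        + n_embd * n_embd + n_embd          # attention output projection
        + n_embd * 4 * n_embd + 4 * n_embd  # MLP up-projection
        + 4 * n_embd * n_embd + n_embd      # MLP down-projection
    )
    final_ln = 2 * n_embd
    return tok_emb + pos_emb + n_layer * per_layer + final_ln

# The four GPT-2 sizes nanoGPT reproduces:
for name, n_layer, n_embd in [("gpt2", 12, 768), ("gpt2-medium", 24, 1024),
                              ("gpt2-large", 36, 1280), ("gpt2-xl", 48, 1600)]:
    print(f"{name}: {gpt2_param_count(n_layer, n_embd) / 1e6:.1f}M")
```

Running this recovers the familiar 124 M / 355 M / 774 M / 1558 M totals, which is a useful check before committing GPU-days to a run.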
TinyLlama – a 1.1 B miniature Llama trained on roughly 3 T tokens over 90 days on 16 × A100-40G GPUs; it adopts the same architecture and tokenizer as Llama 2, so it can serve as a drop-in replacement.
Pythia – EleutherAI's model suite spanning 14 M to 12 B parameters, released with intermediate checkpoints for academic research on training dynamics.
OLMo – AllenAI’s open‑source LLM with 1B and 7B variants, providing full training data, code, and checkpoints.
Qwen1.5 – Alibaba's LLM family, regarded as a top performer on Chinese tasks; its smallest variant is 0.5 B.
Phi-1.5 – Microsoft's 1.3 B model trained on high-quality, textbook-style data; its predecessor Phi-1 (1.3 B, plus a 350 M small variant) was trained in four days on eight A100 GPUs using roughly 7 B tokens (6 B filtered web plus 1 B synthetic). Phi-2 (2.7 B) followed without a formal paper.
OpenELM – Apple’s suite of models ranging from 0.27 B to 3 B, targeting mobile deployment.
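The Phi models' main lesson is that aggressive data curation lets a small model punch above its weight. Phi's actual pipeline relies on an LLM-based quality classifier, but most pipelines start with cheap heuristics; the `looks_clean` sketch below is illustrative only, and its thresholds are made-up examples, not values from any paper.

```python
# Illustrative heuristic pre-filter for pretraining text, in the spirit of
# quality-first data curation. Phi itself uses an LLM-based classifier on
# top of heuristics; the thresholds here are arbitrary examples.

def looks_clean(text: str) -> bool:
    words = text.split()
    if len(words) < 20:                  # too short to be real prose
        return False
    alpha = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha < 0.6:                      # mostly symbols/digits -> likely boilerplate
        return False
    avg_len = sum(len(w) for w in words) / len(words)
    if not (3 <= avg_len <= 12):         # gibberish or token soup
        return False
    if max(text.count(ch) for ch in set(text)) / len(text) > 0.2:
        return False                     # one character dominates (ASCII art, padding)
    return True
```

In practice such filters are tuned per corpus and followed by deduplication and model-based scoring; the point is that for sub-1B models, what you feed in matters as much as how long you train.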
Community projects and smaller‑scale experiments include:
https://github.com/charent/ChatLM-mini-Chinese – 0.2 B Chinese model based on T5.
https://github.com/jiahe7ay/MINI_LLM – 1.4 B Chinese model built on Qwen.
https://github.com/DLLXW/baby-llama2-chinese – Llama-2-based Chinese model; planned at 0.5 B but currently capped at 0.2 B.
https://github.com/OpenBMB/MiniCPM – 2.7 B model claimed to rival Mistral‑7B.
https://github.com/Chinese-Tiny-LLM/Chinese-Tiny-LLM – 2 B Chinese model still in training.
https://github.com/keeeeenw/MicroLlama – 0.3 B Llama variant, a further miniaturization of TinyLlama.
https://github.com/zhanshijinwat/Steel-LLM – Planned pre‑training project, not yet started.
Additional practical tips and resources for training small models:
Book "Build a LLM from Scratch" (13k ★ on GitHub, still in progress).
Awesome Chinese LLM – a curated list of Chinese-language models and datasets.
Paper "MobileLLM" – training tricks for compact models.
Article "Llama from Scratch" – analysis of key Llama components.
"Rethinking Optimization and Architecture for Tiny Language Models" – detailed review (https://zhuanlan.zhihu.com/p/681614203).
MNBVC – a massive open-source Chinese corpus for pretraining.
RedPajama – an open replication of Llama's pretraining dataset.
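Whichever corpus you pick (MNBVC, RedPajama, etc.), pretraining loops typically concatenate tokenized documents with an end-of-text separator and slice the stream into fixed-length blocks, so no compute is wasted on padding. A minimal stdlib sketch, where the token IDs and the `EOT` value are placeholders rather than any particular tokenizer's vocabulary:

```python
# Pack variable-length tokenized documents into fixed-length training blocks.
# Standard pretraining trick: concatenate docs with an end-of-text token,
# then cut the stream every block_size tokens; leftover tokens are dropped.
# EOT = 0 is a placeholder id, not any real tokenizer's value.

EOT = 0

def pack_documents(docs, block_size):
    stream = []
    for doc in docs:                 # docs: iterable of token-id lists
        stream.extend(doc)
        stream.append(EOT)           # separator so documents don't bleed together
    return [stream[i:i + block_size]
            for i in range(0, len(stream) - block_size + 1, block_size)]

# Example: three short "documents" packed into blocks of 8 tokens.
docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10, 11, 12]]
print(pack_documents(docs, 8))       # -> [[1, 2, 3, 0, 4, 5, 6, 7]]
```

Production pipelines do the same thing in a streaming fashion over memory-mapped token files (as nanoGPT does with `.bin` shards), but the slicing logic is identical.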