From Zero to One: A Practical Guide to Pretraining Large Language Models
This comprehensive guide walks through every stage of LLM pretraining—from data sourcing, cleaning, and deduplication, to tokenizer design, model architecture choices, training framework selection, optimization tricks, and evaluation methods—offering actionable tips and pitfalls to avoid.
Background
Large dense models (e.g., Qwen), MoE models (e.g., DeepSeek), and small models (e.g., MiniCPM) dominate the landscape, yet reproducing comparable performance at the same scale remains difficult. Truly open pretraining, with data, code, and recipes released end to end, is still a long way off, so mastering the full pretraining pipeline is valuable for both individuals and companies.
Data
Data Collection
Start with roughly 10 TB of raw text, expanding continuously as training progresses. Sources include web crawling, commercial data providers, and existing open datasets such as FineWeb, The Pile, Skypile, and RedPajama. Expect challenges like IP blocking, slow downloads, and the need for multi‑node data splitting.
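If you pull from open datasets, streaming the download keeps raw shards from having to fit on local disk before cleaning. Below is a minimal sketch assuming the Hugging Face datasets library and the FineWeb sample config on the Hub; the dataset name, field name, and shard size are illustrative, not prescriptive.

```python
import json
from datasets import load_dataset

# Stream an open corpus (here: the FineWeb 10BT sample) instead of downloading it whole.
stream = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                      split="train", streaming=True)

with open("raw_shard_000.jsonl", "w", encoding="utf-8") as f:
    for i, record in enumerate(stream):
        f.write(json.dumps({"text": record["text"]}, ensure_ascii=False) + "\n")
        if i >= 1_000_000:  # cut a shard here and hand it to the cleaning stage
            break
```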
Data quality varies; high‑knowledge‑density sources (e.g., classical poetry) are far more valuable than generic news. Synthetic high‑density data—summarizing long articles into concise passages—can boost training speed by an order of magnitude.
Data Cleaning
Cleaning is critical. Use a BERT‑based scoring model to rank data quality, then apply heuristic rules to filter out low‑quality content (code, markdown, LaTeX, URLs, sensitive keywords, etc.). Combine scoring with rule‑based filters while ensuring the filtered dataset remains representative.
Data de‑identification (removing personal information, references, etc.) must be performed with regular expressions or custom scripts.
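A rough sketch of how quality scoring, heuristic filtering, and de‑identification can sit in a single pass; the score threshold, length cutoff, and regex patterns below are illustrative assumptions, not the rules from any particular pipeline.

```python
import re

# Illustrative heuristic filters and PII scrubbing; thresholds and patterns are assumptions.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"(?<!\d)(?:\+?86[- ]?)?1\d{10}(?!\d)")  # mainland-China mobile numbers
URL = re.compile(r"https?://\S+")

def clean_document(text: str, quality_score: float) -> str | None:
    """Return the scrubbed document, or None if it should be dropped."""
    if quality_score < 0.5:          # score from the BERT-based ranker
        return None
    if len(text) < 200:              # too short to carry much knowledge
        return None
    if text.count("http") / max(len(text.split()), 1) > 0.2:
        return None                  # link farms / navigation pages
    text = EMAIL.sub("<email>", text)
    text = PHONE.sub("<phone>", text)
    text = URL.sub("<url>", text)
    return text
```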
Deduplication
Perform large‑scale deduplication (sentence‑level or document‑level) using a MapReduce framework such as Hadoop or Spark, implementing a simple MinHash algorithm. Adjust similarity thresholds based on target data volume (e.g., 80 % for 10 TB, 90 % for 5 TB).
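A single‑node sketch of MinHash deduplication using the datasketch library; at scale the same signatures are computed in a Spark or Hadoop map stage and candidate pairs are grouped by LSH band keys. The character 5‑gram shingling and 0.8 threshold are illustrative.

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128

def signature(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for shingle in {text[i:i + 5] for i in range(len(text) - 4)}:  # character 5-grams
        m.update(shingle.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)  # ~80 % similarity cutoff
kept = []
for doc_id, text in enumerate(documents):           # `documents` comes from the cleaning stage
    m = signature(text)
    if not lsh.query(m):                             # no near-duplicate seen so far
        lsh.insert(str(doc_id), m)
        kept.append(doc_id)
```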
Data Balancing
Train a lightweight classifier (BERT family) to assign each document to categories like news, encyclopedia, code, or markdown. Apply different cleaning and deduplication thresholds per category, and maintain a balanced mix (e.g., Chinese : English : code = 4 : 4 : 2).
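A toy sketch of ratio‑based sampling once every document carries a category label from the classifier; the document pools and the 4 : 4 : 2 weights are placeholders matching the mix above.

```python
import random

# Placeholder pools; in practice each holds cleaned documents from one category.
pools = {
    "zh": ["中文文档…"],
    "en": ["english document…"],
    "code": ["def f(): pass"],
}
weights = {"zh": 4, "en": 4, "code": 2}  # Chinese : English : code = 4 : 4 : 2

def sample_batch(n: int) -> list[str]:
    cats = random.choices(list(weights), weights=list(weights.values()), k=n)
    return [random.choice(pools[c]) for c in cats]
```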
Data Ordering
Curriculum learning matters: ordering data by semantic similarity can improve convergence. One practical approach is to concatenate the most similar documents, as described in the In‑Context Pretraining paper (https://arxiv.org/pdf/2310.10638).
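The paper builds a document similarity graph and solves a travelling‑salesman‑style ordering; the sketch below is a simpler greedy nearest‑neighbour approximation over precomputed document embeddings, just to illustrate the idea.

```python
import numpy as np

# Greedy ordering in the spirit of In-Context Pretraining: start from one document
# and repeatedly append the most similar unused one, so adjacent documents in a
# packed training sequence share context.
# `embs` is an (N, d) array of L2-normalised document embeddings (assumed given).
def order_by_similarity(embs: np.ndarray) -> list[int]:
    n = embs.shape[0]
    unused = set(range(1, n))
    order, current = [0], 0
    while unused:
        cand = np.fromiter(unused, dtype=int)
        sims = embs[cand] @ embs[current]
        current = int(cand[np.argmax(sims)])
        unused.remove(current)
        order.append(current)
    return order
```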
Data Pipeline
Implement a dynamic data loader that streams token IDs directly to the trainer. Separate processes for data preparation and model training keep GPUs busy, and checkpoint metadata should record how many times each document has been used.
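A minimal streaming loader in plain PyTorch, assuming shards of pre‑tokenized documents stored as JSONL with an input_ids field; a real pipeline would add shuffling, multi‑worker sharding, and per‑document usage counters in the checkpoint metadata.

```python
import json
import torch
from torch.utils.data import IterableDataset, DataLoader

class PackedTokenStream(IterableDataset):
    """Reads pre-tokenized shards and packs them into fixed-length blocks."""
    def __init__(self, shard_paths, seq_len=4096):
        self.shard_paths, self.seq_len = shard_paths, seq_len

    def __iter__(self):
        buf = []
        for path in self.shard_paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    buf.extend(json.loads(line)["input_ids"])
                    while len(buf) >= self.seq_len:
                        chunk, buf = buf[:self.seq_len], buf[self.seq_len:]
                        yield torch.tensor(chunk, dtype=torch.long)

# The worker process prepares batches while the GPU trains on the previous ones.
loader = DataLoader(PackedTokenStream(["shard_000.jsonl"]), batch_size=8, num_workers=1)
```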
Training
Tokenizer
Build a tokenizer on a large common‑crawl corpus using BPE/BBPE. Pay attention to numeric tokenization, compression ratio (≈1.5 Chinese characters per token), removal of toxic tokens, and inclusion of domain‑specific tokens (e.g., medical terms). Ensure the vocab size is a multiple of 128 and leaves a ~1k buffer for the embedding size.
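A sketch of training a byte‑level BPE tokenizer with the Hugging Face tokenizers library; the vocab size (a multiple of 128), special tokens, and corpus file names are example values.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=151_552,                       # example value, a multiple of 128
    special_tokens=["<|endoftext|>", "<pad>"],
)
tokenizer.train(["corpus_zh.txt", "corpus_en.txt", "corpus_code.txt"], trainer)
tokenizer.save("tokenizer.json")
```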
Model Architecture
Adopt a proven architecture such as LLaMA (RoPE, GQA, RMSNorm, SwiGLU). For models around 1 B parameters, share embedding and lm_head weights; larger models can keep them separate. Avoid unnecessary innovations unless you have robust experimental backing.
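As a concrete reference point, here is an illustrative configuration in roughly the 1 B range using the Hugging Face LlamaConfig, with GQA and tied embedding/lm_head weights; the exact dimensions are examples, not recommendations.

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=151_552,
    hidden_size=2048,
    intermediate_size=5632,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=4,        # GQA: 4 KV heads shared by 16 query heads
    max_position_embeddings=4096,
    rope_theta=10_000.0,
    tie_word_embeddings=True,     # share embedding and lm_head at small scale
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```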
Hyper‑parameters
Layer count, hidden size, and head count should be divisible by small powers of two (e.g., hidden size a multiple of 64 or 128, attention heads divisible by the tensor‑parallel degree) so the model splits evenly across parallelism dimensions.
Seq_len should start modest (4K–8K) and increase later; avoid extreme lengths (32K+) without sufficient compute.
Training Framework
For full‑scale pretraining (trillion‑token scale), Megatron is recommended; for continued pretraining, DeepSpeed is acceptable. Both should use FlashAttention for efficiency.
Megatron advantages: fast tensor/pipeline parallelism, clear configuration, quick model loading.
Megatron drawbacks: steep learning curve, buggy NVIDIA codebase.
DeepSpeed advantages: simple API, strong community support for alignment.
DeepSpeed drawbacks: slower training speed, slower checkpoint loading, limited low‑level control.
Training Tricks
Prioritize data‑parallelism; minimize inter‑node communication.
Avoid unnecessary offloading or recomputation.
Monitor loss per data channel (Chinese knowledge, English knowledge, code); a logging sketch follows this list.
Watch for loss spikes; they often indicate data or optimizer issues.
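A sketch of the per‑channel loss logging mentioned above, assuming each sample in a batch carries a domain id; the channel ids and the in‑memory log are illustrative, and in practice the values would go to a metrics dashboard.

```python
import torch
from collections import defaultdict

# Each sample carries a domain id (0 = zh knowledge, 1 = en knowledge, 2 = code),
# so a loss spike can be traced to a specific slice of the data rather than the
# global average.
channel_loss = defaultdict(list)

def log_channel_losses(logits, labels, domains, step):
    # per-token cross entropy, then averaged per sample
    losses = torch.nn.functional.cross_entropy(
        logits[:, :-1].transpose(1, 2), labels[:, 1:], reduction="none"
    ).mean(dim=1)
    for dom, loss in zip(domains.tolist(), losses.tolist()):
        channel_loss[dom].append((step, loss))
```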
Training Schedule
Warm‑up phase: gradually increase learning rate.
Mid‑stage: cosine decay or constant learning rate, tuned on small models (a warmup‑plus‑cosine sketch follows this list).
Late stage: increase the RoPE base and seq_len to handle longer contexts.
Final annealing: use high‑quality data and instruction‑fine‑tuning before benchmarking.
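A minimal warmup‑plus‑cosine schedule matching the first two phases above; the step counts and the 10 % floor are illustrative values to tune on small runs first.

```python
import math

def lr_lambda(step, warmup_steps=2_000, total_steps=500_000, min_ratio=0.1):
    """Linear warmup, then cosine decay down to min_ratio of the peak LR."""
    if step < warmup_steps:
        return step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_ratio + 0.5 * (1 - min_ratio) * (1 + math.cos(math.pi * progress))

# Example hookup with a PyTorch optimizer:
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```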
Evaluation
Perplexity (PPL)
Track loss on held‑out knowledge, code, and logic test sets. Aim for overall PPL < 2 on generic knowledge; compare only against your own checkpoints because tokenizer compression rates differ.
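Perplexity is just the exponential of the mean token‑level cross‑entropy on the held‑out set, so it can be computed directly from evaluation loss; the sketch below assumes an HF‑style causal LM that returns .loss when given labels.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, batches):
    total_loss, total_tokens = 0.0, 0
    for input_ids in batches:                       # (B, T) token id tensors
        out = model(input_ids, labels=input_ids)    # HF-style causal LM loss
        n = input_ids.numel() - input_ids.shape[0]  # tokens that get a prediction
        total_loss += out.loss.item() * n
        total_tokens += n
    return math.exp(total_loss / total_tokens)
```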
Benchmarking
Standard benchmarks are often static multiple‑choice tests. To obtain a more realistic signal, transform them into generative formats (e.g., “Question + options; answer the question”). Use accuracy rather than BLEU/ROUGE.
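One way to recast a multiple‑choice item into a generative prompt and grade it by accuracy; the item schema ("question", "options") is an assumption about your evaluation files.

```python
def to_generative_prompt(item: dict) -> str:
    """Format a multiple-choice item as a free-form generation task."""
    options = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
    return f"Question: {item['question']}\n{options}\nAnswer with the option letter:"

def is_correct(generated: str, gold_letter: str) -> bool:
    """Grade by whether the first generated character matches the gold option."""
    return generated.strip()[:1].upper() == gold_letter.upper()
```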
Probability Probes
Measure token or sentence probabilities for targeted facts (e.g., Prob('Beijing'|'China's capital is')). Observe trends over training to detect knowledge gain or forgetting.
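A sketch of such a probe with an HF‑style causal LM: score the log‑probability of a fixed continuation given a prompt and track it across checkpoints. The checkpoint path is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("./ckpt-100000")
model = AutoModelForCausalLM.from_pretrained("./ckpt-100000").eval()

@torch.no_grad()
def continuation_logprob(prompt: str, continuation: str) -> float:
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    cont_ids = tok(continuation, add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, cont_ids], dim=1)
    logprobs = model(ids).logits[:, :-1].log_softmax(dim=-1)
    # log-probs of the continuation tokens, conditioned on everything before them
    cont_logprobs = logprobs[:, prompt_ids.shape[1] - 1:]
    return cont_logprobs.gather(-1, cont_ids.unsqueeze(-1)).sum().item()

print(continuation_logprob("China's capital is", " Beijing"))
```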
Conclusion
Pretraining involves many equally important steps; data work often yields the biggest gains, while training code can be relatively straightforward once the pipeline is stable. The guide aims to equip practitioners with a practical, end‑to‑end checklist for building their own LLMs.
Original article: https://zhuanlan.zhihu.com/p/718354385
