From Zero to One: A Practical Guide to Pretraining Large Language Models
This comprehensive guide walks through every stage of LLM pretraining—from data sourcing, cleaning, and deduplication, to tokenizer design, model architecture choices, training framework selection, optimization tricks, and evaluation methods—offering actionable tips and pitfalls to avoid.
Background
Large dense models (e.g., Qwen), MoE models (e.g., DeepSeek), and small models (e.g., MiniCPM) dominate the landscape, yet reproducing comparable performance at the same scale remains difficult. Truly open pretraining, with data, code, and recipes released end to end, is still a long way off, so mastering the full pretraining pipeline is valuable for both individuals and companies.
Data
Data Collection
Start with roughly 10 TB of raw text, expanding continuously as training progresses. Sources include web crawling, commercial data providers, and existing open datasets such as FineWeb, The Pile, Skypile, and RedPajama. Expect challenges like IP blocking, slow downloads, and the need for multi‑node data splitting.
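If you pull from open datasets, streaming the download keeps raw shards from having to fit on local disk before cleaning. Below is a minimal sketch assuming the Hugging Face datasets library and the FineWeb sample config on the Hub; the dataset name, field name, and shard size are illustrative, not prescriptive.

```python
import json
from datasets import load_dataset

# Stream an open corpus (here: the FineWeb 10BT sample) instead of downloading it whole.
stream = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                      split="train", streaming=True)

with open("raw_shard_000.jsonl", "w", encoding="utf-8") as f:
    for i, record in enumerate(stream):
        f.write(json.dumps({"text": record["text"]}, ensure_ascii=False) + "\n")
        if i >= 1_000_000:  # cut a shard here and hand it to the cleaning stage
            break
```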
Data quality varies; high‑knowledge‑density sources (e.g., classical poetry) are far more valuable than generic news. Synthetic high‑density data—summarizing long articles into concise passages—can boost training speed by an order of magnitude.
Data Cleaning
Cleaning is critical. Use a BERT‑based scoring model to rank data quality, then apply heuristic rules to filter out low‑quality content (code, markdown, LaTeX, URLs, sensitive keywords, etc.). Combine scoring with rule‑based filters while ensuring the filtered dataset remains representative.
Data de‑identification (removing personal information, references, etc.) must be performed with regular expressions or custom scripts.
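A rough sketch of how quality scoring, heuristic filtering, and de‑identification can sit in a single pass; the score threshold, length cutoff, and regex patterns below are illustrative assumptions, not the rules from any particular pipeline.

```python
import re

# Illustrative heuristic filters and PII scrubbing; thresholds and patterns are assumptions.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"(?<!\d)(?:\+?86[- ]?)?1\d{10}(?!\d)")  # mainland-China mobile numbers
URL = re.compile(r"https?://\S+")

def clean_document(text: str, quality_score: float) -> str | None:
    """Return the scrubbed document, or None if it should be dropped."""
    if quality_score < 0.5:          # score from the BERT-based ranker
        return None
    if len(text) < 200:              # too short to carry much knowledge
        return None
    if text.count("http") / max(len(text.split()), 1) > 0.2:
        return None                  # link farms / navigation pages
    text = EMAIL.sub("<email>", text)
    text = PHONE.sub("<phone>", text)
    text = URL.sub("<url>", text)
    return text
```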
Deduplication
Perform large‑scale deduplication (sentence‑level or document‑level) using a MapReduce framework such as Hadoop or Spark, implementing a simple MinHash algorithm. Adjust similarity thresholds based on target data volume (e.g., 80 % for 10 TB, 90 % for 5 TB).
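A single‑node sketch of MinHash deduplication using the datasketch library; at scale the same signatures are computed in a Spark or Hadoop map stage and candidate pairs are grouped by LSH band keys. The character 5‑gram shingling and 0.8 threshold are illustrative.

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128

def signature(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for shingle in {text[i:i + 5] for i in range(len(text) - 4)}:  # character 5-grams
        m.update(shingle.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)  # ~80 % similarity cutoff
kept = []
for doc_id, text in enumerate(documents):           # `documents` comes from the cleaning stage
    m = signature(text)
    if not lsh.query(m):                             # no near-duplicate seen so far
        lsh.insert(str(doc_id), m)
        kept.append(doc_id)
```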
Data Balancing
Train a lightweight classifier (BERT family) to assign each document to categories like news, encyclopedia, code, or markdown. Apply different cleaning and deduplication thresholds per category, and maintain a balanced mix (e.g., Chinese : English : code = 4 : 4 : 2).
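A toy sketch of ratio‑based sampling once every document carries a category label from the classifier; the document pools and the 4 : 4 : 2 weights are placeholders matching the mix above.

```python
import random

# Placeholder pools; in practice each holds cleaned documents from one category.
pools = {
    "zh": ["中文文档…"],
    "en": ["english document…"],
    "code": ["def f(): pass"],
}
weights = {"zh": 4, "en": 4, "code": 2}  # Chinese : English : code = 4 : 4 : 2

def sample_batch(n: int) -> list[str]:
    cats = random.choices(list(weights), weights=list(weights.values()), k=n)
    return [random.choice(pools[c]) for c in cats]
```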
Data Ordering
Curriculum learning matters: ordering data by semantic similarity can improve convergence. One practical approach is to concatenate the most similar documents, as described in the In‑Context Pretraining paper (https://arxiv.org/pdf/2310.10638).
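The paper builds a document similarity graph and solves a travelling‑salesman‑style ordering; the sketch below is a simpler greedy nearest‑neighbour approximation over precomputed document embeddings, just to illustrate the idea.

```python
import numpy as np

# Greedy ordering in the spirit of In-Context Pretraining: start from one document
# and repeatedly append the most similar unused one, so adjacent documents in a
# packed training sequence share context.
# `embs` is an (N, d) array of L2-normalised document embeddings (assumed given).
def order_by_similarity(embs: np.ndarray) -> list[int]:
    n = embs.shape[0]
    unused = set(range(1, n))
    order, current = [0], 0
    while unused:
        cand = np.fromiter(unused, dtype=int)
        sims = embs[cand] @ embs[current]
        current = int(cand[np.argmax(sims)])
        unused.remove(current)
        order.append(current)
    return order
```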
Data Pipeline
Implement a dynamic data loader that streams token IDs directly to the trainer. Separate processes for data preparation and model training keep GPUs busy, and checkpoint metadata should record how many times each document has been used.
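A minimal streaming loader in plain PyTorch, assuming shards of pre‑tokenized documents stored as JSONL with an input_ids field; a real pipeline would add shuffling, multi‑worker sharding, and per‑document usage counters in the checkpoint metadata.

```python
import json
import torch
from torch.utils.data import IterableDataset, DataLoader

class PackedTokenStream(IterableDataset):
    """Reads pre-tokenized shards and packs them into fixed-length blocks."""
    def __init__(self, shard_paths, seq_len=4096):
        self.shard_paths, self.seq_len = shard_paths, seq_len

    def __iter__(self):
        buf = []
        for path in self.shard_paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    buf.extend(json.loads(line)["input_ids"])
                    while len(buf) >= self.seq_len:
                        chunk, buf = buf[:self.seq_len], buf[self.seq_len:]
                        yield torch.tensor(chunk, dtype=torch.long)

# The worker process prepares batches while the GPU trains on the previous ones.
loader = DataLoader(PackedTokenStream(["shard_000.jsonl"]), batch_size=8, num_workers=1)
```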
Training
Tokenizer
Build a tokenizer on a large common‑crawl corpus using BPE/BBPE. Pay attention to numeric tokenization, compression ratio (≈1.5 Chinese characters per token), removal of toxic tokens, and inclusion of domain‑specific tokens (e.g., medical terms). Ensure the vocab size is a multiple of 128 and leaves a ~1k buffer for the embedding size.
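A sketch of training a byte‑level BPE tokenizer with the Hugging Face tokenizers library; the vocab size (a multiple of 128), special tokens, and corpus file names are example values.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=151_552,                       # example value, a multiple of 128
    special_tokens=["<|endoftext|>", "<pad>"],
)
tokenizer.train(["corpus_zh.txt", "corpus_en.txt", "corpus_code.txt"], trainer)
tokenizer.save("tokenizer.json")
```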
Model Architecture
Adopt a proven architecture such as LLaMA (RoPE, GQA, RMSNorm, SwiGLU). For models around 1 B parameters, share embedding and lm_head weights; larger models can keep them separate. Avoid unnecessary innovations unless you have robust experimental backing.
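As a concrete reference point, here is an illustrative configuration in roughly the 1 B range using the Hugging Face LlamaConfig, with GQA and tied embedding/lm_head weights; the exact dimensions are examples, not recommendations.

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=151_552,
    hidden_size=2048,
    intermediate_size=5632,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=4,        # GQA: 4 KV heads shared by 16 query heads
    max_position_embeddings=4096,
    rope_theta=10_000.0,
    tie_word_embeddings=True,     # share embedding and lm_head at small scale
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```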
Hyper‑parameters
Layer count, hidden size, and head count should be divisible by small powers of two (e.g., hidden size a multiple of 64 or 128, attention heads divisible by the tensor‑parallel degree) so the model splits evenly across parallelism dimensions.
Seq_len should start modest (4K–8K) and increase later; avoid extreme lengths (32K+) without sufficient compute.
Training Framework
For full‑scale pretraining (trillion‑token scale), Megatron is recommended; for continued pretraining, DeepSpeed is acceptable. Both should use FlashAttention for efficiency.
Megatron advantages: fast tensor/pipeline parallelism, clear configuration, quick model loading.
Megatron drawbacks: steep learning curve, buggy NVIDIA codebase.
DeepSpeed advantages: simple API, strong community support for alignment.
DeepSpeed drawbacks: slower training speed, slower checkpoint loading, limited low‑level control.
Training Tricks
Prioritize data‑parallelism; minimize inter‑node communication.
Avoid unnecessary offloading or recomputation.
Monitor loss per data channel (Chinese knowledge, English knowledge, code); a logging sketch follows this list.
Watch for loss spikes; they often indicate data or optimizer issues.
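A sketch of the per‑channel loss logging mentioned above, assuming each sample in a batch carries a domain id; the channel ids and the in‑memory log are illustrative, and in practice the values would go to a metrics dashboard.

```python
import torch
from collections import defaultdict

# Each sample carries a domain id (0 = zh knowledge, 1 = en knowledge, 2 = code),
# so a loss spike can be traced to a specific slice of the data rather than the
# global average.
channel_loss = defaultdict(list)

def log_channel_losses(logits, labels, domains, step):
    # per-token cross entropy, then averaged per sample
    losses = torch.nn.functional.cross_entropy(
        logits[:, :-1].transpose(1, 2), labels[:, 1:], reduction="none"
    ).mean(dim=1)
    for dom, loss in zip(domains.tolist(), losses.tolist()):
        channel_loss[dom].append((step, loss))
```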
Training Schedule
Warm‑up phase: gradually increase learning rate.
Mid‑stage: cosine decay or constant learning rate, tuned on small models (a warmup‑plus‑cosine sketch follows this list).
Late stage: increase the RoPE base and seq_len to handle longer contexts.
Final annealing: use high‑quality data and instruction‑fine‑tuning before benchmarking.
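A minimal warmup‑plus‑cosine schedule matching the first two phases above; the step counts and the 10 % floor are illustrative values to tune on small runs first.

```python
import math

def lr_lambda(step, warmup_steps=2_000, total_steps=500_000, min_ratio=0.1):
    """Linear warmup, then cosine decay down to min_ratio of the peak LR."""
    if step < warmup_steps:
        return step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_ratio + 0.5 * (1 - min_ratio) * (1 + math.cos(math.pi * progress))

# Example hookup with a PyTorch optimizer:
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```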
Evaluation
Perplexity (PPL)
Track loss on held‑out knowledge, code, and logic test sets. Aim for overall PPL < 2 on generic knowledge; compare only against your own checkpoints because tokenizer compression rates differ.
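Perplexity is just the exponential of the mean token‑level cross‑entropy on the held‑out set, so it can be computed directly from evaluation loss; the sketch below assumes an HF‑style causal LM that returns .loss when given labels.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, batches):
    total_loss, total_tokens = 0.0, 0
    for input_ids in batches:                       # (B, T) token id tensors
        out = model(input_ids, labels=input_ids)    # HF-style causal LM loss
        n = input_ids.numel() - input_ids.shape[0]  # tokens that get a prediction
        total_loss += out.loss.item() * n
        total_tokens += n
    return math.exp(total_loss / total_tokens)
```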
Benchmarking
Standard benchmarks are often static multiple‑choice tests. To obtain a more realistic signal, transform them into generative formats (e.g., “Question + options; answer the question”). Use accuracy rather than BLEU/ROUGE.
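One way to recast a multiple‑choice item into a generative prompt and grade it by accuracy; the item schema ("question", "options") is an assumption about your evaluation files.

```python
def to_generative_prompt(item: dict) -> str:
    """Format a multiple-choice item as a free-form generation task."""
    options = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
    return f"Question: {item['question']}\n{options}\nAnswer with the option letter:"

def is_correct(generated: str, gold_letter: str) -> bool:
    """Grade by whether the first generated character matches the gold option."""
    return generated.strip()[:1].upper() == gold_letter.upper()
```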
Probability Probes
Measure token or sentence probabilities for targeted facts (e.g., Prob('Beijing'|'China's capital is')). Observe trends over training to detect knowledge gain or forgetting.
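A sketch of such a probe with an HF‑style causal LM: score the log‑probability of a fixed continuation given a prompt and track it across checkpoints. The checkpoint path is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("./ckpt-100000")
model = AutoModelForCausalLM.from_pretrained("./ckpt-100000").eval()

@torch.no_grad()
def continuation_logprob(prompt: str, continuation: str) -> float:
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    cont_ids = tok(continuation, add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, cont_ids], dim=1)
    logprobs = model(ids).logits[:, :-1].log_softmax(dim=-1)
    # log-probs of the continuation tokens, conditioned on everything before them
    cont_logprobs = logprobs[:, prompt_ids.shape[1] - 1:]
    return cont_logprobs.gather(-1, cont_ids.unsqueeze(-1)).sum().item()

print(continuation_logprob("China's capital is", " Beijing"))
```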
Conclusion
Pretraining involves many equally important steps; data work often yields the biggest gains, while training code can be relatively straightforward once the pipeline is stable. The guide aims to equip practitioners with a practical, end‑to‑end checklist for building their own LLMs.
Original article: https://zhuanlan.zhihu.com/p/718354385
