From Zero to One: A Practical Guide to Pretraining Large Language Models

This comprehensive guide walks you through every stage of LLM pretraining—from data sourcing, cleaning, and deduplication to tokenizer design, model architecture choices, training framework selection, optimization tricks, and evaluation methods—highlighting common pitfalls and practical solutions for building robust models.

Author: ybq
Link: https://zhuanlan.zhihu.com/p/718354385

Background

Large dense models (e.g., Qwen), MoE models (e.g., DeepSeek), and small models (e.g., MiniCPM) dominate the landscape, yet reproducing a better model of the same size remains difficult. Truly open‑source pretraining (data, code, and full recipes) is still a long way off, so mastering the pretraining pipeline remains valuable: it lets you contribute to future open‑source releases, run domain‑specific continue‑pretraining, and keep full control over data and tokenization.

Data Section

Data Collection

The first step is to acquire roughly 10 TB of raw text, expanding the corpus continuously as training proceeds. Sources include web crawling, commercial data providers, and public datasets such as FineWeb, The Pile, Skypile, and RedPajama. PDF documents often require specialized OCR or LLM‑based parsing services because generic Python libraries struggle with formulas and tables.

Downloading from HuggingFace may require mirrors (e.g., hf_mirror) and parallel processes across multiple servers; large file listings can overwhelm standard file‑system commands, necessitating big‑data cluster tools.

Data quality varies: high‑knowledge‑density texts (e.g., Tang Poetry) are far more valuable than low‑density news articles. Synthetic high‑knowledge data—summarizing long articles into concise passages—can accelerate training by up to tenfold.

Data Cleaning

Cleaning is the most critical step. A common practice is to train a BERT‑based scoring model to rank data quality, then filter out low‑scoring items. Code, markdown, and LaTeX typically receive low scores from such a scorer, so extract them and handle them separately before cleaning rather than letting the filter discard them.

Scoring models need not be perfect; a rough 4 K‑token classifier is sufficient. Combine scoring with heuristic rules such as token‑type ratios, language ratios, presence of URLs, or prohibited keywords (e.g., political, adult content). Ensure rules do not bias the data distribution (e.g., removing all English‑heavy documents would create a monolingual corpus).
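For illustration, a minimal sketch of such rule‑based filters is shown below; the thresholds, the banned‑keyword placeholders, and the letter‑ratio check are assumptions for illustration rather than tuned values.

```python
import re

URL_RE = re.compile(r"https?://\S+")
BANNED_KEYWORDS = ("PLACEHOLDER_BANNED_1", "PLACEHOLDER_BANNED_2")  # political/adult terms would go here

def passes_heuristics(text: str,
                      max_url_ratio: float = 0.05,
                      min_letter_ratio: float = 0.6) -> bool:
    """Cheap rule-based checks applied alongside the BERT quality scorer."""
    if not text.strip():
        return False
    # Keyword blocklist.
    if any(kw in text for kw in BANNED_KEYWORDS):
        return False
    # URL density: pages that are mostly links carry little prose.
    url_chars = sum(len(m) for m in URL_RE.findall(text))
    if url_chars / len(text) > max_url_ratio:
        return False
    # Token-type ratio: require a minimum share of letters (Latin and CJK both count as alphabetic).
    letters = sum(ch.isalpha() for ch in text)
    if letters / len(text) < min_letter_ratio:
        return False
    return True
```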

Data de‑identification is mandatory: strip personal names, phone numbers, emails, and any copyrighted references using regular expressions.
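A rough de‑identification sketch follows; the patterns are illustrative only (the phone pattern assumes mainland‑China mobile numbers), and a real pipeline needs locale‑specific rules for names, ID numbers, and copyrighted references.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Illustrative pattern: 11-digit mainland-China mobile numbers, optionally prefixed with +86.
PHONE_RE = re.compile(r"(?<!\d)(?:\+?86[- ]?)?1\d{10}(?!\d)")

def scrub_pii(text: str) -> str:
    """Replace obvious personal identifiers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```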

Deduplication

For terabyte‑scale corpora, deduplication is unavoidable. Use a map‑reduce framework (Hadoop, Spark, etc.) to implement MinHash‑based similarity detection. Decide between sentence‑level or document‑level deduplication based on resource constraints.

Set similarity thresholds according to target corpus size: 80 % for a 10 TB target, 90 % for 5 TB, etc. Remember that deduplication is an ongoing process; continue refining as more data arrives.
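For intuition, here is a single‑machine sketch of MinHash near‑duplicate detection using the `datasketch` library; the article assumes a map‑reduce implementation at terabyte scale, and the shingle size, permutation count, and 0.8 threshold here are illustrative assumptions.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """MinHash signature over character 5-gram shingles."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - 4, 1)):
        m.update(text[i:i + 5].encode("utf-8"))
    return m

def dedup(docs: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Keep one representative per near-duplicate cluster."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash_of(text)
        if lsh.query(sig):          # an already-kept document is near-identical
            continue
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept
```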

Data Balancing

Train a lightweight classifier (BERT family) to label documents into categories such as news, encyclopedia, code, markdown, etc. Apply different cleaning thresholds per category and prioritize high‑scoring items during deduplication.

Typical Chinese‑English‑code ratio is 4:4:2, with logical data (math, chain‑of‑thought) added as needed.
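As a back‑of‑the‑envelope check on such a mix, the sketch below works out how many passes over each category's inventory a given token budget implies; every number in it except the 4:4:2 target ratio is a placeholder assumption.

```python
# Placeholder token inventories and budget; only the 4:4:2 target ratio comes from the text.
target_ratio = {"zh": 0.4, "en": 0.4, "code": 0.2}
inventory_tokens = {"zh": 4.0e12, "en": 5.0e12, "code": 0.8e12}
total_budget = 10e12  # total tokens to train on

for cat, share in target_ratio.items():
    need = share * total_budget
    epochs = need / inventory_tokens[cat]
    print(f"{cat}: need {need:.2e} tokens -> {epochs:.2f} passes over available data")
```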

Data Ordering (Curriculum Learning)

Ordering data influences learning efficiency. Group semantically similar documents together (e.g., using Llama's In‑Context Pretraining method) so that each training context is coherent; the recommendation is that a document should not be preceded by unrelated documents in its context window during training.

Experiment with attention masks to encourage topic‑switching ability, though many teams find masks have negligible impact.

Data Pipeline

Pretraining loads data dynamically: read‑process‑train cycles repeat throughout training. Token IDs, not raw tokens, are fed to the model, so tokenization and concatenation must be pre‑computed to avoid GPU stalls.

Maintain two parallel processes: a data‑processing worker that continuously produces JSONL shards and a training worker that consumes them. Tag each shard with usage counts to reduce over‑sampling of the same documents.
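A minimal sketch of the consumer side is shown below: it streams pre‑tokenized JSONL shards and packs them into fixed‑length sequences; the shard layout and the `input_ids` field name are assumptions.

```python
import glob
import json

import torch
from torch.utils.data import IterableDataset

class PackedShardDataset(IterableDataset):
    """Stream JSONL shards of the form {"input_ids": [...]} written by the
    data-processing worker and pack them into fixed-length training sequences."""

    def __init__(self, shard_glob: str, seq_len: int = 4096):
        self.shard_files = sorted(glob.glob(shard_glob))
        self.seq_len = seq_len

    def __iter__(self):
        buffer: list[int] = []
        for path in self.shard_files:
            with open(path) as f:
                for line in f:
                    buffer.extend(json.loads(line)["input_ids"])
                    # Emit full sequences as soon as the buffer allows, keeping the GPU fed.
                    while len(buffer) >= self.seq_len:
                        chunk, buffer = buffer[:self.seq_len], buffer[self.seq_len:]
                        yield torch.tensor(chunk, dtype=torch.long)
```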

Use chunk sizes that double cleanly (1 B, 2 B, 4 B tokens, etc.) to simplify checkpoint rollback.

Training Section

Tokenizer Design

Design the tokenizer early; expanding the vocabulary later can change existing token mappings (e.g., adding a token for "中华人民" can alter how "中华人民共和国" is segmented, invalidating previously tokenized data). Train it with BPE/BBPE on a massive common‑crawl‑scale corpus, which typically requires a large‑memory CPU node.

Split numbers to avoid ambiguous tokenization (e.g., 9.9 vs 9.11).

Control compression rate: ~1.5 Chinese characters per token yields lower loss.

Manually remove toxic or noisy tokens.

Add domain‑specific tokens (e.g., medical drug names) to improve compression.

Ensure vocab size is a multiple of 128 and leaves a ~1 K buffer relative to the model’s embedding size.
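A minimal BBPE training sketch with the HuggingFace `tokenizers` library follows; the corpus path, special‑token set, and exact vocabulary size are assumptions, and the digit splitter implements the number‑splitting rule above.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

vocab_size = 128 * 1000  # multiple of 128, leaving headroom below the model's embedding size

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),   # split numbers digit by digit (9.9 vs 9.11)
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])

trainer = trainers.BpeTrainer(
    vocab_size=vocab_size,
    special_tokens=["<pad>", "<bos>", "<eos>"],      # placeholder special-token set
)
tokenizer.train(files=["corpus/part-000.txt"], trainer=trainer)  # placeholder corpus shard
tokenizer.save("tokenizer.json")
```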

Model Architecture

Follow proven Llama‑style designs (RoPE, GQA, RMSNorm, SwiGLU). For ~1 B models, share embedding and LM head parameters; larger models can keep them separate.

Scale layer count and hidden size proportionally; keep hyper‑parameters divisible by common factors (2, 4, 8, 64, 128) to simplify pipeline and tensor parallelism.

Avoid overly long sequence lengths at the start; begin with 4 K–8 K tokens, then gradually increase to 32 K–64 K using RoPE extrapolation.
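A small sanity‑check sketch of those divisibility constraints for a Llama‑style configuration is shown below; the specific sizes are illustrative, not a recommended recipe.

```python
# Illustrative Llama-style shapes; only the divisibility checks matter here.
cfg = dict(
    hidden_size=4096,
    num_layers=32,
    num_attention_heads=32,
    num_kv_heads=8,            # GQA: 32 query heads share 8 KV heads
    ffn_hidden_size=11008,     # SwiGLU intermediate size
    vocab_size=128 * 1000,
    tensor_parallel=8,
)

assert cfg["hidden_size"] % cfg["num_attention_heads"] == 0
assert cfg["num_attention_heads"] % cfg["num_kv_heads"] == 0
assert cfg["num_attention_heads"] % cfg["tensor_parallel"] == 0
assert cfg["ffn_hidden_size"] % cfg["tensor_parallel"] == 0
assert cfg["vocab_size"] % 128 == 0
```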

Training Frameworks

Use Megatron for full‑scale pretraining (T‑level token counts) and DeepSpeed for continue‑pretraining or smaller experiments. Megatron offers fast tensor/pipeline parallelism and extensive configurability but has a steep learning curve and occasional bugs. DeepSpeed is easier to use and widely adopted in alignment work, though it can be slower to load and less flexible for low‑level modifications.

Regardless of framework, enable FlashAttention for optimal performance.

Training Optimizations

Keep communication as local as possible: work that stays on a single GPU is cheapest, intra‑node (e.g., NVLink) traffic comes next, and inter‑node traffic is the most expensive, so place parallel groups accordingly.

Prefer data parallelism; avoid unnecessary offloading or recomputation.

Cache intermediate results when possible.

Monitor loss curves per data type (Chinese knowledge, English knowledge, code). Watch for loss spikes, which often indicate data issues or optimizer anomalies; adjust AdamW hyper‑parameters accordingly.

Training Schedule

Warm‑up phase: gradually increase learning rate.

Mid‑training phase: apply cosine decay or constant schedules based on small‑model experiments.
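A compact sketch of the warm‑up‑plus‑cosine shape is given below; the minimum‑LR ratio and step counts are assumptions to be set from the small‑model experiments.

```python
import math

def lr_at(step: int, max_lr: float, warmup_steps: int, total_steps: int,
          min_lr_ratio: float = 0.1) -> float:
    """Linear warm-up followed by cosine decay to min_lr_ratio * max_lr."""
    if step < warmup_steps:
        return max_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return max_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)
```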

Late phase: increase RoPE base and sequence length to adapt to longer contexts.

Final annealing: decay the learning rate while mixing in high‑quality data and IFT data before benchmarking.

Evaluation Section

Perplexity (PPL)

Track test‑set loss on curated knowledge, logic, and code subsets. Aim for overall PPL below 2 on a high‑quality knowledge benchmark; note that PPL is only comparable within the same tokenizer.
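A minimal PPL computation over a held‑out shard might look like the sketch below; it assumes a HuggingFace‑style causal LM whose forward pass returns a mean `.loss`, and the token count is an approximation that ignores label shifting.

```python
import math

import torch

@torch.no_grad()
def perplexity(model, batches) -> float:
    """exp(average token-level cross-entropy) over curated knowledge/logic/code subsets."""
    total_loss, total_tokens = 0.0, 0
    for input_ids in batches:                 # each: LongTensor of shape [batch, seq_len]
        out = model(input_ids, labels=input_ids)
        n = input_ids.numel()                 # approximate; ignores the one-token label shift
        total_loss += out.loss.item() * n
        total_tokens += n
    return math.exp(total_loss / total_tokens)
```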

Benchmarking

Standard benchmarks often favor models that have been fine‑tuned on the test data. To avoid this, transform benchmarks into generative tasks (e.g., provide multiple‑choice questions and ask the model to generate the answer) and evaluate using accuracy rather than raw scores.
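One way to do this, sketched below, is to render each multiple‑choice item as a generation prompt and grade by exact‑match accuracy; the prompt template, field names, and first‑letter answer extraction are all assumptions.

```python
def to_prompt(item: dict) -> str:
    """Render a multiple-choice item ({"question", "options", "answer"}) as a generation prompt."""
    options = "\n".join(f"{key}. {text}" for key, text in item["options"].items())
    return f"Question: {item['question']}\n{options}\nAnswer:"

def accuracy(generate_fn, items: list[dict]) -> float:
    """generate_fn: prompt -> model completion. Grade by comparing the first generated letter."""
    correct = 0
    for item in items:
        reply = generate_fn(to_prompt(item)).strip()
        correct += reply[:1].upper() == item["answer"].upper()
    return correct / len(items)
```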

Probability Probes

Measure token or sentence probabilities for specific facts (e.g., Prob('Beijing' | 'The capital of China is')). Observe trends over training to detect knowledge acquisition or forgetting.
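A sketch of such a probe with HuggingFace `transformers` is shown below; the checkpoint path is a placeholder, and the probe simply sums the log‑probabilities of the target tokens given the prefix.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def target_prob(model, tokenizer, prefix: str, target: str) -> float:
    """P(target | prefix) under the current checkpoint."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids          # [1, Lp]
    target_ids = tokenizer(target, add_special_tokens=False).input_ids     # list of length Lt
    ids = torch.cat([prefix_ids, torch.tensor([target_ids])], dim=1)       # [1, Lp + Lt]
    log_probs = model(ids).logits.log_softmax(dim=-1)
    # Position j predicts token j + 1, so the i-th target token is scored at index Lp - 1 + i.
    logp = sum(log_probs[0, prefix_ids.shape[1] - 1 + i, tok].item()
               for i, tok in enumerate(target_ids))
    return math.exp(logp)

# model = AutoModelForCausalLM.from_pretrained("path/to/checkpoint")  # placeholder path
# tokenizer = AutoTokenizer.from_pretrained("path/to/checkpoint")
# target_prob(model, tokenizer, "The capital of China is", " Beijing")
```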

Conclusion

All stages—from data acquisition and cleaning to tokenizer construction, model scaling, framework selection, and evaluation—are equally critical. While infrastructure teams can automate training runs, the most impactful improvements often arise from clever data handling and curriculum design.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
