From Zero to One: A Practical Guide to Pretraining Large Language Models
This guide walks through every stage of building a large-language-model pretraining pipeline: data sourcing, cleaning, and deduplication; tokenizer design; model architecture choices; training framework selection; optimization tricks; and evaluation methods. It offers actionable tips and common pitfalls for both newcomers and seasoned practitioners.
Background
Large-scale dense models (e.g., Qwen), MoE models (e.g., DeepSeek), and small models (e.g., MiniCPM) dominate the current landscape, yet reproducing comparable performance remains difficult for both individuals and enterprises. Truly open pretraining is still a long way off because most released models omit their training frameworks and data, keeping the core pipeline effectively closed. Mastering pretraining is valuable for future open-source contributions, for domain-specific continued pretraining, and for controlling tokenization and inference speed.
Data Chapter
Data Crawling
The first step is to acquire roughly 10 TB of raw text. Sources include web crawling, e‑commerce sites, data vendors, and public datasets such as FineWeb, The Pile, Skypile, and RedPajama. PDF‑based resources (papers, books) often require robust OCR or specialized PDF parsing services; generic Python libraries struggle with formulas and tables.
Downloading from HuggingFace may require mirrors (e.g., hf_mirror) and parallel processes across multiple servers due to bandwidth limits and massive file counts.
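For illustration, here is a minimal download sketch using `huggingface_hub` with a mirror endpoint; the mirror URL, dataset repo, and shard pattern are placeholders to adapt to your setup.

```python
# Minimal sketch: pull one shard of a public corpus through a mirror endpoint.
import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # must be set before importing huggingface_hub

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="HuggingFaceFW/fineweb",              # example corpus mentioned above
    repo_type="dataset",
    local_dir="./fineweb",
    max_workers=8,                                 # parallel file downloads
    allow_patterns=["data/CC-MAIN-2024-10/*"],     # placeholder: fetch one shard per process/server
)
```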
Data Cleaning
Cleaning is the most critical step. Use a BERT‑based scoring model to rank data quality, then filter out low‑scoring items. Pay special attention to code, markdown, and LaTeX formats, which typically receive low scores and should be extracted before cleaning.
A rough scorer that separates high‑ and low‑quality samples is sufficient. Combine scorer output with heuristic rules (e.g., token length thresholds, language ratios, presence of URLs, prohibited keywords) to filter data.
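As a rough sketch, here is how a scorer output and a few heuristic rules might be combined into a single keep/drop decision; the score threshold, length bounds, and keyword list are illustrative rather than tuned values.

```python
import re

URL_RE = re.compile(r"https?://\S+")
BLOCKLIST = {"casino", "viagra"}  # placeholder prohibited keywords

def keep_document(text: str, quality_score: float) -> bool:
    """Combine a BERT-style quality score with simple heuristic rules (all thresholds illustrative)."""
    if quality_score < 0.5:                       # scorer only needs to separate high from low quality
        return False
    tokens = text.split()
    if not (50 <= len(tokens) <= 100_000):        # length bounds
        return False
    if len(URL_RE.findall(text)) > 10:            # too many URLs -> likely spam or boilerplate
        return False
    zh_chars = sum("\u4e00" <= ch <= "\u9fff" for ch in text)
    latin_chars = sum(ch.isascii() and ch.isalpha() for ch in text)
    if (zh_chars + latin_chars) / max(len(text), 1) < 0.6:  # too little actual language content
        return False
    if any(word in text.lower() for word in BLOCKLIST):
        return False
    return True
```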
Data sanitization must also remove personal information (names, phone numbers, emails) and any copyrighted or sensitive content.
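A minimal sketch of regex-based scrubbing for emails and phone numbers; real pipelines usually add name recognition and copyright/sensitive-content filters on top, and the patterns below are simplistic placeholders.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b(?:\+?86[- ]?)?1[3-9]\d{9}\b")  # mainland-China mobile numbers (illustrative)

def scrub_pii(text: str) -> str:
    """Replace obvious PII with placeholder tokens before the text enters the corpus."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text
```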
Data Deduplication
For terabyte‑scale corpora, deduplication is essential. Implement either sentence‑level or document‑level deduplication using a MapReduce framework (Hadoop, Spark) or a custom MinHash implementation. Adjust similarity thresholds based on the target data volume (e.g., 80 % for 10 TB, 90 % for 5 TB).
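A minimal MinHash-LSH sketch using the `datasketch` library for document-level deduplication; the shingle size, the 0.8 threshold (matching the 10 TB example above), and the `stream_of_documents()` helper are assumptions.

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from character 5-gram shingles (shingle size is a guess)."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - 4, 1)):
        m.update(text[i:i + 5].encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)   # ~80% similarity, per the 10 TB example above
kept = []
for doc_id, text in enumerate(stream_of_documents()):  # stream_of_documents() is a placeholder
    sig = minhash(text)
    if lsh.query(sig):          # a near-duplicate is already indexed
        continue
    lsh.insert(str(doc_id), sig)
    kept.append(doc_id)
```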
Data Mixing
Train a lightweight classifier (BERT family) to label each document as news, encyclopedia, code, markdown, etc. Apply different cleaning thresholds per category and prioritize higher‑scoring samples during deduplication.
Typical Chinese‑English‑code mix ratios are 4:4:2, but the exact split should be tuned to the available data and downstream needs.
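One simple way to realize such a mix is weighted sampling over per-category pools, sketched below; the 4:4:2 split follows the text, while the pool structure is a placeholder.

```python
import random

MIX = {"zh": 0.4, "en": 0.4, "code": 0.2}   # 4:4:2 split from the text

def sample_next_document(pools: dict[str, list[str]]) -> str:
    """Pick a category according to the mix ratio, then a document from that category's pool."""
    category = random.choices(list(MIX), weights=list(MIX.values()), k=1)[0]
    return random.choice(pools[category])
```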
Data Ordering
Curriculum learning matters: ordering easier or high‑knowledge‑density documents before harder ones can improve convergence. One practical approach is to group semantically similar documents using embedding similarity and concatenate them, as described in In‑Context Pretraining (https://arxiv.org/pdf/2310.10638).
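A rough sketch of the grouping idea: embed documents and greedily chain each one to its nearest unused neighbor so related texts end up adjacent. The embedding model name is a placeholder, and the paper uses approximate nearest-neighbor search rather than this brute-force loop.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def order_by_similarity(docs: list[str]) -> list[int]:
    """Greedy nearest-neighbor ordering over normalized document embeddings (brute force)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")           # placeholder embedding model
    emb = model.encode(docs, normalize_embeddings=True)       # (N, d), unit-norm rows
    order, used = [0], {0}
    while len(order) < len(docs):
        sims = emb @ emb[order[-1]]                           # cosine similarity to the last doc
        sims[list(used)] = -np.inf                            # never revisit a used document
        nxt = int(np.argmax(sims))
        order.append(nxt)
        used.add(nxt)
    return order
```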
Data Pipeline
Pretraining reads data dynamically in chunks (e.g., 1 B, 2 B, 4 B tokens). Each chunk should be tokenized, concatenated, and padded ahead of time to avoid GPU stalls. Track usage counts per document to reduce over‑sampling of the same data.
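A minimal packing sketch under the assumption of a Hugging Face-style tokenizer exposing `eos_token_id` and `pad_token_id`: tokenize a chunk, join documents with EOS, and cut the stream into fixed-length sequences.

```python
def pack_chunk(docs: list[str], tokenizer, seq_len: int = 4096) -> list[list[int]]:
    """Tokenize documents, join them with EOS, and cut the stream into fixed-length sequences."""
    stream: list[int] = []
    for doc in docs:
        stream.extend(tokenizer.encode(doc))
        stream.append(tokenizer.eos_token_id)   # document boundary
    sequences = [stream[i:i + seq_len] for i in range(0, len(stream), seq_len)]
    if len(sequences[-1]) < seq_len:            # pad the tail so every sequence has equal length
        sequences[-1] += [tokenizer.pad_token_id] * (seq_len - len(sequences[-1]))
    return sequences
```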
Training Chapter
Tokenizer
Design a tokenizer that balances compression rate (tokens per Chinese character) and vocabulary coverage. Recommended practices include the following (a training sketch follows the list):
Split numbers to avoid ambiguous tokenization.
Maintain a compression ratio of roughly 1 token ≈ 1.5 Chinese characters.
Manually remove toxic or unwanted tokens.
Add domain‑specific tokens (e.g., medical drug names) to improve compression for target tasks.
Ensure the vocabulary size is a multiple of 128 and reserve a buffer of roughly 1k slots between the tokenizer vocabulary and the model's embedding size, so new tokens can be added later without resizing.
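A rough training sketch with the `tokenizers` library that reflects several of the points above (digit splitting, a vocabulary size that is a multiple of 128, room for special and domain tokens); the corpus path and special tokens are placeholders.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),    # split numbers digit by digit
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])

trainer = trainers.BpeTrainer(
    vocab_size=128 * 1000,                             # multiple of 128; leave headroom vs. embedding size
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],  # plus any domain-specific tokens (e.g., drug names)
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt is a placeholder
tokenizer.save("tokenizer.json")
```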
Model Architecture
For stability, start with the proven LLaMA stack: RoPE positional encoding, GQA, RMSNorm, and SwiGLU. Small models (< 1 B) should share embedding and LM head parameters; larger models can keep them separate.
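For reference, this stack can be expressed with a Hugging Face `LlamaConfig`; the sizes below are illustrative rather than recommended values.

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=128_000,          # padded to a multiple of 128, see the tokenizer section
    hidden_size=2048,
    intermediate_size=5632,      # SwiGLU FFN width
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=4,       # GQA: fewer KV heads than query heads
    rms_norm_eps=1e-5,           # RMSNorm
    rope_theta=10000.0,          # RoPE base; raised later for long-context training
    max_position_embeddings=4096,
    tie_word_embeddings=True,    # share embedding and LM head for a small (<1B) model
)
model = LlamaForCausalLM(config)
```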
Avoid premature architectural innovations unless extensive experiments validate them, as changes can waste massive compute budgets.
Model Hyper‑parameters
Scale layer count and hidden size proportionally (the “well‑shaped” principle). Choose values divisible by common hardware factors (2, 4, 8, 64, 128). Typical settings (a sanity-check sketch follows the list):
layer_num divisible by pipeline size.
num_head a multiple of 8 (tensor‑parallel factor).
hidden_size and vocab_size multiples of 128.
seq_len should start modest (4K–8K) with RoPE base adjustments before scaling to 32K–64K.
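A tiny sanity-check sketch of the divisibility constraints above; the pipeline- and tensor-parallel sizes are placeholders.

```python
def check_shape(layer_num, num_head, hidden_size, vocab_size, pp_size=4, tp_size=8):
    """Assert the divisibility constraints listed above (parallel sizes are illustrative)."""
    assert layer_num % pp_size == 0, "layers must split evenly across pipeline stages"
    assert num_head % tp_size == 0, "attention heads must split evenly across tensor-parallel ranks"
    assert hidden_size % 128 == 0 and vocab_size % 128 == 0, "keep sizes multiples of 128"
    assert hidden_size % num_head == 0, "per-head dimension must be an integer"

check_shape(layer_num=32, num_head=32, hidden_size=4096, vocab_size=128_000)
```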
Training Framework
For scratch pretraining at the T‑token scale, Megatron‑LM is recommended due to its optimized tensor‑ and pipeline‑parallel kernels and clear configuration. For continue‑pretraining at the B‑token scale, DeepSpeed offers a simpler codebase and strong community support.
Both frameworks should use FlashAttention for efficiency.
Training Tricks
Prioritize communication efficiency: intra‑GPU > inter‑GPU > inter‑node. Use data parallelism whenever possible, avoid unnecessary offloading, and cache intermediate results to reduce recomputation.
Loss Analysis
Monitor per‑category losses (Chinese knowledge, English knowledge, code). Watch for loss spikes, which often indicate data quality issues or optimizer anomalies. Adjust AdamW hyper‑parameters as needed.
Training Process
Typical schedule:
Warm‑up phase with gradually increasing learning rate.
Middle phase with cosine decay or a constant learning rate, tuned via small-model experiments (a schedule sketch follows this list).
Late phase: increase RoPE base and seq_len to adapt to longer contexts.
Final annealing on high-quality data (e.g., instruction fine-tuning data) before benchmark evaluation.
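A minimal warmup-plus-cosine schedule sketch; the step counts and learning-rate bounds are illustrative only.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
    """Linear warmup followed by cosine decay down to min_lr (all numbers illustrative)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * min(progress, 1.0)))
```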
Automate checkpointing; if loss diverges, roll back to the previous checkpoint.
Evaluation Chapter
PPL
Track perplexity on held‑out knowledge, logic, and code test sets. Aim for PPL < 2 on generic knowledge benchmarks; compare only against your own model due to tokenizer differences.
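A rough perplexity sketch assuming a Hugging Face-style causal LM, where `model(ids, labels=ids).loss` returns the mean token cross-entropy; the model, tokenizer, and test texts are placeholders.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts: list[str]) -> float:
    """Average token-level perplexity over a held-out set (full-sequence scoring, no sampling)."""
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        out = model(ids, labels=ids)            # mean cross-entropy over predicted tokens
        n = ids.shape[1] - 1                    # number of predicted tokens
        total_nll += out.loss.item() * n
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```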
Benchmark
Standard multiple‑choice benchmarks are limited; consider converting them into generative formats (e.g., “Question + Answer_A … Answer_D. Provide the correct answer.”) or redesigning them to avoid memorization effects.
Probability Probes
Construct targeted probes to monitor specific token or sentence probabilities over training, e.g., Prob('北京' | '中国的首都是'), the probability of 'Beijing' given the prompt 'The capital of China is'. Use trends rather than absolute values to assess knowledge retention or forgetting.
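A minimal probe sketch: score the target tokens given the prompt by summing their log-probabilities. It assumes a Hugging Face-style causal LM and tokenizer, and the prompt/target boundary handling should be checked per tokenizer.

```python
import torch

@torch.no_grad()
def continuation_logprob(model, tokenizer, prompt: str, target: str) -> float:
    """Sum of log-probabilities of `target` tokens given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + target, return_tensors="pt").input_ids
    logits = model(full_ids.to(model.device)).logits[0]          # (seq_len, vocab)
    logprobs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, pos]
        total += logprobs[pos - 1, token_id].item()              # token at pos is predicted at pos-1
    return total

# Example probe from the text; track the trend of this value across checkpoints.
# continuation_logprob(model, tokenizer, "中国的首都是", "北京")
```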
Conclusion
All stages—data acquisition, cleaning, deduplication, mixing, ordering, tokenization, model design, hyper‑parameter selection, framework choice, training tricks, and evaluation—are equally critical. While infrastructure setup can be straightforward once the code runs, the creative work in data preparation often yields the biggest performance gains.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
