How to Effectively Continue Pretraining Large Language Models: Scaling Laws, Data Ratios, and Practical Tips

This article explains the motivation for domain‑specific continued pretraining of large language models, outlines a three‑step workflow (vocabulary expansion, domain continued pretraining with data replay, ratio control, and scaling‑law calculations, then domain alignment), gives concrete hyper‑parameter recommendations, and discusses challenges across different domain types and future research directions.

Baobao Algorithm Notes

Jul 10, 2024

Background

When the pre‑training data of a large language model (LLM) is already extensive, further full‑scale retraining yields diminishing returns. To improve performance on a specific domain (e.g., education, code) while preserving general capability, practitioners use continued pretraining: a second training stage that focuses on domain‑specific corpora.

Steps

Expand Vocabulary

Vocabulary extension is required only when the base model’s token distribution differs markedly from the target domain. For example, converting an English LLaMA checkpoint to Chinese typically demands new Chinese tokens because the original token set covers only ~5 % Chinese characters. Educational models may also need extra punctuation symbols. Users should compare the base tokenizer’s coverage with the lexical needs of the target data and add tokens accordingly.
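
One rough way to run this coverage comparison is to count how many character occurrences in a domain sample exist as standalone vocabulary entries. The sketch below is an illustration only: `char_coverage`, the toy `vocab`, and the sample texts are hypothetical names and data, not part of any tokenizer library. For tokenizers with byte fallback or subword merges this is just a proxy, and measuring tokens per character with the real tokenizer would likely be a more faithful signal.

```python
from collections import Counter


def char_coverage(vocab, texts):
    """Return (covered fraction, missing characters) for non-space characters.

    vocab: a set of token strings (hypothetical; export it from your tokenizer).
    texts: an iterable of sample strings from the target domain.
    """
    counts = Counter(ch for text in texts for ch in text if not ch.isspace())
    total = sum(counts.values())
    covered = sum(n for ch, n in counts.items() if ch in vocab)
    # Most frequent uncovered characters first: candidates for new tokens.
    missing = [ch for ch, _ in counts.most_common() if ch not in vocab]
    ratio = covered / total if total else 1.0
    return ratio, missing


# Toy data, assumed purely for illustration.
vocab = {"a", "b", "c", "学"}
texts = ["abc 学习", "cab"]
ratio, missing = char_coverage(vocab, texts)
print(ratio, missing)  # 0.875 ['习']
```

A low ratio on a representative sample suggests that vocabulary expansion is worth considering; a high ratio suggests the base tokenizer may already be adequate.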

Domain Continued Pretraining

Replay

During continued pretraining, mix in a sample of the original pretraining corpus (the “replay” data). Many open‑source base models blend a small amount of instruction‑following (SFT) data into the final pretraining stage, and the exact proportion is often undocumented. Omitting this SFT component from the replay set can cause a noticeable drop in general performance after domain training.

Ratio Control

Let r denote the fraction of general‑domain data and 1 − r the fraction of domain‑specific data. Experiments (Zhang Ge & Hao Ran) show that as 1 − r increases, domain loss decreases while general loss rises, eventually stabilising. Their paper provides a scaling‑law formula that predicts loss values for any ratio. By fitting the formula on a small‑scale experiment (e.g., a 1 B‑parameter model with limited data), one can extrapolate expected losses for larger models.
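
The replay and ratio ideas can be combined in a simple weighted sampler that decides, for each training example, which corpus to draw from. This is a minimal sketch under stated assumptions: the source names, the value r = 0.3, and the 90/10 split of replay data between plain text and SFT data are illustrative guesses, not values from the article or the cited paper.

```python
import random
from collections import Counter


def sample_sources(weights, n, seed=0):
    """Draw n source names according to mixing weights (need not sum to 1)."""
    if any(w < 0 for w in weights.values()) or sum(weights.values()) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


r = 0.3  # fraction of general-domain (replay) data; illustrative value
weights = {
    "general_replay": r * 0.9,  # assumed: most replay is plain pretraining text
    "sft_replay": r * 0.1,      # assumed: small SFT share inside the replay set
    "domain": 1 - r,            # domain-specific corpus
}
draws = sample_sources(weights, n=20000)
counts = Counter(draws)
domain_share = counts["domain"] / len(draws)
print(f"domain share: {domain_share:.3f}")
```

Sweeping r over a few values in a small‑scale run yields the (ratio, loss) pairs needed to fit a ratio‑dependent scaling law.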

Scaling Law

Using the fitted scaling law, estimate the number of training steps required for the loss to plateau at a satisfactory level.

[Figure: predicted step count from the fitted scaling law]

For large organisations the extra cost of continue pretraining is modest; for smaller teams a 7 B model is a realistic target, and the scaling‑law insight helps budget the required compute.
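
To make the step estimate concrete, the sketch below fits a generic power law, L(s) = E + A · s^(−α), to a few (step, loss) measurements and inverts it for a target loss. This functional form is an assumption chosen for illustration; it is not the formula from the paper cited above. The irreducible loss E (`floor`) is assumed to be known or guessed in advance, and the data points are synthetic.

```python
import math


def fit_power_law(steps, losses, floor):
    """Least-squares fit of log(L - floor) = log(A) - alpha * log(s)."""
    xs = [math.log(s) for s in steps]
    ys = [math.log(loss - floor) for loss in losses]  # requires loss > floor
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    alpha = -slope
    a = math.exp(mean_y - slope * mean_x)
    return a, alpha


def steps_to_reach(target, a, alpha, floor):
    """Invert L(s) = floor + a * s**(-alpha) for the step count s."""
    if target <= floor:
        raise ValueError("target loss must be above the assumed floor")
    return (a / (target - floor)) ** (1.0 / alpha)


# Synthetic measurements generated from A = 5, alpha = 0.5, floor = 1.5.
floor = 1.5
steps = [100, 400, 1600, 6400]
losses = [floor + 5.0 / math.sqrt(s) for s in steps]

a, alpha = fit_power_law(steps, losses, floor)
needed = steps_to_reach(1.6, a, alpha, floor)
print(f"A={a:.3f}, alpha={alpha:.3f}, steps for loss 1.6 ~ {needed:.0f}")
```

Multiplying the estimated step count by tokens per step and the cost per token gives a first‑order compute budget.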

Hyper‑parameters

Keep the learning rate (LR) identical to the original pretraining LR, set warm‑up steps to zero, and increase the batch size for stability. If the base model's LR had already been decayed by the end of pretraining, raise it back up to this value; this may cause a brief loss spike before the curve stabilises. Specific settings can vary across base models, so practitioners should tune them empirically.
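
As an illustration of these settings, the schedule below starts at the pretraining peak LR with zero warm‑up and then decays. The cosine shape, the function name `lr_at`, and the numeric values are assumptions for the sketch; the text above only specifies reusing the pretraining LR and skipping warm‑up.

```python
import math


def lr_at(step, peak_lr, total_steps, min_lr=0.0, warmup_steps=0):
    """Learning rate at a given step: optional linear warm-up, then cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, (step - warmup_steps) / span))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Illustrative values: reuse the (assumed) pretraining peak LR, no warm-up.
peak_lr, min_lr, total_steps = 3e-4, 3e-5, 1000
first = lr_at(0, peak_lr, total_steps, min_lr)           # starts at the peak
middle = lr_at(500, peak_lr, total_steps, min_lr)        # halfway between peak and min
last = lr_at(total_steps, peak_lr, total_steps, min_lr)  # ends at min_lr
print(first, middle, last)
```

With `warmup_steps=0` the first update already runs at the full pretraining LR, which is where a short loss spike would be expected to appear.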

Domain Alignment

After continued pretraining, align the model to the target domain by increasing the proportion of domain‑specific SFT data during the final fine‑tuning stage. This can help recover general capability lost during domain training.

Domain Characteristics

Knowledge‑Centric Continued Pretraining

Key challenges include selecting the optimal domain‑data ratio and determining the total token count needed for alignment. Higher‑quality data shifts the scaling‑law parameters, so each team should re‑fit the formula on its own cleaned dataset.

Language‑Centric Continued Pretraining

When adapting LLaMA to Chinese, increase the proportion of Chinese data gradually rather than in one large jump. Different kinds of knowledge are thought to reside in different transformer layers: lower layers capture basic patterns, while higher layers encode domain‑specific details. Abrupt ratio changes can destabilise the higher layers and cause the loss to diverge.
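
A gradual increase can be expressed as a ramp on the target‑language fraction of each batch. The linear shape, the function name `domain_fraction`, and the start and end values below are assumptions for illustration; the article only recommends avoiding a sudden jump.

```python
def domain_fraction(step, ramp_steps, start=0.1, end=0.7):
    """Linearly ramp the target-language data fraction from start to end."""
    if ramp_steps <= 0:
        return end  # no ramp requested: use the final ratio immediately
    progress = min(1.0, max(0.0, step / ramp_steps))
    return start + (end - start) * progress


# Fraction of Chinese data at a few checkpoints of a 1000-step ramp.
schedule = [domain_fraction(s, ramp_steps=1000) for s in (0, 250, 500, 1000, 2000)]
print([round(x, 3) for x in schedule])
```

The fraction returned at each step can be used as the domain weight when sampling training batches, with the remainder going to replay data.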

Outlook

Recent pretraining advances focus on architecture (e.g., Mixture‑of‑Experts, DeepSeek’s MLA) and data‑processing pipelines. While algorithmic improvements are important, data cleaning remains a bottleneck, especially for smaller teams. Future work is likely to shift toward more efficient domain alignment, long‑context handling, and multimodal transfer, where algorithmic expertise will become decisive.
