How to Use Importance Sampling for Effective Continue Pretraining of LLMs
Continued pretraining (CP) bridges large-scale pretraining and SFT to inject domain knowledge, but it risks catastrophic forgetting. This article explores importance sampling as a way to balance common and domain data, and discusses data selection, annealing strategies, and practical tips for mitigating forgetting while building specialized capabilities.
Background
Continued pretraining (CP) is a training stage placed between the large‑scale pretraining phase and the supervised fine‑tuning (SFT) phase of large language models (LLMs). Its goal is to inject additional domain‑specific knowledge—such as finance, law, or education—or capabilities like reasoning and generation into a base model that still has capacity to learn new information.
Catastrophic Forgetting
When a model is trained only on domain data, its distribution shifts toward that domain and the model’s general abilities deteriorate, a phenomenon known as catastrophic forgetting. Common mitigations include partial‑model training, regularization, reduced learning rates, and ensembling, but the most reliable method is to mix in a substantial amount of “common” (general‑purpose) data during CP.
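The data-mixing mitigation above can be sketched as a sampler that interleaves the two corpora at a fixed ratio. This is an illustrative toy (the function name `mixed_stream` and the in-memory lists are assumptions; real pipelines shard and shuffle on disk):

```python
import random

def mixed_stream(domain_data, common_data, domain_frac=0.3, seed=0):
    """Yield training examples, drawing from the domain corpus with
    probability `domain_frac` and from the common corpus otherwise.
    Stops when either corpus is exhausted."""
    rng = random.Random(seed)
    domain_it, common_it = iter(domain_data), iter(common_data)
    while True:
        source = domain_it if rng.random() < domain_frac else common_it
        try:
            yield next(source)
        except StopIteration:
            return

# Toy usage: "d" stands in for domain examples, "c" for common ones.
stream = mixed_stream(["d"] * 100, ["c"] * 100, domain_frac=0.3)
batch = [next(stream) for _ in range(10)]
```

Keeping the majority of each batch as common data anchors the model's distribution near the original pretraining mix, which is what suppresses forgetting.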
Importance Sampling
Importance sampling is a Monte‑Carlo technique for estimating expectations under a target distribution p_T(x) by drawing samples from a proposal distribution p_P(x) and re‑weighting them with the importance weight w(x)=p_T(x)/p_P(x). The proposal’s support must cover the target’s support, and the variance of w must be finite. In reinforcement learning, PPO uses importance sampling to reuse data generated by an older policy.
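As a toy illustration of the definition above (helper names such as `importance_estimate` are made up for this sketch), the estimator E_{p_T}[f(x)] ≈ (1/N) Σ w(x_i) f(x_i), with x_i drawn from p_P, looks like this:

```python
import math
import random

def importance_estimate(f, sample_proposal, log_p_target, log_p_proposal,
                        n=100_000, seed=0):
    """Estimate E_{x ~ p_T}[f(x)] using samples from p_P,
    re-weighted by w(x) = p_T(x) / p_P(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_proposal(rng)
        w = math.exp(log_p_target(x) - log_p_proposal(x))
        total += w * f(x)
    return total / n

def log_normal(x, mu, sigma):
    """Log-density of a 1-D Gaussian."""
    return (-0.5 * ((x - mu) / sigma) ** 2
            - math.log(sigma * math.sqrt(2 * math.pi)))

# Toy check: target N(1, 1), proposal N(0, 2); E_{p_T}[x] is 1.
# The proposal is wider than the target, so the weights stay bounded.
est = importance_estimate(
    f=lambda x: x,
    sample_proposal=lambda rng: rng.gauss(0.0, 2.0),
    log_p_target=lambda x: log_normal(x, 1.0, 1.0),
    log_p_proposal=lambda x: log_normal(x, 0.0, 2.0),
)
```

Note that the proposal here is deliberately wider than the target: if the proposal's support were narrower, some weights would blow up and the variance condition in the text would be violated.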
Combining Importance Sampling with Continue Pretraining
In CP, the original pretraining data distribution can be treated as the target p_T, while the readily available domain or open‑source data serves as the proposal p_P. By estimating the probability of a token sequence under the current CP model (proposal) and under the original base model before CP (target), we obtain importance weights that correct the sampling bias.
# Pseudocode for importance‑weighted CP loss
for batch in domain_data_loader:
    x = batch  # token sequence of t tokens
    with torch.no_grad():
        log_p_target = base_model.log_prob(x)       # frozen base model (target)
    log_p_proposal = cp_model.log_prob(x)           # current CP model (proposal)
    log_w = log_p_target - log_p_proposal.detach()  # log importance weight (no grad)
    w = torch.exp(log_w).clamp(max=MAX_WEIGHT)      # clip weights to bound variance
    loss = -(w * log_p_proposal).mean()             # per-sequence weighted cross‑entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
This loss encourages the CP model to stay close to the original pretraining distribution while still training on the easier‑to‑sample domain data.
Domain Data Selection & Annealing
Data selection. Choose domain data that closely matches the target capability (e.g., statutes and case law for a legal model). A common practice is to fine‑tune a small proxy model on each candidate dataset and check whether key metrics improve before scaling up.
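The proxy-model check described above is essentially a selection loop. A minimal sketch, with stand-in stubs for training and evaluation (the name `proxy_data_check` and the stub callables are assumptions, not the article's tooling):

```python
def proxy_data_check(candidate_corpora, train_proxy, evaluate):
    """Fine-tune a small proxy model on each candidate corpus and
    score it; return the best corpus name and all scores."""
    results = {}
    for name, corpus in candidate_corpora.items():
        proxy = train_proxy(corpus)      # e.g. a small (<1B) model in practice
        results[name] = evaluate(proxy)  # e.g. a domain benchmark score
    best = max(results, key=results.get)
    return best, results

# Usage with stubs (a real run would train an actual proxy model):
best, scores = proxy_data_check(
    {"statutes": ["a", "b", "c"], "news": ["a"]},
    train_proxy=lambda corpus: corpus,   # stub: "training" is identity
    evaluate=lambda proxy: len(proxy),   # stub: score = corpus size
)
```

The point of the proxy is cost: a cheap model surfaces whether a candidate corpus moves the target metrics at all before any expensive full-scale CP run.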
Learning‑rate annealing. Dynamically decay the learning rate during training. On well‑fitted data, annealing helps the model converge quickly to a local optimum; on under‑fitted data it can exacerbate under‑fitting. A practical recipe is to apply annealing on a small subset of domain data while mixing in a large proportion of common data. This accelerates fitting on the domain subset without erasing general abilities.
Typical mixing ratios are 5‑10 % domain data and 90‑95 % common data. The annealing schedule can be cosine decay or linear decay over the CP epochs.
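The cosine option above is a standard closed-form schedule. A minimal sketch (the function name `cosine_anneal` and the peak/floor learning rates are illustrative assumptions):

```python
import math

def cosine_anneal(step, total_steps, lr_max=1e-5, lr_min=1e-6):
    """Cosine learning-rate decay from lr_max to lr_min over the
    annealing phase of CP."""
    progress = min(step / max(total_steps, 1), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at a few points over a 1000-step anneal.
lrs = [cosine_anneal(s, 1000) for s in range(0, 1001, 250)]
```

The schedule starts at `lr_max`, ends at `lr_min`, and decays fastest in the middle, which fits the recipe of converging quickly on a small, well-fitted domain subset.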
Other Considerations
High‑quality evaluation data is essential; a noisy or biased evaluation set can mislead optimization and cause degradation despite extensive training. Construct both training and evaluation corpora carefully, ensuring they reflect the intended downstream tasks.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.