How to Use Importance Sampling for Effective Continue Pretraining of LLMs
Continued pretraining (CP) bridges large-scale pretraining and SFT to inject domain knowledge, but it risks catastrophic forgetting. This article explores importance sampling as a way to balance common and domain data, and discusses data selection, annealing strategies, and practical tips for mitigating forgetting while building specialized capabilities.
Background
Continued pretraining (CP) is a training stage placed between the large‑scale pretraining phase and the supervised fine‑tuning (SFT) phase of large language models (LLMs). Its goal is to inject additional domain‑specific knowledge—such as finance, law, or education—or capabilities like reasoning and generation into a base model that still has capacity to learn new information.
Catastrophic Forgetting
When a model is trained only on domain data, its distribution shifts toward that domain and the model’s general abilities deteriorate, a phenomenon known as catastrophic forgetting. Common mitigations include partial‑model training, regularization, reduced learning rates, and ensembling, but the most reliable method is to mix in a substantial amount of “common” (general‑purpose) data during CP.
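The data-mixing mitigation above can be sketched as a sampler that interleaves the two corpora at a fixed ratio. This is an illustrative toy (the function name `mixed_stream` and the in-memory lists are assumptions; real pipelines shard and shuffle on disk):

```python
import random

def mixed_stream(domain_data, common_data, domain_frac=0.3, seed=0):
    """Yield training examples, drawing from the domain corpus with
    probability `domain_frac` and from the common corpus otherwise.
    Stops when either corpus is exhausted."""
    rng = random.Random(seed)
    domain_it, common_it = iter(domain_data), iter(common_data)
    while True:
        source = domain_it if rng.random() < domain_frac else common_it
        try:
            yield next(source)
        except StopIteration:
            return

# Toy usage: "d" stands in for domain examples, "c" for common ones.
stream = mixed_stream(["d"] * 100, ["c"] * 100, domain_frac=0.3)
batch = [next(stream) for _ in range(10)]
```

Keeping the majority of each batch as common data anchors the model's distribution near the original pretraining mix, which is what suppresses forgetting.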
Importance Sampling
Importance sampling is a Monte‑Carlo technique for estimating expectations under a target distribution p_T(x) by drawing samples from a proposal distribution p_P(x) and re‑weighting them with the importance weight w(x)=p_T(x)/p_P(x). The proposal’s support must cover the target’s support, and the variance of w must be finite. In reinforcement learning, PPO uses importance sampling to reuse data generated by an older policy.
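As a toy illustration of the definition above (helper names such as `importance_estimate` are made up for this sketch), the estimator E_{p_T}[f(x)] ≈ (1/N) Σ w(x_i) f(x_i), with x_i drawn from p_P, looks like this:

```python
import math
import random

def importance_estimate(f, sample_proposal, log_p_target, log_p_proposal,
                        n=100_000, seed=0):
    """Estimate E_{x ~ p_T}[f(x)] using samples from p_P,
    re-weighted by w(x) = p_T(x) / p_P(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_proposal(rng)
        w = math.exp(log_p_target(x) - log_p_proposal(x))
        total += w * f(x)
    return total / n

def log_normal(x, mu, sigma):
    """Log-density of a 1-D Gaussian."""
    return (-0.5 * ((x - mu) / sigma) ** 2
            - math.log(sigma * math.sqrt(2 * math.pi)))

# Toy check: target N(1, 1), proposal N(0, 2); E_{p_T}[x] is 1.
# The proposal is wider than the target, so the weights stay bounded.
est = importance_estimate(
    f=lambda x: x,
    sample_proposal=lambda rng: rng.gauss(0.0, 2.0),
    log_p_target=lambda x: log_normal(x, 1.0, 1.0),
    log_p_proposal=lambda x: log_normal(x, 0.0, 2.0),
)
```

Note that the proposal here is deliberately wider than the target: if the proposal's support were narrower, some weights would blow up and the variance condition in the text would be violated.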
Combining Importance Sampling with Continue Pretraining
In CP, the original pretraining data distribution can be treated as the target p_T, while the readily available domain or open‑source data serves as the proposal p_P. By estimating the probability of a token sequence under the current CP model (proposal) and under the original base model before CP (target), we obtain importance weights that correct the sampling bias.
# Pseudocode for importance‑weighted CP loss
for batch in domain_data_loader:
    x = batch  # token sequence of t tokens
    with torch.no_grad():
        log_p_target = base_model.log_prob(x)       # frozen base model (target)
    log_p_proposal = cp_model.log_prob(x)           # current CP model (proposal)
    log_w = log_p_target - log_p_proposal.detach()  # log importance weight (no grad)
    w = torch.exp(log_w).clamp(max=MAX_WEIGHT)      # clip weights to bound variance
    loss = -(w * log_p_proposal).mean()             # per-sequence weighted cross‑entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
This loss encourages the CP model to stay close to the original pretraining distribution while still training on the easier‑to‑sample domain data.
Domain Data Selection & Annealing
Data selection. Choose domain data that closely matches the target capability (e.g., statutes and case law for a legal model). A common practice is to fine‑tune a small proxy model on each candidate dataset and check whether key metrics improve before scaling up.
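The proxy-model check described above is essentially a selection loop. A minimal sketch, with stand-in stubs for training and evaluation (the name `proxy_data_check` and the stub callables are assumptions, not the article's tooling):

```python
def proxy_data_check(candidate_corpora, train_proxy, evaluate):
    """Fine-tune a small proxy model on each candidate corpus and
    score it; return the best corpus name and all scores."""
    results = {}
    for name, corpus in candidate_corpora.items():
        proxy = train_proxy(corpus)      # e.g. a small (<1B) model in practice
        results[name] = evaluate(proxy)  # e.g. a domain benchmark score
    best = max(results, key=results.get)
    return best, results

# Usage with stubs (a real run would train an actual proxy model):
best, scores = proxy_data_check(
    {"statutes": ["a", "b", "c"], "news": ["a"]},
    train_proxy=lambda corpus: corpus,   # stub: "training" is identity
    evaluate=lambda proxy: len(proxy),   # stub: score = corpus size
)
```

The point of the proxy is cost: a cheap model surfaces whether a candidate corpus moves the target metrics at all before any expensive full-scale CP run.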
Learning‑rate annealing. Dynamically decay the learning rate during training. On well‑fitted data, annealing helps the model converge quickly to a local optimum; on under‑fitted data it can exacerbate under‑fitting. A practical recipe is to apply annealing on a small subset of domain data while mixing in a large proportion of common data. This accelerates fitting on the domain subset without erasing general abilities.
Typical mixing ratios are 5‑10 % domain data and 90‑95 % common data. The annealing schedule can be cosine decay or linear decay over the CP epochs.
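The cosine option above is a standard closed-form schedule. A minimal sketch (the function name `cosine_anneal` and the peak/floor learning rates are illustrative assumptions):

```python
import math

def cosine_anneal(step, total_steps, lr_max=1e-5, lr_min=1e-6):
    """Cosine learning-rate decay from lr_max to lr_min over the
    annealing phase of CP."""
    progress = min(step / max(total_steps, 1), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at a few points over a 1000-step anneal.
lrs = [cosine_anneal(s, 1000) for s in range(0, 1001, 250)]
```

The schedule starts at `lr_max`, ends at `lr_min`, and decays fastest in the middle, which fits the recipe of converging quickly on a small, well-fitted domain subset.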
Other Considerations
High‑quality evaluation data is essential; a noisy or biased evaluation set can mislead optimization and cause degradation despite extensive training. Construct both training and evaluation corpora carefully, ensuring they reflect the intended downstream tasks.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.