Can Data Mixing Laws Predict LLM Performance? A Deep Dive into Scaling Laws

This article reviews the paper “Data Mixing Laws: Optimizing Data Mixture by Predicting Language Modeling Performance”, explaining how the authors quantify the impact of data mixture ratios on LLM loss, propose a simple predictive model, validate it on RedPajama and multi‑domain mixes, and outline a scaling‑law procedure for continual pre‑training.

Baobao Algorithm Notes

Background

Data diversity and quality are crucial for large‑model pre‑training. The paper “Data Mixing Laws: Optimizing Data Mixture by Predicting Language Modeling Performance” (arXiv:2403.16952) proposes a quantitative, predictive relationship between data mixture ratios and model loss.

Problem

Previous work tunes mixture ratios heuristically, so there is no predictive model that links the proportion of each domain to the resulting loss.

Method

The authors start with two‑domain mixtures, measuring loss as a function of the proportion of domain A versus domain B. They fit a simple predictive function of the form c + k·exp(t·r), where r is the domain proportion and c, k, t are fitted constants. It can be viewed as a single‑layer network: a linear function of the proportion, an exponential activation, then a linear transform. The function is fitted on small‑scale experiments (e.g., 70M–410M models, up to 1B tokens) and extrapolated to larger settings.
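To make the fitting step concrete, here is a minimal sketch that fits a function of the form c + k·exp(t·r) to two-domain measurements. Everything specific in it is an assumption for illustration: the data are synthetic, the names `fit_exp_law` and `predict` are hypothetical, and the grid scan over c is just one simple way to fit the curve, not necessarily how the authors do it.

```python
import math

def fit_exp_law(proportions, losses, grid=2000):
    """Fit L(r) = c + k * exp(t * r).

    For a fixed c the model is linear in log space:
    log(L - c) = log(k) + t * r, so we scan c on a grid, solve that
    regression in closed form, and keep the c with the lowest squared error.
    """
    n = len(proportions)
    mean_r = sum(proportions) / n
    var_r = sum((r - mean_r) ** 2 for r in proportions)
    best = None
    for i in range(grid):
        c = min(losses) * i / grid  # stays below every loss, so the log is defined
        ys = [math.log(loss - c) for loss in losses]
        mean_y = sum(ys) / n
        t = sum((r - mean_r) * (y - mean_y) for r, y in zip(proportions, ys)) / var_r
        k = math.exp(mean_y - t * mean_r)
        sse = sum((c + k * math.exp(t * r) - loss) ** 2
                  for r, loss in zip(proportions, losses))
        if best is None or sse < best[0]:
            best = (sse, c, k, t)
    return best[1], best[2], best[3]

def predict(r, c, k, t):
    """Evaluate the fitted law at proportion r."""
    return c + k * math.exp(t * r)

# Synthetic "measurements" generated from known coefficients (assumed values).
true_c, true_k, true_t = 2.0, 1.5, -1.2
proportions = [0.1, 0.25, 0.4, 0.55, 0.7, 0.85]
losses = [predict(r, true_c, true_k, true_t) for r in proportions]

c_fit, k_fit, t_fit = fit_exp_law(proportions, losses)
predicted_loss = predict(1.0, c_fit, k_fit, t_fit)  # extrapolate to an unseen proportion
true_loss = predict(1.0, true_c, true_k, true_t)
print(f"c={c_fit:.3f} k={k_fit:.3f} t={t_fit:.3f} "
      f"predicted={predicted_loss:.4f} true={true_loss:.4f}")
```

With real measurements the losses are noisy, so a nonlinear least-squares routine with several random initializations would likely be a better choice than this grid scan.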

Key Findings (Two‑Domain)

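On RedPajama, the mixture selected with the fitted function for a 1B model trained on ≈100B tokens reaches performance comparable to the default mixture trained for 48% more steps. Put differently, the default mixture needs about 1.48× the steps to catch up, so the optimized mixture gets there in roughly 1/1.48 ≈ 68% of the steps.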

The same function can be used to schedule data dynamically during continual pre‑training, preventing catastrophic forgetting.

Extension to Multi‑Domain Mixing

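Experiments with three domains (e.g., GitHub, Pile‑CC, Books3) explore functional forms that satisfy two constraints: compatibility (the form reduces to the two‑domain function when only two domains are mixed) and symmetry (no domain is treated specially). The selected form has the same structure as the two‑domain function: an exponential of a linear combination of the domain proportions, followed by a linear transform.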

Multi‑domain function diagram
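A small sketch may help show what the multi-domain form looks like in practice. It assumes the form c + k·exp(Σ_j t_j·r_j) for the loss on one validation domain, uses made-up coefficients and candidate mixtures, and introduces the hypothetical helpers `mixing_law` and `two_domain_reduction`. The second helper checks the compatibility property: when only two domains are mixed and r_2 = 1 - r_1, the expression collapses to a single-variable exponential law.

```python
import math

def mixing_law(r, c, k, t):
    """Predicted loss c + k * exp(sum_j t[j] * r[j]) for mixture proportions r."""
    return c + k * math.exp(sum(tj * rj for tj, rj in zip(t, r)))

def two_domain_reduction(r1, c, k, t1, t2):
    """Same law with r2 = 1 - r1 substituted: c + (k * e^t2) * exp((t1 - t2) * r1)."""
    return c + (k * math.exp(t2)) * math.exp((t1 - t2) * r1)

# Hypothetical coefficients for one validation domain over three training domains.
c, k, t = 1.8, 1.2, [-1.5, -0.3, 0.4]

# Candidate mixtures (each sums to 1); pick the one with the lowest predicted loss.
candidates = [
    (0.6, 0.2, 0.2),
    (0.2, 0.6, 0.2),
    (0.2, 0.2, 0.6),
    (1 / 3, 1 / 3, 1 / 3),
]
predicted = {r: mixing_law(r, c, k, t) for r in candidates}
best = min(predicted, key=predicted.get)
print("best candidate:", best, "predicted loss:", round(predicted[best], 4))

# Compatibility check: the two-domain case is a one-variable exponential law.
r1 = 0.3
full = mixing_law((r1, 1 - r1), c, k, [-1.5, -0.3])
reduced = two_domain_reduction(r1, c, k, -1.5, -0.3)
print("two-domain forms agree:", abs(full - reduced) < 1e-12)
```

In a real setting the candidate list would be replaced by a denser search over the simplex of mixture proportions, using coefficients fitted to measured losses.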

Scaling‑Law Procedure

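for each mixture ratio (e.g., 1:2:3, 3:2:1, …):
    for each small model size (70M, 160M, 305M, 410M):
        1. Train on the mixture for a small number of steps and fit a scaling law over training steps to predict the loss at the target step count.
    2. Fit a scaling law over model size to those step‑extrapolated losses.
    3. Combine the two laws to predict the loss of the target model (e.g., 1B) at the target token budget (e.g., 100B).
fit the data mixing function to the predicted losses across all mixtures and pick the mixture with the lowest predicted loss.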
Scaling law diagram
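The nested extrapolation for a single mixture can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes power-law forms L(S) = E + B·S^(-β) over training steps and L(N) = E + A·N^(-α) over model size, replaces real training runs with a synthetic loss surface (`simulated_loss`), and uses a hypothetical helper `fit_power_law` based on a simple grid scan.

```python
import math

def fit_power_law(xs, losses, grid=20000):
    """Fit L(x) = E + B * x**(-beta) by scanning E on a grid and, for each E,
    solving the linear regression log(L - E) = log(B) - beta * log(x)."""
    log_x = [math.log(x) for x in xs]
    n = len(xs)
    mean_x = sum(log_x) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    best = None
    for i in range(grid):
        E = min(losses) * i / grid  # keeps L - E > 0 for every point
        ys = [math.log(loss - E) for loss in losses]
        mean_y = sum(ys) / n
        slope = sum((lx - mean_x) * (y - mean_y) for lx, y in zip(log_x, ys)) / sxx
        intercept = mean_y - slope * mean_x
        sse = sum((E + math.exp(intercept + slope * lx) - loss) ** 2
                  for lx, loss in zip(log_x, losses))
        if best is None or sse < best[0]:
            best = (sse, E, math.exp(intercept), -slope)
    return best[1:]  # (E, B, beta)

def power_law(x, E, B, beta):
    return E + B * x ** (-beta)

def simulated_loss(n_params, steps):
    """Synthetic stand-in for a training run (assumed surface, not measured data)."""
    return 1.7 + 400.0 * n_params ** -0.34 + 30.0 * steps ** -0.5

small_sizes = [70e6, 160e6, 305e6, 410e6]
observed_steps = [2000, 4000, 6000, 8000, 10000]
target_steps, target_size = 50000, 1e9

# Step 1: for each small model, extrapolate the loss along the training-step axis.
loss_at_target_steps = []
for n_params in small_sizes:
    curve = [simulated_loss(n_params, s) for s in observed_steps]
    step_law = fit_power_law(observed_steps, curve)
    loss_at_target_steps.append(power_law(target_steps, *step_law))

# Steps 2-3: extrapolate those predictions along the model-size axis.
size_law = fit_power_law(small_sizes, loss_at_target_steps)
predicted = power_law(target_size, *size_law)
actual = simulated_loss(target_size, target_steps)
print(f"predicted={predicted:.4f} actual={actual:.4f}")
```

Repeating this for every sampled mixture ratio yields one predicted target-scale loss per mixture, which is the input the mixing function is then fitted on.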

Application to Continual Pre‑Training

Using the predicted loss for different mixtures, a dynamic data‑scheduling plan is built. In experiments with Pile + Python code data on a Pythia‑70M base, the method identifies a mixture that preserves performance on the original domain while acquiring new knowledge, effectively avoiding catastrophic forgetting.

Continual pre‑training results
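One way to read "a mixture that preserves performance on the original domain" is as a threshold on the share of original data. The sketch below assumes the original-domain loss follows c + k·exp(t·r), with r the proportion of original data in the continual pre-training mix, and solves for the proportion at which the loss equals its value before continual training. The coefficients, the `loss_before` value, and the function name `critical_proportion` are all hypothetical.

```python
import math

def critical_proportion(c, k, t, loss_before):
    """Solve c + k * exp(t * r) = loss_before for r."""
    if k <= 0 or t == 0 or loss_before <= c:
        raise ValueError("no solution: need k > 0, t != 0 and loss_before > c")
    return math.log((loss_before - c) / k) / t

# Hypothetical fitted coefficients for the original-domain loss.
c, k, t = 2.0, 0.8, -2.0
loss_before = 2.3  # assumed original-domain loss before continual pre-training

r_star = critical_proportion(c, k, t, loss_before)
loss_at_r_star = c + k * math.exp(t * r_star)
print(f"critical proportion of original data: {r_star:.3f}")
```

Because t is negative here, the predicted loss falls as the share of original data rises, so any proportion at or above `r_star` is predicted to keep the original-domain loss from getting worse.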

Limitations

The predictive function is empirically derived and relies on strong priors about its form (a single‑layer network with an exponential activation). Its theoretical justification remains open, but it demonstrates a path toward quantitative data engineering for LLM pre‑training.

Code repository: https://github.com/yegcjs/mixinglaws
