Can Data Mixing Laws Predict LLM Performance? A Deep Dive into Scaling Laws

This article reviews the paper “Data Mixing Laws: Optimizing Data Mixture by Predicting Language Modeling Performance”, explaining how the authors quantify the impact of data mixture ratios on LLM loss, propose a simple predictive model, validate it on RedPajama and multi‑domain mixes, and outline a scaling‑law procedure for continual pre‑training.

Baobao Algorithm Notes

Mar 29, 2024

Background

Data diversity and quality are crucial for large‑model pre‑training. The paper “Data Mixing Laws: Optimizing Data Mixture by Predicting Language Modeling Performance” (arXiv:2403.16952) proposes a quantitative, predictable relationship between data mixture ratios and model loss.

Problem

Previous work mostly tunes mixture ratios heuristically; there has been no predictive model linking mixture proportions to loss.

Method

The authors start with two‑domain mixtures, measuring loss as a function of the proportion of domain A versus domain B. They fit a simple predictive function implemented as a single‑layer network with exponential activation followed by a linear transform. Combined with scaling laws, it lets results from small‑scale experiments (e.g., 70M–410M models trained for about 30B tokens) predict performance in larger settings.
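As a rough illustration of the two‑domain function, the sketch below fits L(r) = c + k·exp(t·r) to a handful of (proportion, loss) pairs. The data are synthetic, the parameter values are made up, and the grid‑search fitting routine (`fit_two_domain_law`) is a hypothetical helper chosen for simplicity; it is not the authors' fitting code.

```python
import math

def fit_two_domain_law(proportions, losses, t_grid):
    """Fit L(r) = c + k * exp(t * r) by grid search over t.

    For a fixed t the model is linear in (c, k), so ordinary least
    squares gives c and k in closed form.
    """
    best = None
    n = len(proportions)
    for t in t_grid:
        x = [math.exp(t * r) for r in proportions]
        mean_x = sum(x) / n
        mean_y = sum(losses) / n
        var_x = sum((xi - mean_x) ** 2 for xi in x)
        if var_x == 0.0:
            continue  # t == 0 makes the feature constant, nothing to fit
        k = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, losses)) / var_x
        c = mean_y - k * mean_x
        sse = sum((c + k * xi - yi) ** 2 for xi, yi in zip(x, losses))
        if best is None or sse < best[0]:
            best = (sse, c, k, t)
    _, c, k, t = best
    return c, k, t

def predict_loss(r, c, k, t):
    """Predicted validation loss at domain proportion r."""
    return c + k * math.exp(t * r)

# Synthetic observations generated from made-up parameters (assumption).
true_c, true_k, true_t = 1.5, 2.0, -3.0
proportions = [0.25, 0.30, 0.35, 0.40, 0.45]
losses = [predict_loss(r, true_c, true_k, true_t) for r in proportions]

t_grid = [i / 10 for i in range(-80, 81)]  # candidate exponents from -8.0 to 8.0
c, k, t = fit_two_domain_law(proportions, losses, t_grid)
```

Once `c`, `k`, and `t` are fitted, `predict_loss` can be evaluated at proportions that were never trained, which is what makes the function usable for mixture selection.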

Key Findings (Two‑Domain)

For a 1B model trained for 100B tokens on RedPajama, the mixture optimized with this function reaches performance comparable to a model trained on the default mixture for 48% more steps.

The same function can also guide data scheduling during continual pre‑training by predicting the critical mixture proportion that avoids catastrophic forgetting.

Extension to Multi‑Domain Mixing

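Experiments with three domains (e.g., GitHub, Pile‑CC, Books3) explore functional forms that satisfy two requirements: compatibility (the form reduces to the two‑domain function when only two domains are mixed) and symmetry (swapping domains does not change the form). The selected form generalizes the two‑domain function: it is again a single‑layer network with an exp activation, now taking all domain proportions as input.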
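One way to see the link to the two‑domain case is to evaluate a multi‑domain form c + k·exp(Σ t_j·r_j) with two proportions that sum to 1. The sketch below assumes that form and uses made‑up parameter values; the function names are hypothetical.

```python
import math

def mixing_law_loss(proportions, c, k, t):
    """Predicted loss c + k * exp(sum_j t[j] * r[j]) for one validation domain."""
    return c + k * math.exp(sum(tj * rj for tj, rj in zip(t, proportions)))

def two_domain_loss(r1, c, k, t):
    """Two-domain function of a single proportion r1."""
    return c + k * math.exp(t * r1)

# Hypothetical parameters, chosen only for illustration.
c, k, t = 1.2, 0.8, [-2.0, 0.5]
r1 = 0.3

multi = mixing_law_loss([r1, 1 - r1], c, k, t)

# With r2 = 1 - r1: exp(t1*r1 + t2*(1 - r1)) = exp(t2) * exp((t1 - t2) * r1),
# so the second domain is absorbed into a rescaled k and a shifted t.
reduced = two_domain_loss(r1, c, k * math.exp(t[1]), t[0] - t[1])
```

Here `multi` and `reduced` agree up to floating‑point error, which is the compatibility property in miniature.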

Scaling‑Law Procedure

for each mixture ratio (e.g., 1:2:3, 3:2:1, …):
    for each model size (70M, 160M, 305M, 410M):
        1. Train for a small number of steps and fit a scaling law over training steps to predict the loss at the target step count.
    2. Across model sizes, fit a scaling law over model size to the step‑extrapolated losses and predict the loss of the target model (e.g., 1B) at the target token budget (e.g., 100B).
3. Fit the data mixing law to the predicted losses of all mixtures and pick the mixture with the lowest predicted loss.
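The nested extrapolation in the procedure above can be sketched for a single mixture as follows. This is a toy under stated assumptions: both scaling laws are taken to have the common form y = e + a·x^(−p), the loss values come from a made‑up `synthetic_loss` function rather than real training runs, the step counts are illustrative, and `fit_power_law` is a hypothetical grid‑search helper.

```python
def fit_power_law(xs, ys, exponents):
    """Fit y = e + a * x**(-p): grid search over p, least squares for (e, a)."""
    best = None
    n = len(xs)
    for p in exponents:
        f = [x ** (-p) for x in xs]
        mean_f, mean_y = sum(f) / n, sum(ys) / n
        var_f = sum((fi - mean_f) ** 2 for fi in f)
        a = sum((fi - mean_f) * (yi - mean_y) for fi, yi in zip(f, ys)) / var_f
        e = mean_y - a * mean_f
        sse = sum((e + a * fi - yi) ** 2 for fi, yi in zip(f, ys))
        if best is None or sse < best[0]:
            best = (sse, e, a, p)
    _, e, a, p = best
    return lambda x: e + a * x ** (-p)

def synthetic_loss(n_params, steps):
    # Made-up ground truth, used only to generate toy observations.
    return 1.8 + 400.0 * n_params ** -0.3 + 20.0 * steps ** -0.5

exponent_grid = [i / 100 for i in range(5, 101)]  # candidate exponents 0.05 .. 1.00
model_sizes = [70e6, 160e6, 305e6, 410e6]
observed_steps = [5_000, 10_000, 15_000, 20_000, 25_000, 30_000]
target_steps, target_size = 100_000, 1e9  # illustrative targets

# Step 1: per model size, extrapolate the loss curve to the target step count.
losses_at_target_steps = []
for n_params in model_sizes:
    curve = [synthetic_loss(n_params, s) for s in observed_steps]
    step_law = fit_power_law(observed_steps, curve, exponent_grid)
    losses_at_target_steps.append(step_law(target_steps))

# Step 2: across model sizes, extrapolate those losses to the target model size.
size_law = fit_power_law(model_sizes, losses_at_target_steps, exponent_grid)
predicted_target_loss = size_law(target_size)
```

Repeating this for every mixture ratio yields one predicted target loss per mixture, which is the input the mixing function is then fitted on.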

Application to Continual Pre‑Training

Using the predicted loss for different mixtures, a dynamic data‑scheduling plan is built. In experiments with Pile + Python code data on a Pythia‑70M base, the method identifies a mixture that preserves performance on the original domain while acquiring new knowledge, effectively avoiding catastrophic forgetting.
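To make the idea of a forgetting‑free mixture concrete, the sketch below solves the two‑domain function for the smallest share of original‑domain data that keeps the original‑domain loss at or below a baseline. The parameter values and the `critical_proportion` helper are hypothetical, and the closed‑form inversion is one possible way to read off such a threshold, not necessarily how the paper computes it.

```python
import math

def critical_proportion(c, k, t, baseline_loss):
    """Smallest share r of original-domain data with c + k*exp(t*r) <= baseline_loss.

    Assumes k > 0 and t < 0, i.e. the loss on the original domain falls as
    its share of the mixture grows. Returns None when no r in [0, 1] works.
    """
    if baseline_loss <= c:
        return None  # the fitted law never gets down to the baseline
    r = math.log((baseline_loss - c) / k) / t
    if r > 1.0:
        return None  # even training only on the original domain is not enough
    return max(r, 0.0)

# Hypothetical fitted parameters and baseline loss of the base model.
c, k, t = 2.0, 1.5, -4.0
baseline_loss = 2.3

r_star = critical_proportion(c, k, t, baseline_loss)
loss_at_r_star = c + k * math.exp(t * r_star)
```

Any mixture with at least `r_star` original‑domain data is predicted to hold the original‑domain loss at the baseline, leaving the rest of the budget for the new domain.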

Limitations

The predictive function is empirically derived and relies on strong architectural priors (single‑layer network, exponential activation). Its theoretical justification remains open, but it demonstrates a path toward quantitative data‑engineering for LLM pre‑training.

Code repository: https://github.com/yegcjs/mixinglaws

