Can Data Mixing Laws Predict LLM Performance? A Deep Dive into Scaling Laws
This article reviews the paper “Data Mixing Laws: Optimizing Data Mixture by Predicting Language Modeling Performance”, explaining how the authors quantify the impact of data mixture ratios on LLM loss, propose a simple predictive function, validate it on RedPajama and multi‑domain mixes, outline a scaling‑law procedure for predicting large‑scale loss from small runs, and apply the approach to continual pre‑training.
Background
Data diversity and quality are crucial for large‑model pre‑training. The paper “Data Mixing Laws: Optimizing Data Mixture by Predicting Language Modeling Performance” (arXiv:2403.16952) proposes quantitative relationships between data mixture ratios and model loss.
Problem
Previous work mostly sets mixture ratios heuristically; it offers no predictive model that links mixture proportions to loss.
Method
The authors start with two‑domain mixtures, measuring loss as a function of the proportion of domain A versus domain B. They fit a simple predictive function that can be read as a single‑layer network: a linear combination of the proportions passes through an exponential activation, followed by a linear transform. The function is fitted on small‑scale experiments (e.g., 70M–410M models, up to 1B tokens) and then used to predict loss in larger settings.
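As a rough illustration of how such a function could be fitted, the sketch below assumes the two‑domain form L(r) = c + k·exp(t·r), where r is the proportion of one domain. The data points are synthetic, and `fit_mixing_law` and `predict` are hypothetical helpers written for this example, not code from the paper's repository. For a fixed t the model is linear in c and k, so the sketch grid‑searches t and solves for c and k in closed form.

```python
import math

def fit_mixing_law(ratios, losses, t_grid):
    """Fit L(r) = c + k * exp(t * r) by grid search over t.

    For a fixed t the model is linear in (c, k), so those two
    parameters come from ordinary least squares in closed form.
    """
    n = len(ratios)
    mean_l = sum(losses) / n
    best = None
    for t in t_grid:
        e = [math.exp(t * r) for r in ratios]
        mean_e = sum(e) / n
        var_e = sum((x - mean_e) ** 2 for x in e)
        if var_e == 0.0:  # t == 0 makes exp(t * r) constant; skip it
            continue
        cov = sum((x - mean_e) * (y - mean_l) for x, y in zip(e, losses))
        k = cov / var_e
        c = mean_l - k * mean_e
        sse = sum((c + k * x - y) ** 2 for x, y in zip(e, losses))
        if best is None or sse < best[0]:
            best = (sse, c, k, t)
    return best[1:]  # (c, k, t)

def predict(params, r):
    """Evaluate the fitted function at proportion r."""
    c, k, t = params
    return c + k * math.exp(t * r)

# Synthetic validation losses generated from an assumed ground truth
# (illustration only; real points would come from small training runs).
true_params = (2.0, 1.5, -3.0)
ratios = [0.1, 0.25, 0.4, 0.6, 0.8]
losses = [predict(true_params, r) for r in ratios]

t_grid = [i / 10 for i in range(-80, 81)]  # -8.0 .. 8.0 in steps of 0.1
params = fit_mixing_law(ratios, losses, t_grid)
unseen = predict(params, 0.5)  # predicted loss at a proportion not in the fit set
```

With noisy real measurements, a general nonlinear least‑squares routine would probably be a better choice than a fixed grid; the grid keeps this sketch dependency‑free.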
Key Findings (Two‑Domain)
For a 1B model trained on RedPajama for ≈100B tokens, the mixture selected by the function reaches a validation loss comparable to that of the default mixture trained for 48% more steps.
The same function can be used to schedule data during continual pre‑training, helping to avoid catastrophic forgetting.
Extension to Multi‑Domain Mixing
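Experiments with three domains (e.g., GitHub, Pile‑CC, Books3) explore candidate functional forms that satisfy two requirements: compatibility (the form reduces to the two‑domain function when only two domains are mixed) and symmetry (swapping domains does not change the form). The selected form generalizes the two‑domain single‑layer network: an exponential applied to a linear combination of all domain proportions, followed by a linear transform.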
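To make the multi‑domain form concrete, here is a minimal sketch that assumes each validation domain's loss follows c + k·exp(Σ_j t_j·r_j) and then searches a coarse grid over the simplex for the mixture with the lowest mean predicted loss. The parameter values are made up for illustration, the mean‑loss objective is an assumption, and `mixing_law`, `simplex_grid` and `best_mixture` are hypothetical names.

```python
import math

def mixing_law(params, ratios):
    """Predicted loss c + k * exp(sum_j t_j * r_j) for one validation domain."""
    c, k, t = params
    return c + k * math.exp(sum(tj * rj for tj, rj in zip(t, ratios)))

def simplex_grid(n_domains, steps):
    """Yield every mixture whose proportions are multiples of 1/steps and sum to 1."""
    def rec(remaining, slots):
        if slots == 1:
            yield (remaining,)
            return
        for i in range(remaining + 1):
            for rest in rec(remaining - i, slots - 1):
                yield (i,) + rest
    for combo in rec(steps, n_domains):
        yield tuple(i / steps for i in combo)

def best_mixture(per_domain_params, steps=20):
    """Return the grid mixture that minimizes the mean predicted loss over domains."""
    n = len(per_domain_params)
    def mean_loss(r):
        return sum(mixing_law(p, r) for p in per_domain_params) / n
    return min(simplex_grid(n, steps), key=mean_loss)

# Hypothetical fitted parameters (c, k, [t_1, t_2, t_3]), one tuple per validation domain.
params = [
    (1.0, 1.0, [-2.0, 0.3, 0.1]),
    (1.8, 1.2, [0.2, -1.5, -0.2]),
    (2.2, 0.9, [0.1, -0.3, -1.8]),
]
mixture = best_mixture(params)  # tuple of three proportions summing to 1
```

Swapping two domains only permutes the entries of `t` and `ratios`, which is the symmetry property in code form; with a single free proportion the exponent collapses to the two‑domain case.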
Scaling‑Law Procedure
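The procedure in this section chains two extrapolations: first along training steps, then along model size. The sketch that follows shows that nesting under stated assumptions: each law is taken to be a power law of the form E + A·x^(−α), the "training runs" are replaced by a synthetic stand‑in function, and the target step count is a made‑up value. None of the names or numbers come from the paper.

```python
def fit_power_law(xs, ys, alpha_grid):
    """Fit y = E + A * x**(-alpha).

    Grid search over alpha; for each alpha, E and A follow from
    closed-form least squares because the model is linear in them.
    """
    n = len(xs)
    mean_y = sum(ys) / n
    best = None
    for alpha in alpha_grid:
        f = [x ** (-alpha) for x in xs]
        mean_f = sum(f) / n
        var_f = sum((v - mean_f) ** 2 for v in f)
        cov = sum((v - mean_f) * (y - mean_y) for v, y in zip(f, ys))
        A = cov / var_f
        E = mean_y - A * mean_f
        sse = sum((E + A * v - y) ** 2 for v, y in zip(f, ys))
        if best is None or sse < best[0]:
            best = (sse, E, A, alpha)
    return best[1:]  # (E, A, alpha)

def extrapolate(fit, x):
    """Evaluate a fitted power law at x."""
    E, A, alpha = fit
    return E + A * x ** (-alpha)

def synthetic_loss(size_m, steps_k):
    """Stand-in for a real training run (assumed ground truth, illustration only)."""
    return 1.7 + 3.0 * size_m ** -0.3 + 2.0 * steps_k ** -0.5

alpha_grid = [i / 100 for i in range(1, 101)]  # 0.01 .. 1.00
sizes_m = [70, 160, 305, 410]                  # model sizes, millions of parameters
steps_k = [5, 10, 15, 20, 25, 30]              # observed checkpoints, thousands of steps
target_size_m, target_steps_k = 1000, 100      # hypothetical target: 1B model, 100k steps

# Stage 1: for each model size, extrapolate along training steps.
loss_at_target_steps = []
for n in sizes_m:
    observed = [synthetic_loss(n, s) for s in steps_k]
    step_fit = fit_power_law(steps_k, observed, alpha_grid)
    loss_at_target_steps.append(extrapolate(step_fit, target_steps_k))

# Stage 2: extrapolate those per-size predictions along model size.
size_fit = fit_power_law(sizes_m, loss_at_target_steps, alpha_grid)
predicted = extrapolate(size_fit, target_size_m)  # loss estimate for one mixture
```

Running this once per candidate mixture yields one predicted target‑scale loss per mixture, which is the input a mixing function would then be fitted on.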
for each mixture ratio (e.g., 1:2:3, 3:2:1, …):
    for each model size (70M, 160M, 305M, 410M):
        1. Train for a small number of steps and fit a scaling law over training steps to predict the loss at the target step count.
    2. Fit a scaling law over model size to these per‑size predictions.
    3. Evaluate the combined laws at the target model size (e.g., 1B) and token budget (e.g., 100B) to get the predicted loss for this mixture.
Application to Continual Pre‑Training
Using the predicted loss for different mixtures, a dynamic data‑scheduling plan is built. In experiments with Pile + Python code data on a Pythia‑70M base, the method identifies a mixture that preserves performance on the original domain while acquiring new knowledge, effectively avoiding catastrophic forgetting.
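One simple way to turn such predictions into a mixture choice is sketched here. It assumes the original‑domain loss has already been fitted as c + k·exp(t·r_new), where r_new is the share of new‑domain data, and it picks the largest share whose predicted original‑domain loss stays within a chosen budget. The budget rule, the parameter values and the function names are assumptions made for illustration, not the paper's exact procedure.

```python
import math

def predicted_loss(params, r_new):
    """Mixing-function prediction c + k * exp(t * r_new)."""
    c, k, t = params
    return c + k * math.exp(t * r_new)

def critical_proportion(orig_params, loss_budget, grid=1000):
    """Largest grid proportion of new-domain data whose predicted
    original-domain loss does not exceed loss_budget (None if there is none)."""
    best = None
    for i in range(grid + 1):
        r = i / grid
        if predicted_loss(orig_params, r) <= loss_budget:
            best = r
    return best

# Hypothetical fits: original-domain loss rises as more new-domain data is mixed in,
# while new-domain loss falls.
orig_params = (2.0, 0.05, 3.0)
new_params = (1.2, 1.0, -2.5)
budget = 2.20  # assumed tolerable original-domain loss

r_star = critical_proportion(orig_params, budget)
new_loss_at_r_star = predicted_loss(new_params, r_star)
```

Any share above `r_star` is predicted to push the original‑domain loss past the budget, so `r_star` marks the trade‑off point between retaining old knowledge and learning the new domain under these assumed fits.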
Limitations
The predictive function is empirically derived and relies on a strong assumption about its functional form (a single‑layer network with exponential activation). Its theoretical justification remains open, but it demonstrates a path toward quantitative data engineering for LLM pre‑training.
Code repository: https://github.com/yegcjs/mixinglaws
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
