Why Small Models Can Never Match Large Models, Even with Unlimited Data
The article analyzes scaling laws and synthetic experiments to show that, due to power‑law data distributions and interference, some tasks remain unreachable for small models even with infinite data, a finding confirmed on real LLMs such as OLMo.
Modern machine‑learning consensus holds that larger models achieve lower loss, but the deeper question is why large models can learn tasks that small models never can, even when the latter are trained on unlimited data.
Theoretical Framework: Power‑Law Scaling and the "Unreachable Region"
Building on classic scaling laws, the authors distinguish two concepts:
Learnable by data scaling: a small model’s higher loss is solely due to insufficient data; more data would close the gap.
Requires model scaling: even with asymptotically infinite data, a small model cannot reach the loss level of a larger model for a portion of the data distribution.
Figure 1 illustrates this split: the purple region denotes loss levels attainable by both small (Nₛ) and large (Nₗ) models under finite resources, while the orange region marks loss levels only reachable by large models.
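The distinction can be made concrete with a familiar parametric scaling form. The form below is an assumption borrowed from classic scaling-law work, not an equation quoted from the paper; E, A, B, α and β are generic fit constants, N is model size and D is dataset size.

```latex
% Assumed parametric form (not quoted from the paper)
L(N, D) = E + A\,N^{-\alpha} + B\,D^{-\beta}

% With unlimited data the data term vanishes, leaving a size-dependent floor
\lim_{D \to \infty} L(N, D) = E + A\,N^{-\alpha}

% The gap between a small and a large model that data alone cannot close
L_\infty(N_s) - L_\infty(N_l) = A\left(N_s^{-\alpha} - N_l^{-\alpha}\right) > 0
\quad \text{for } N_s < N_l
```

Under this assumed form, any loss level below E + A·Nₛ^(−α) falls in the region that requires model scaling, while levels above it are learnable by data scaling.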
Synthetic Experiments: A Three‑Act Play
To probe the mechanism, the authors construct a controllable multi‑task linear‑regression setting with K orthogonal feature blocks, task frequencies following a power‑law, and task difficulty controlled by spectral decay.
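A data generator in the spirit of this setup might look like the sketch below. It is a minimal illustration under stated assumptions, not the authors' code: the function name, the default exponents, the teacher weights and the choice of one active block per sample are all hypothetical.

```python
import numpy as np

def make_multitask_data(n_samples, n_tasks=4, block_dim=8,
                        freq_exp=1.5, decays=(2.0, 1.5, 1.0, 0.5), seed=0):
    """Sample a toy multi-task linear-regression dataset (assumed design)."""
    rng = np.random.default_rng(seed)

    # Task frequencies follow a power law: p_k proportional to k^(-freq_exp).
    probs = np.arange(1, n_tasks + 1, dtype=float) ** -freq_exp
    probs /= probs.sum()
    task_ids = rng.choice(n_tasks, size=n_samples, p=probs)

    # Per-task spectrum: the j-th feature of task k has variance j^(-decay_k).
    # A faster decay means a simpler task.
    ranks = np.arange(1, block_dim + 1, dtype=float)
    spectra = ranks[None, :] ** -np.asarray(decays, dtype=float)[:, None]

    # Each sample activates only its own task's block, so blocks are orthogonal.
    z = rng.standard_normal((n_samples, block_dim)) * np.sqrt(spectra[task_ids])
    X = np.zeros((n_samples, n_tasks * block_dim))
    cols = task_ids[:, None] * block_dim + np.arange(block_dim)[None, :]
    X[np.arange(n_samples)[:, None], cols] = z

    # Hypothetical linear teacher that defines the regression targets.
    w_true = rng.standard_normal(n_tasks * block_dim)
    y = X @ w_true
    return X, y, task_ids

X, y, task_ids = make_multitask_data(5000)
```

Because every sample touches a single block, task frequency and task difficulty can be varied independently, which is what makes the setting controllable.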
Act 1 – Feature‑Utility Ordering (Theorem 3)
Theorem 3 states that models learn features in order of decreasing utility. Consequently, high‑frequency (large‑scale) and simple (fast‑decaying) tasks are learned first, while rare and complex tasks require sufficient model width to be represented.
Figure 2 shows that the observed loss trajectory matches the utility‑based theoretical prediction.
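The ordering can be sketched numerically. The snippet below assumes that a feature's utility is its task frequency multiplied by its eigenvalue, and that a model of width N represents the N highest-utility features; both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np

def feature_utilities(n_tasks=4, block_dim=8, freq_exp=1.5,
                      decays=(2.0, 1.5, 1.0, 0.5)):
    """Utility of feature j in task k, assumed to be p_k * lambda_{k,j}."""
    probs = np.arange(1, n_tasks + 1, dtype=float) ** -freq_exp
    probs /= probs.sum()
    ranks = np.arange(1, block_dim + 1, dtype=float)
    spectra = ranks[None, :] ** -np.asarray(decays, dtype=float)[:, None]
    return probs[:, None] * spectra  # shape (n_tasks, block_dim)

def features_learned(utilities, width):
    """Count, per task, how many features land in the top-`width` by utility."""
    top = np.argsort(utilities, axis=None)[::-1][:width]
    task_ids, _ = np.unravel_index(top, utilities.shape)
    return np.bincount(task_ids, minlength=utilities.shape[0])

u = feature_utilities()
narrow = features_learned(u, width=4)   # capacity for four features
wide = features_learned(u, width=32)    # capacity for every feature
```

With these assumed defaults, the width-4 model spends its capacity on the three most frequent tasks and none on the rarest one, while the width-32 model covers all four.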
Act 2 – Resource Competition and Residual Control (Theorem 4 & Corollary 5)
When the model is wide enough, it fully explains the covariance of the frequent tasks, so their residual signal becomes tiny. This frees capacity for rare tasks:
Gradient updates for frequent tasks become weak.
Remaining neurons are allocated to rare tasks.
Figure 3 visualizes how residual control enables rare‑task learning as the model grows.
Act 3 – Memory Retention vs. Update‑Forget Cycle (Proposition 6)
Proposition 6 highlights catastrophic interference: even when rare tasks receive occasional updates, small models quickly overwrite them with gradients from common tasks, leading to an update‑forget loop. Large models retain the injected signal and accumulate learning.
Figure 4 compares a small model (N=32) that loses the rare‑task signal after each injection with a large model (N=256) that preserves and builds upon it.
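The update‑forget loop can be reproduced in a toy linear model. The sketch below is not the paper's construction: it assumes random unit-norm feature directions, 64 common tasks, a single rare task, and full-batch gradient descent on the common tasks after one rare-task update. The widths 32 and 256 are taken from Figure 4; everything else is hypothetical.

```python
import numpy as np

def retained_fraction(width, n_common=64, steps=3000, seed=0):
    """Fraction of a rare-task correction that survives common-task training."""
    rng = np.random.default_rng(seed)
    feats = rng.standard_normal((n_common + 1, width))
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    common, rare = feats[:-1], feats[-1]
    targets = rng.standard_normal(n_common)
    rare_target = 1.0

    # Start from the least-squares fit of the common tasks.
    w, *_ = np.linalg.lstsq(common, targets, rcond=None)
    err_before = abs(rare_target - w @ rare)

    # Inject one rare-task update that fits the rare example exactly.
    w = w + (rare_target - w @ rare) * rare

    # Keep training on the common tasks only, with a stable step size.
    hessian = common.T @ common / n_common
    lr = 1.0 / np.linalg.eigvalsh(hessian)[-1]
    for _ in range(steps):
        grad = common.T @ (common @ w - targets) / n_common
        w = w - lr * grad

    err_after = abs(rare_target - w @ rare)
    return 1.0 - err_after / err_before

small = retained_fraction(32)    # common features span the whole space
large = retained_fraction(256)   # spare directions survive
```

In this toy, the width-32 model has fewer dimensions than common tasks, so every direction is pulled back toward the common-task solution and the rare correction is erased. The width-256 model keeps the part of the rare feature that lies outside the span of the common features.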
OLMo Pre‑training Validation: From Toy to Real LLM
The authors test the hypothesis on the OLMo architecture (4 M → 4 B parameters) using the Dolma v1.7 corpus with two controlled tasks:
T_CMP: compare the numeric values of two tokens.
T_ADD: modulo‑100 addition (a classic grokking task).
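For concreteness, examples of the two tasks could be generated as in the sketch below. The prompt strings, the label symbols and the value range for T_CMP are assumptions made for illustration; the source does not specify how the tasks are serialized into the corpus.

```python
import random

def make_cmp_example(rng, max_value=99):
    """T_CMP-style example: compare two integers (hypothetical text format)."""
    a, b = rng.randrange(max_value + 1), rng.randrange(max_value + 1)
    label = ">" if a > b else "<" if a < b else "="
    return f"{a} ? {b}", label

def make_add_example(rng, modulus=100):
    """T_ADD-style example: addition modulo 100 (hypothetical text format)."""
    a, b = rng.randrange(modulus), rng.randrange(modulus)
    return f"{a} + {b} =", str((a + b) % modulus)

rng = random.Random(0)
cmp_prompt, cmp_label = make_cmp_example(rng)
add_prompt, add_label = make_add_example(rng)
```

Mixing such examples into the pre-training stream at a chosen rate is one way to control how rare each task is.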
Results:
Small models (4 M, 20 M) fail on rare tasks, performing near random.
Large models (300 M, 1 B, 4 B) succeed, with accuracy improving as scale increases.
Tasks are learned in order of decreasing frequency.
Figures 5 and 6 present behavioral evidence of this scaling effect.
Representation Evidence
Using Distributed Alignment Search, the authors locate task‑specific features: T_CMP aligns with a 1‑dimensional subspace in the first residual stream, while T_ADD aligns with Fourier‑mode features. Larger models embed more of these features, and the amount of embedded feature correlates strongly with test accuracy (Figure 7).
Gradient Evidence
Analysis of the first MLP layer shows that large models have higher cosine similarity between batch gradients and the task direction (0.08 ± 0.02) and that gradients of non‑task tokens are nearly orthogonal, indicating minimal interference. Small models exhibit random gradient collisions (0.10 ± 0.09), confirming that common‑task updates overwrite rare‑task signals (Figures 8 and 9).
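The alignment metric itself is a plain cosine similarity between a flattened batch gradient and a task direction. The helper below is a minimal sketch; how the authors extract the task direction and which parameters they flatten are not stated here, so both inputs are treated as given arrays and the function name is hypothetical.

```python
import numpy as np

def cosine_alignment(batch_grad, task_direction):
    """Cosine similarity between a batch gradient and a task direction."""
    g = np.ravel(np.asarray(batch_grad, dtype=float))
    d = np.ravel(np.asarray(task_direction, dtype=float))
    denom = np.linalg.norm(g) * np.linalg.norm(d)
    # Return 0.0 for a zero vector instead of dividing by zero.
    return float(g @ d / denom) if denom > 0 else 0.0

aligned = cosine_alignment([[1.0, 0.0], [0.0, 0.0]], [[2.0, 0.0], [0.0, 0.0]])
orthogonal = cosine_alignment([1.0, 0.0], [0.0, 3.0])
```

Values near zero for non-task tokens indicate that their gradients barely move the weights along the task direction, which is the sense of "minimal interference" used above.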
Core Hypothesis: Scaling Reduces Interference
In identical training settings, larger models learn the tail of the data distribution more effectively. When a rare task is observed, the large model retains part of the update and accumulates it on subsequent observations, whereas the small model’s parameters are quickly dominated by frequent‑task updates, creating an "update‑forget" cycle.
This leads to two practical insights:
Memorization is beneficial for rare tasks; retaining training instances is essential for generalization.
Designing data mixtures that increase the frequency of target rare tasks can be more efficient than blindly scaling model size.
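The second insight amounts to reweighting the sampling distribution over tasks. The sketch below shows the arithmetic under assumed numbers; the task names, the starting frequencies and the boost factor are hypothetical.

```python
def upweight_task(freqs, task, factor):
    """Multiply one task's sampling weight by `factor`, then renormalize."""
    boosted = dict(freqs)
    boosted[task] *= factor
    total = sum(boosted.values())
    return {name: weight / total for name, weight in boosted.items()}

mixture = {"common": 0.99, "rare": 0.01}
new_mixture = upweight_task(mixture, "rare", factor=10)
```

Here a tenfold boost raises the rare task from 1% of samples to roughly 9%, while the common task still dominates the mixture.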
Why Larger Models Learn More: Effects of Capacity, Interference, and Rare‑Task Retention
https://arxiv.org/abs/2605.29548
