Why Small Models Can Never Match Large Models, Even with Unlimited Data

The article analyzes scaling laws and synthetic experiments to show that, due to power‑law data distributions and interference, some tasks remain unreachable for small models even with infinite data, a finding confirmed on real LLMs such as OLMo.

PaperAgent

Jun 9, 2026

Modern machine‑learning consensus holds that larger models achieve lower loss, but the deeper question is why large models can learn tasks that small models never can, even when the latter are trained on unlimited data.

Theoretical Framework: Power‑Law Scaling and the "Unreachable Region"

Building on classic scaling laws, the authors distinguish two concepts:

Learnable by data scaling: a small model’s higher loss is solely due to insufficient data; more data would close the gap.

Requires model scaling: even with asymptotically infinite data, a small model cannot reach the loss level of a larger model for a portion of the data distribution.

Figure 1 illustrates this split: the purple region denotes loss levels attainable by both small (Nₛ) and large (Nₗ) models under finite resources, while the orange region marks loss levels only reachable by large models.
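
The split between the two regions can be illustrated with the widely used parametric scaling-law form L(N, D) = E + A/N^alpha + B/D^beta. The sketch below is only an illustration of that generic form: the constants are placeholder values chosen for readability, not numbers fitted in the paper, and the helper names are hypothetical.

```python
def loss(n_params, n_tokens, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """Generic parametric scaling law: L(N, D) = E + A / N^alpha + B / D^beta.

    All constants are illustrative placeholders (assumption), not fitted values.
    """
    return E + A / n_params**alpha + B / n_tokens**beta


def infinite_data_floor(n_params, E=1.7, A=400.0, alpha=0.34):
    """Limit of loss() as n_tokens -> infinity: the data term B / D^beta vanishes."""
    return E + A / n_params**alpha


N_SMALL, N_LARGE = 4e6, 4e9  # parameter counts at the two ends of the OLMo sweep

floor_small = infinite_data_floor(N_SMALL)
floor_large = infinite_data_floor(N_LARGE)

# Loss levels between the two floors are reachable only by the large model,
# no matter how much data the small model sees.
unreachable_gap = floor_small - floor_large
print(f"small floor={floor_small:.3f}  large floor={floor_large:.3f}  gap={unreachable_gap:.3f}")
```

Under this form, adding data only shrinks the B/D^beta term, so the small model's loss approaches its own floor and never the large model's.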

Synthetic Experiments: A Three‑Act Play

To probe the mechanism, the authors construct a controllable multi‑task linear‑regression setting with K orthogonal feature blocks, task frequencies following a power‑law, and task difficulty controlled by spectral decay.
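
A minimal sketch of such a data generator is shown below, assuming NumPy. The exact construction in the paper is not reproduced here: the function names, the default exponents, and the choice to give frequent tasks faster spectral decay are all assumptions made for illustration.

```python
import numpy as np


def make_task_family(K=4, block_dim=8, freq_exponent=1.5, decay_exponents=None):
    """Return power-law task frequencies and a per-task feature spectrum."""
    ks = np.arange(1, K + 1, dtype=float)
    freqs = ks ** -freq_exponent          # task k is sampled with prob ~ k^-freq_exponent
    freqs /= freqs.sum()
    if decay_exponents is None:
        # Assumption: frequent tasks decay fast (simple), rare tasks decay slowly (complex).
        decay_exponents = np.linspace(2.0, 0.5, K)
    js = np.arange(1, block_dim + 1, dtype=float)
    spectra = js[None, :] ** -np.asarray(decay_exponents, dtype=float)[:, None]
    return freqs, spectra                 # shapes (K,), (K, block_dim)


def sample_batch(rng, freqs, spectra, teachers, batch_size=256):
    """Sample (X, y, task_ids); each task only activates its own coordinate block."""
    K, d = spectra.shape
    task_ids = rng.choice(K, size=batch_size, p=freqs)
    # Feature j of task k has variance spectra[k, j].
    z = rng.standard_normal((batch_size, d)) * np.sqrt(spectra[task_ids])
    X = np.zeros((batch_size, K * d))
    for i, k in enumerate(task_ids):
        X[i, k * d:(k + 1) * d] = z[i]    # disjoint blocks are mutually orthogonal
    y = np.einsum("bd,bd->b", z, teachers[task_ids])  # linear teacher per task
    return X, y, task_ids


rng = np.random.default_rng(0)
freqs, spectra = make_task_family()
teachers = rng.standard_normal(spectra.shape)
X, y, task_ids = sample_batch(rng, freqs, spectra, teachers, batch_size=64)
```

Placing each task in its own coordinate block is one simple way to make the feature blocks orthogonal; any orthogonal rotation of the input space would serve the same purpose.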

Act 1 – Feature‑Utility Ordering (Theorem 3)

Theorem 3 states that models learn features in order of decreasing utility. Consequently, high‑frequency (large‑scale) and simple (fast‑decaying) tasks are learned first, while rare and complex tasks require sufficient model width to be represented.
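
The ordering can be made concrete with a toy calculation. The sketch below assumes, purely for illustration, that a feature's utility is its task frequency times its eigenvalue and that a model of width N keeps the N highest-utility features; the paper's precise definition of utility may differ, and `features_kept_per_task` is a hypothetical helper.

```python
import numpy as np


def features_kept_per_task(freqs, spectra, width):
    """Count, per task, how many features a width-`width` model keeps
    when it greedily keeps the highest-utility features first."""
    utility = freqs[:, None] * spectra                # assumed utility: frequency * eigenvalue
    top = np.argsort(utility, axis=None)[::-1][:width]
    task_of_feature = top // spectra.shape[1]         # flat index -> task index
    return np.bincount(task_of_feature, minlength=len(freqs))


# Three tasks with power-law frequencies, four features each with 1/j eigenvalues.
freqs = np.arange(1, 4, dtype=float) ** -1.5
freqs /= freqs.sum()
spectra = np.tile(1.0 / np.arange(1, 5, dtype=float), (3, 1))

for width in (2, 6, 12):
    print(width, features_kept_per_task(freqs, spectra, width).tolist())
```

In this toy setting the narrowest model spends its whole budget on the most frequent task, and the rarest task only receives features once the width grows.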

Figure 2 shows that the observed loss trajectory matches the utility‑based theoretical prediction.

Act 2 – Resource Competition and Residual Control (Theorem 4 & Corollary 5)

When the model is wide enough, the covariance of frequent tasks is fully explained, so their residual signal becomes tiny. This frees capacity for rare tasks, with two consequences:

Gradient updates for frequent tasks become weak.

Remaining neurons are allocated to rare tasks.

Figure 3 visualizes how residual control enables rare‑task learning as the model grows.
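
The residual-control argument rests on a basic property of squared loss: the gradient is proportional to the residual, so a fully explained task stops contributing updates. The two-task sketch below illustrates that property only; it is not the construction behind Theorem 4, and the task probabilities, dimensions and helper names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
X = rng.standard_normal((2048, d))
w_freq_true = rng.standard_normal(d)   # teacher for the frequent task
w_rare_true = rng.standard_normal(d)   # teacher for the rare task
P_FREQ, P_RARE = 0.95, 0.05            # hypothetical sampling probabilities


def task_grad_norm(w, w_true):
    """Norm of the squared-loss gradient, which is linear in the residual."""
    residual = X @ (w - w_true)
    return np.linalg.norm(2.0 * X.T @ residual / len(X))


def rare_share(w_freq, w_rare):
    """Fraction of the frequency-weighted gradient signal that comes from the rare task."""
    g_freq = P_FREQ * task_grad_norm(w_freq, w_freq_true)
    g_rare = P_RARE * task_grad_norm(w_rare, w_rare_true)
    return g_rare / (g_freq + g_rare)


# Narrow model: only half of the frequent task's features are fitted.
half_fit = w_freq_true.copy()
half_fit[d // 2:] = 0.0
share_narrow = rare_share(half_fit, np.zeros(d))

# Wide model: the frequent task is fully explained, so its residual is zero.
share_wide = rare_share(w_freq_true.copy(), np.zeros(d))

print(f"rare-task share of gradient: narrow={share_narrow:.2f}, wide={share_wide:.2f}")
```

Once the frequent task's residual is zero, the rare task accounts for the entire gradient signal even though it is sampled only 5% of the time in this toy setup.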

Act 3 – Memory Retention vs. Update‑Forget Cycle (Proposition 6)

Proposition 6 highlights catastrophic interference: even when rare tasks receive occasional updates, small models quickly overwrite them with gradients from common tasks, leading to an update‑forget loop. Large models retain the injected signal and accumulate learning.

Figure 4 compares a small model (N=32) that loses the rare‑task signal after each injection with a large model (N=256) that preserves and builds upon it.
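
The update-forget loop can be caricatured with a one-line recurrence: each frequent-task step multiplies the stored rare-task signal by a retention factor, and each rare-task example adds a fixed increment. This is a deliberately crude toy, not the model analyzed in Proposition 6; the retention values standing in for a "small" and a "large" model are hypothetical.

```python
def rare_signal_trace(retention, period, steps, inject=1.0):
    """Track a scalar rare-task signal under periodic injection and per-step decay."""
    signal, trace = 0.0, []
    for t in range(steps):
        signal *= retention        # interference from a frequent-task update
        if t % period == 0:
            signal += inject       # a rare-task example arrives
        trace.append(signal)
    return trace


# Assumption: low retention stands in for a small model, high retention for a large one.
small = rare_signal_trace(retention=0.5, period=10, steps=200)
large = rare_signal_trace(retention=0.99, period=10, steps=200)

print(f"small model: peak={max(small):.3f}, final={small[-1]:.4f}")
print(f"large model: final={large[-1]:.2f}")
```

With low retention the signal is almost entirely erased before the next rare example arrives, so it never builds up; with high retention successive injections stack.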

OLMo Pre‑training Validation: From Toy to Real LLM

The authors test the hypothesis on the OLMo architecture (4 M → 4 B parameters) using the Dolma v1.7 corpus with two controlled tasks:

T_CMP: compare the numeric values of two tokens.

T_ADD: modulo‑100 addition (a classic grokking task).
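
To make the two tasks concrete, the sketch below generates examples of each. The prompt strings, the value range for T_CMP and the function names are hypothetical; the summary does not specify how the tasks were serialized into the training corpus.

```python
import random


def make_cmp_example(rng, max_value=99):
    """T_CMP-style example: compare two integers (prompt format is an assumption)."""
    a, b = rng.randint(0, max_value), rng.randint(0, max_value)
    label = ">" if a > b else "<" if a < b else "="
    return f"{a} ? {b}", label


def make_add_example(rng, modulus=100):
    """T_ADD-style example: addition modulo 100 (prompt format is an assumption)."""
    a, b = rng.randrange(modulus), rng.randrange(modulus)
    return f"{a} + {b} =", str((a + b) % modulus)


rng = random.Random(0)
cmp_examples = [make_cmp_example(rng) for _ in range(3)]
add_examples = [make_add_example(rng) for _ in range(3)]
print(cmp_examples)
print(add_examples)
```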

Results:

Small models (4 M, 20 M) fail on rare tasks, performing near random.

Large models (300 M, 1 B, 4 B) succeed, with accuracy improving as scale increases.

Tasks are learned in order of decreasing frequency.

Figures 5 and 6 present behavioral evidence of this scaling effect.

Representation Evidence

Using Distributed Alignment Search, the authors locate task‑specific features: T_CMP aligns with a 1‑dimensional subspace in the first residual stream, while T_ADD aligns with Fourier‑mode features. Larger models embed more of these features, and the amount of embedded feature correlates strongly with test accuracy (Figure 7).
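
Fourier-mode features are a natural fit for modular addition because multiplying two unit-circle embeddings adds their angles, which wraps around exactly as addition modulo 100 does. The sketch below demonstrates only this mathematical identity; it is not a reconstruction of the circuit found in the OLMo models, and the function names are hypothetical.

```python
import numpy as np

P = 100  # modulus used by T_ADD


def fourier_embed(a, k=1):
    """Map an integer to the k-th Fourier mode on the unit circle."""
    return np.exp(2j * np.pi * k * a / P)


def add_via_fourier(a, b):
    """Recover (a + b) mod P from the product of the two embeddings."""
    z = fourier_embed(a) * fourier_embed(b)   # angles add: 2*pi*(a + b)/P
    angle = np.angle(z) % (2 * np.pi)         # move the angle into [0, 2*pi)
    return int(round(angle * P / (2 * np.pi))) % P


print(add_via_fourier(73, 58), (73 + 58) % P)
```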

Gradient Evidence

Analysis of the first MLP layer shows that in large models the cosine similarity between batch gradients and the task direction is tightly concentrated (0.08 ± 0.02), and gradients of non‑task tokens are nearly orthogonal to that direction, indicating minimal interference. In small models the similarity is far noisier (0.10 ± 0.09), consistent with random gradient collisions in which common‑task updates overwrite rare‑task signals (Figures 8 and 9).
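
The measurement behind these numbers is a cosine similarity between each batch gradient and a fixed task direction, summarized as mean ± standard deviation. A minimal NumPy sketch of that statistic follows; how the paper extracts the gradients and the task direction is not described here, so the inputs are random stand-ins and the helper names are hypothetical.

```python
import numpy as np


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two (flattened) vectors."""
    u, v = np.ravel(u), np.ravel(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))


def alignment_stats(batch_grads, task_direction):
    """Mean and standard deviation of gradient/task-direction alignment across batches."""
    sims = np.array([cosine_similarity(g, task_direction) for g in batch_grads])
    return sims.mean(), sims.std()


# Stand-in data: random gradients in a high-dimensional space are nearly
# orthogonal to any fixed direction, so the mean similarity is close to zero.
rng = np.random.default_rng(0)
task_direction = rng.standard_normal(4096)
batch_grads = rng.standard_normal((32, 4096))
mean_sim, std_sim = alignment_stats(batch_grads, task_direction)
print(f"{mean_sim:.3f} ± {std_sim:.3f}")
```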

Core Hypothesis: Scaling Reduces Interference

In identical training settings, larger models learn the tail of the data distribution more effectively. When a rare task is observed, the large model retains part of the update and accumulates it on subsequent observations, whereas the small model’s parameters are quickly dominated by frequent‑task updates, creating an "update‑forget" cycle.

This leads to two practical insights:

Memorization is beneficial for rare tasks; retaining training instances is essential for generalization.

Designing data mixtures that increase the frequency of target rare tasks can be more efficient than blindly scaling model size.
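
One common way to raise the frequency of rare tasks is temperature-based reweighting of the sampling distribution. This is a generic technique offered as an illustration, not a recipe from the paper; the starting frequencies and the temperature value are assumptions.

```python
import numpy as np


def reweight_mixture(freqs, temperature=2.0):
    """Flatten a task-frequency distribution: p_k proportional to freq_k^(1/temperature).

    temperature=1 keeps the original mixture; larger values upweight rare tasks.
    """
    freqs = np.asarray(freqs, dtype=float)
    flattened = freqs ** (1.0 / temperature)
    return flattened / flattened.sum()


original = np.array([0.90, 0.09, 0.01])   # hypothetical task frequencies
reweighted = reweight_mixture(original, temperature=2.0)

# The expected number of samples between two rare-task examples is 1 / p_rare.
print("original gap:", 1.0 / original[-1])
print("reweighted gap:", 1.0 / reweighted[-1])
```

Shortening the gap between rare-task examples gives a small model fewer intervening frequent-task updates, which is the regime where the update-forget cycle is least damaging.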

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare‑Task Retention
https://arxiv.org/abs/2605.29548


