Why We Should Be Cautious About Scaling Laws in Deep Learning

The article reviews the history, theory, and empirical findings of scaling laws for neural language models, compares the Kaplan and Chinchilla formulations, discusses data‑limited regimes and fitting subtleties, and highlights why careful interpretation and resource allocation are essential for reliable predictions.

Machine Learning Algorithms & Natural Language Processing

Jun 27, 2026

Scaling laws are among the most important empirical discoveries in deep learning, describing how training loss predictably decreases as model size N, dataset size D, and compute C grow, typically following a power‑law curve visible as a straight line on log‑log plots.
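The straight-line claim can be checked directly: a power law y = a · x^(−b) becomes log y = log a − b · log x, so an ordinary least-squares line through the logged points recovers the exponent as the negated slope. The sketch below is a minimal illustration on synthetic data; the function name `fit_power_law` and the constants 5.0 and 0.08 are made up for the example and are not taken from any of the cited papers.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns (a, b) for y ~ a * x**(-b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic "model size vs. loss" points generated from an exact power law.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.08 for n in sizes]

a, b = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Real loss curves are noisy and usually flatten toward an irreducible floor, so a clean recovery like this one should not be expected from measured data.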

Early predictability – Before scaling laws became mainstream, Amari et al. (1992) identified four types of learning curves, distinguished by whether the learner is deterministic or stochastic and whether the training signal is noiseless or noisy. All four obey power‑law relationships, laying the groundwork for later empirical work.

Empirical evidence – Hestness et al. (2017) showed that across tasks (machine translation, image classification, language modeling, speech recognition) the generalization error scales as a power of data size, with model improvements shifting the curve vertically but leaving the exponent largely unchanged. Put differently, on a log‑log plot architecture changes move the intercept E of the fitted line but not its slope, which is set by the exponent α.

Kaplan et al. (2020) scaling law – Kaplan et al. formalized scaling laws for Transformer language models, showing that the loss L falls as a power law in each of N, D, and C when the other two are not the bottleneck. Key findings include:

Loss scales jointly with N, D, and C; all three must grow together for optimal performance.

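Training curves follow predictable power laws whose parameters are largely independent of model size, so the early part of a run can be extrapolated to estimate the loss of a longer one.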

Larger models are more sample‑efficient, needing fewer tokens to reach a target loss.

Architectural details matter less than sheer scale.

Under a fixed compute budget, it is more efficient to train a very large model and stop well short of convergence than to train a smaller model to convergence. Chinchilla later challenged this allocation, arguing for smaller models trained on more tokens.
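For reference, the single-variable laws summarized in the findings above are commonly written in the following form, each holding when the other two quantities are not limiting. The constants N_c, D_c, C_c and the exponents α_N, α_D, α_C are fitted values; these symbol names follow common usage and are not given in this summary.

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}

% Taking logs shows the straight line on a log-log plot, with slope -\alpha_N:
\log L(N) = \alpha_N \log N_c - \alpha_N \log N
```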

Chinchilla (Hoffmann et al., 2022) – Re‑examines the optimal allocation of compute under a fixed budget C and proposes that model size and token count should grow in equal proportion: each scales roughly as the square root of C, so every doubling of model size should be matched by a doubling of training tokens. It fits scaling laws using three complementary methods (fixed model sizes trained on varying numbers of tokens, IsoFLOP curves, and direct fitting of a parametric loss function) and finds that many earlier large models were under‑trained for their compute budget.
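The allocation rule can be made concrete with a small sketch. It rests on two assumptions that do not appear in this summary: the widely used approximation C ≈ 6 · N · D for training FLOPs, and a fixed tokens-per-parameter ratio (20 is an often-quoted rule of thumb associated with Chinchilla, used here only as a default). The function name `compute_optimal_split` is hypothetical.

```python
import math

def compute_optimal_split(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens).

    Assumes C = 6 * N * D and D = tokens_per_param * N; both are
    rule-of-thumb approximations, not exact results.
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n1, d1 = compute_optimal_split(1e21)
n2, d2 = compute_optimal_split(4e21)  # four times the compute

# Under these assumptions both N and D scale as sqrt(C), so
# quadrupling compute doubles each of them.
print(f"params ratio: {n2 / n1:.2f}, tokens ratio: {d2 / d1:.2f}")
```

The point of the sketch is the square-root relationship: under these assumptions, doubling both model size and token count costs four times the compute, not two.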

Reconciling Kaplan and Chinchilla – The divergence stems mainly from (1) Kaplan’s experiments covering relatively small models while Chinchilla’s covered a larger‑scale regime, and (2) the treatment of embedding parameters: Kaplan counted non‑embedding parameters while Chinchilla counted total parameters, and embeddings dominate the count for small models. Pearce & Song (2024) propose a smooth transition between non‑embedding and total parameter counts, yielding a unified expression that matches both regimes.

Data‑limited scaling – When high‑quality, deduplicated data become scarce, the assumptions behind the classic scaling laws break down. Studies (Hernandez et al., 2022; Muennighoff et al., 2023) show that repeated data can cause a “double‑dip” in loss and that the effective token count D' should be modeled with diminishing returns, so that each additional repetition of the data contributes less than the one before. Lovelace et al. (2026) introduce an explicit over‑fitting penalty based on the capacity ratio N/D_U, demonstrating that strong weight decay mitigates the penalty.
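A minimal sketch of the diminishing-returns idea follows, using an exponential-saturation form in the spirit of Muennighoff et al. (2023). The function name `effective_tokens`, the parameter name `r_star`, and its default of 15.0 are illustrative assumptions, not values taken from this article; in practice the constant would be fitted to data.

```python
import math

def effective_tokens(unique_tokens, epochs, r_star=15.0):
    """Effective token count when the same data is seen for several epochs.

    The first epoch counts in full; each further repetition adds less,
    saturating at unique_tokens * (1 + r_star). r_star is a hypothetical
    placeholder for a fitted constant.
    """
    repeats = max(epochs - 1.0, 0.0)
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

unique = 1e9
for epochs in (1, 2, 4, 16, 64):
    ratio = effective_tokens(unique, epochs) / unique
    print(f"{epochs:>3} epochs -> worth {ratio:.2f}x the unique data")
```

With this form a few epochs are worth almost as much as fresh data, while many epochs approach a ceiling, which is the qualitative behavior described above.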

Practical fitting subtleties – Scaling‑law fits are highly sensitive to seemingly minor choices: parameter counting conventions, loss rounding, aggregation (sum vs. mean), and the range of models used for fitting. Small differences can lead to large prediction errors when extrapolating to orders of magnitude larger models, as illustrated by the discrepancy between Kaplan and Chinchilla.
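The sensitivity to the fitting range can be reproduced with a toy experiment. The sketch below assumes a synthetic ground truth with an irreducible floor, L(N) = 1.7 + 400 · N^(−0.34) (constants chosen only for illustration), fits a pure power law to two different ranges of model sizes, and extrapolates each fit to a much larger model. All names are hypothetical.

```python
import math

def true_loss(n):
    """Synthetic ground truth with an irreducible floor (illustrative constants)."""
    return 1.7 + 400.0 * n ** -0.34

def fit_loglog(ns, losses):
    """Least-squares line in log-log space; returns (slope, intercept)."""
    lx = [math.log(n) for n in ns]
    ly = [math.log(v) for v in losses]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(fit, n):
    slope, intercept = fit
    return math.exp(intercept + slope * math.log(n))

small_range = [1e6, 1e7, 1e8]   # fit on small models only
large_range = [1e8, 1e9, 1e10]  # fit on larger models
target = 1e12                   # extrapolation target

fit_small = fit_loglog(small_range, [true_loss(n) for n in small_range])
fit_large = fit_loglog(large_range, [true_loss(n) for n in large_range])

truth = true_loss(target)
pred_small = predict(fit_small, target)
pred_large = predict(fit_large, target)
print(f"truth={truth:.3f}  small-range fit={pred_small:.3f}  large-range fit={pred_large:.3f}")
```

Both fits underestimate the loss at the target size because the pure power law ignores the floor, and the fit restricted to small models is much further off. The data here is noiseless, so the gap comes from the choice of fitting region and functional form alone.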

Simulation test – A ChatGPT‑generated interactive widget demonstrates three failure modes (loss precision, noise, and fitting region sensitivity) that can dramatically affect fitted scaling‑law parameters.

References – The article cites a comprehensive list of works ranging from early learning‑curve theory (Amari 1992) to recent data‑constrained scaling studies (Muennighoff 2023, Lovelace 2026) and replication attempts (Besiroglu 2024).
