Why We Should Be Cautious About Scaling Laws in Deep Learning

The article reviews the history, theory, and empirical findings of scaling laws for neural language models, compares the Kaplan and Chinchilla formulations, discusses data‑limited regimes and fitting subtleties, and highlights why careful interpretation and resource allocation are essential for reliable predictions.

Machine Learning Algorithms & Natural Language Processing

Scaling laws are among the most important empirical discoveries in deep learning, describing how training loss predictably decreases as model size N, dataset size D, and compute C grow, typically following a power‑law curve visible as a straight line on log‑log plots.
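
The straight-line picture can be made concrete with a minimal sketch: taking logarithms of both axes turns a power law into a line, so an ordinary least-squares fit recovers the exponent. The helper name `fit_power_law` and all constants below are made up for illustration and do not come from any of the cited papers.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha) as a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    # The slope of the log-log line is -alpha; the intercept is log(a).
    return math.exp(my - slope * mx), -slope

# Synthetic, noise-free losses from an invented power law (assumption).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [12.0 * n ** -0.08 for n in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

On clean synthetic data the fit returns the constants that generated it; real training runs add noise and curvature, which is where the cautions later in the article apply.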

Early predictability – Before scaling laws became mainstream, researchers such as Amari et al. (1992) identified four types of learning curves (among them deterministic/no-noise, deterministic/with-noise, and stochastic/with-noise cases) that all obey power-law relationships, laying groundwork for later empirical work.

Empirical evidence – Hestness et al. (2017) showed that across tasks (machine translation, image classification, language modeling, speech recognition) the generalization error scales as a power of data size. Model and architecture improvements shift the curve vertically, changing the intercept E, but leave the exponent α largely unchanged.
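
A short worked form shows why a change in intercept appears as a vertical shift. It assumes a pure power law in data size D with intercept E and exponent α; this is a simplification, since fits in practice often also include an irreducible error term.

```latex
\varepsilon(D) \approx E \, D^{-\alpha}
\quad\Longrightarrow\quad
\log \varepsilon(D) \approx \log E - \alpha \log D

% Two models with the same exponent but different intercepts E_1 and E_2:
\log \varepsilon_1(D) - \log \varepsilon_2(D) \approx \log \frac{E_1}{E_2}
```

The difference does not depend on D, so on a log-log plot the two curves are parallel lines separated by a constant offset.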

Figure: Scaling law framework

Kaplan et al. (2020) scaling law – Formalized for Transformer language models, showing loss L scales as a power of N, D, and C. Key findings include:

Loss scales jointly with N, D, and C; all three must grow together for optimal performance.

Training curves follow predictable power laws whose parameters are roughly independent of model size.

Larger models are more sample‑efficient, needing fewer tokens to reach a target loss.

Architectural details matter less than sheer scale.

Under a fixed compute budget, training a very large model and stopping well short of convergence is more compute-efficient than training a smaller model to convergence, a conclusion later challenged by Chinchilla.

Figure: Kaplan scaling law equation
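
The original figure is not recoverable here. For orientation, the functional forms usually quoted from Kaplan et al. (2020) are sketched below, written from the published paper rather than from this summary. N counts non-embedding parameters, and N_c, D_c, α_N, and α_D are fitted constants whose values are not reproduced here.

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}

L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}
```

The single-variable laws apply when the other resource is not the bottleneck; the joint form reduces to L(N) as D grows without bound and to L(D) as N does.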

Chinchilla (Hoffmann et al., 2022) – Re-examines the optimal allocation of compute under a fixed budget C, proposing that model size and token count should grow in roughly equal proportion: each scales as about C^0.5, so every doubling of model size should be matched by a doubling of training tokens. It fits scaling laws using three complementary methods (fixed model sizes trained for varying token counts, IsoFLOP curves, and direct parametric fitting) and finds that many earlier large models were under-trained.
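
A minimal sketch of what equal-proportion scaling implies, under two assumptions that are not stated in this article: the common approximation C ≈ 6·N·D for training FLOPs, and a fixed ratio of about 20 tokens per parameter (a widely repeated rule of thumb, used here only as a default). The function name `compute_optimal_split` is hypothetical.

```python
import math

def compute_optimal_split(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget between parameters N and training tokens D.

    Assumes C ~= 6 * N * D and D = tokens_per_param * N (both are
    approximations chosen for illustration, not fitted values).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

base_n, base_d = compute_optimal_split(1e21)
for factor in (2, 4):
    n, d = compute_optimal_split(factor * 1e21)
    print(f"{factor}x compute -> N x{n / base_n:.2f}, D x{d / base_d:.2f}")
```

Under these assumptions, doubling compute grows N and D by about 1.41x each, and it takes four times the compute to double both.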

Figure: Chinchilla scaling law illustration

Reconciling Kaplan and Chinchilla – The divergence stems mainly from (1) Kaplan’s experiments on relatively small models versus Chinchilla’s larger‑scale regime, and (2) treatment of embedding parameters, which dominate the parameter count for small models. Pearce & Song (2024) propose a smooth transition between non‑embedding and total parameter counts, yielding a unified expression that matches both regimes.
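
To see why the counting convention matters most at small scale, the sketch below compares embedding and non-embedding parameter counts for a GPT-style decoder. It relies on the rough rule that non-embedding parameters are about 12 · n_layer · d_model², and on a vocabulary size and context length chosen purely for illustration; none of these shapes come from the article.

```python
def parameter_counts(n_layer, d_model, vocab_size=50257, n_ctx=1024):
    """Approximate (non_embedding, embedding) parameter counts.

    non_embedding ~= 12 * n_layer * d_model**2 (attention plus a 4x-wide
    MLP, ignoring biases and layer norms). vocab_size and n_ctx are
    illustrative defaults, not values taken from the article.
    """
    non_embedding = 12 * n_layer * d_model ** 2
    embedding = (vocab_size + n_ctx) * d_model
    return non_embedding, embedding

shapes = [(2, 128), (12, 768), (48, 1600)]  # hypothetical (n_layer, d_model)
shares = []
for n_layer, d_model in shapes:
    non_emb, emb = parameter_counts(n_layer, d_model)
    shares.append(emb / (non_emb + emb))
    print(f"layers={n_layer:>2} d_model={d_model:>4} embedding share={shares[-1]:.1%}")
```

With these illustrative shapes the embedding share falls from most of the parameters in the tiny model to a few percent in the largest, so fits over small models differ sharply depending on which count is used.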

Figure: Embedding vs total parameters

Data-limited scaling – When high-quality, deduplicated data become scarce, the classic scaling law assumptions break down. Studies (Hernandez et al., 2022; Muennighoff et al., 2023) show that repeated data can cause a “double-dip” in loss and that the effective token count D' should be modeled with diminishing returns, each additional repetition contributing less than the one before. Lovelace et al. (2026) introduce an explicit over-fitting penalty based on the capacity ratio N/D_U, demonstrating that strong weight decay mitigates the penalty.
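
One way to picture diminishing returns from repetition is an exponential-saturation form in the spirit of Muennighoff et al. (2023). The sketch below is an assumption-laden illustration: the function name `effective_tokens` is hypothetical, and `r_star` is a placeholder for a constant that would be fitted to real runs.

```python
import math

def effective_tokens(unique_tokens, epochs, r_star=15.0):
    """Effective data D' after `epochs` passes over `unique_tokens`.

    D' = U + U * r_star * (1 - exp(-R / r_star)), with R = epochs - 1
    repetitions. r_star is a placeholder, not a published fit.
    """
    repeats = max(epochs - 1, 0)
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

unique = 100e9  # hypothetical corpus of 100B unique tokens
for epochs in (1, 4, 16, 64):
    d_eff = effective_tokens(unique, epochs)
    print(f"{epochs:>2} epochs: raw {epochs * unique:.2e}, effective {d_eff:.2e}")
```

In this form a single pass counts in full, a few repetitions are worth almost as much as fresh data, and the effective count saturates at U · (1 + r_star) however many epochs are run.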

Figure: Over-fitting penalty diagram

Practical fitting subtleties – Scaling‑law fits are highly sensitive to seemingly minor choices: parameter counting conventions, loss rounding, aggregation (sum vs. mean), and the range of models used for fitting. Small differences can lead to large prediction errors when extrapolating to orders of magnitude larger models, as illustrated by the discrepancy between Kaplan and Chinchilla.
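
A toy illustration of fit-range sensitivity follows. It assumes an invented ground truth with an irreducible loss floor, then fits a pure power law (no floor) to two different ranges of model size and extrapolates each fit. Every constant and name here is made up for the example.

```python
import math

def fit_loglog(xs, ys):
    """Least-squares line in log-log space; returns (a, alpha) for y = a * x**-alpha."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    return math.exp(my - slope * mx), -slope

def true_loss(n, floor=1.7, a=400.0, alpha=0.34):
    """Invented ground truth with an irreducible floor that the fit ignores."""
    return floor + a * n ** -alpha

fit_ranges = {
    "small models": [1e6, 3e6, 1e7, 3e7],
    "large models": [1e8, 3e8, 1e9, 3e9],
}
target = 1e11  # extrapolation point, well beyond both fit ranges
truth = true_loss(target)

alphas, predictions = {}, {}
for name, sizes in fit_ranges.items():
    a, alpha = fit_loglog(sizes, [true_loss(n) for n in sizes])
    alphas[name] = alpha
    predictions[name] = a * target ** -alpha
    print(f"{name}: alpha = {alpha:.3f}, predicted {predictions[name]:.2f}, true {truth:.2f}")
```

In this setup the two ranges give different exponents and both extrapolations undershoot the true loss. The fit on small models even lands below the floor of 1.7, which no amount of scale could reach.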

Simulation test – A ChatGPT‑generated interactive widget demonstrates three failure modes (loss precision, noise, and fitting region sensitivity) that can dramatically affect fitted scaling‑law parameters.
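
The widget itself is not reproduced here. As a stand-in for the loss-precision failure mode only, the sketch below fits the same synthetic power law after rounding the losses to different numbers of decimals. The data, constants, and helper name are invented, and the noise and fitting-region modes are not covered by this block.

```python
import math

def fit_loglog(xs, ys):
    """Least-squares line in log-log space; returns (a, alpha) for y = a * x**-alpha."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    return math.exp(my - slope * mx), -slope

# Invented ground truth: loss = 3.0 * N**-0.05 over a narrow range of sizes.
sizes = [1e8, 2e8, 4e8, 8e8]
exact = [3.0 * n ** -0.05 for n in sizes]

alphas = {}
for decimals in (None, 3, 2):
    # None keeps full floating-point precision; otherwise round as a report might.
    losses = exact if decimals is None else [round(v, decimals) for v in exact]
    _, alphas[decimals] = fit_loglog(sizes, losses)
    label = "full precision" if decimals is None else f"{decimals} decimals"
    print(f"{label:>14}: alpha = {alphas[decimals]:.4f}")
```

With full precision the fit recovers the generating exponent, while rounding to two decimals shifts it by several percent in this example. Because the exponent multiplies the log of the extrapolation distance, such shifts grow when predicting far beyond the fitted range.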

References – The article cites a comprehensive list of works ranging from early learning‑curve theory (Amari 1992) to recent data‑constrained scaling studies (Muennighoff 2023, Lovelace 2026) and replication attempts (Besiroglu 2024).

