Lilian Weng’s Deep Dive into Scaling Laws for Large‑Model Training

The article explains how scaling laws serve as a budget guide for training large language models, comparing Kaplan’s and Chinchilla’s findings, illustrating optimal parameter‑token trade‑offs, and highlighting the impact of data quality and duplication on model performance.

PaperAgent

Scaling laws as a budgeting tool

Training a multi‑billion‑parameter model can cost tens of millions of dollars, and the biggest risk is spending that budget on the wrong thing. Scaling laws are not a magic formula; they work like a budget table that helps decide whether to allocate compute to more parameters, more tokens, or longer training.

The basic relationship is expressed as C ≈ 6ND, where N is the number of parameters, D the number of training tokens, and C the total training compute in FLOPs. Doubling either N or D roughly doubles C. The core question is how to split a fixed compute budget between N and D.
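As a rough illustration of the C ≈ 6ND rule, the sketch below computes training compute and inverts the relation to ask how many tokens a fixed budget affords at a given model size. The function names and the example sizes are illustrative assumptions, not taken from the original article.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs using C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens


def tokens_for_budget(flops: float, n_params: float) -> float:
    """Invert C ≈ 6 * N * D: tokens affordable at a given model size."""
    return flops / (6.0 * n_params)


# Hypothetical example sizes, chosen only to show the arithmetic.
budget = training_flops(n_params=1e9, n_tokens=20e9)
print(f"1B params x 20B tokens: {budget:.2e} FLOPs")

# At the same budget, doubling the parameter count halves the affordable tokens.
print(f"Tokens at 2B params, same budget: {tokens_for_budget(budget, 2e9):.2e}")
```

This makes the trade-off explicit: at fixed C, every extra parameter is paid for with fewer training tokens.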

Lilian Weng, former OpenAI executive, Peking University graduate

Kaplan’s original scaling law

Kaplan et al. observed that language‑model loss follows a stable power‑law decline with respect to parameters, data, and compute. This gave the intuition that small‑model experiments can predict large‑model performance, allowing teams to estimate the direction of a costly pre‑training run before spending the money.
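The idea of predicting a large run from small ones can be sketched as a straight-line fit in log-log space. The example below uses synthetic data generated from a pure power law; the constants `true_a` and `true_b`, the model sizes, and the helper name `fit_power_law` are assumptions made for illustration, not values from Kaplan et al.

```python
import numpy as np


def fit_power_law(x, loss):
    """Fit loss ≈ a * x**(-b) by linear regression on (log x, log loss)."""
    slope, intercept = np.polyfit(np.log(x), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic "small model" measurements (illustrative, not real results).
sizes = np.array([1e6, 1e7, 1e8, 1e9])
true_a, true_b = 10.0, 0.08
losses = true_a * sizes ** (-true_b)

a, b = fit_power_law(sizes, losses)

# Extrapolate two orders of magnitude beyond the largest fitted size.
predicted = a * (1e11) ** (-b)
print(f"fitted a={a:.3f}, b={b:.3f}, predicted loss at 1e11 params: {predicted:.3f}")
```

On clean synthetic data the fit recovers the exponent exactly. Real measurements are noisy and may bend away from a pure power law, which is where extrapolation becomes risky.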

Kaplan scaling law plots

However, a smooth curve does not guarantee reliable extrapolation.

Transformer parameters and FLOPs estimation table

Kaplan vs. Chinchilla: different philosophies

Kaplan et al. concluded that, under a fixed compute budget, one should aggressively increase model size and stop training well before convergence, yielding the relation N_opt ∝ C^0.73. Since C ≈ 6ND, this leaves D growing only as roughly C^0.27, so parameters grow much faster than data.

Hoffmann et al. (the Chinchilla paper) revisited the problem by training over 400 models covering 70 M–16 B+ parameters and 5 B–500 B tokens, using three estimation methods (fixed‑model‑size token scaling, IsoFLOP curves, and parametric fitting). The result was N_opt ∝ C^0.5, meaning parameters and tokens should grow at roughly the same rate as compute increases.
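To see how much the exponent matters, the sketch below scales a budget by 100x under both rules. It assumes N_opt = N_ref * (C / C_ref)^a with D following from C ≈ 6ND. The shared reference point (`n_ref`, `d_ref`) is a hypothetical calibration chosen so the two rules start from the same place; the papers' actual fitted prefactors differ and are not reproduced here.

```python
def allocate(flops: float, exponent: float,
             n_ref: float = 1e9, d_ref: float = 20e9):
    """Split a compute budget using N_opt ∝ C**exponent and C ≈ 6 * N * D.

    n_ref and d_ref are a hypothetical calibration point, not fitted values.
    """
    c_ref = 6.0 * n_ref * d_ref
    n_opt = n_ref * (flops / c_ref) ** exponent
    d_opt = flops / (6.0 * n_opt)
    return n_opt, d_opt


# Scale the reference budget by 100x and compare the two exponents.
budget = 100 * 6.0 * 1e9 * 20e9
for label, exponent in [("Kaplan, a=0.73", 0.73), ("Chinchilla, a=0.5", 0.5)]:
    n_opt, d_opt = allocate(budget, exponent)
    print(f"{label}: N ≈ {n_opt:.2e} params, D ≈ {d_opt:.2e} tokens")
```

With a = 0.5 both N and D grow 10x. With a = 0.73 the model grows by almost 29x while the token count grows by less than 4x, which is the "big model, short training" recipe.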

Chinchilla three fitting methods

The most cited example compares Gopher (280 B parameters, 300 B tokens) with Chinchilla (70 B parameters, 1.4 T tokens). At comparable compute, Chinchilla outperforms Gopher, showing that many early large models were undertrained—lacking sufficient token exposure rather than lacking parameters.
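The "comparable compute" claim can be checked with the C ≈ 6ND approximation from earlier in the article, using only the parameter and token counts quoted above. The helper name `summarize` is an assumption for illustration, and the FLOP figures are approximations from the formula, not numbers reported by either paper.

```python
def summarize(n_params: float, n_tokens: float):
    """Return (approximate training FLOPs via 6*N*D, tokens per parameter)."""
    return 6.0 * n_params * n_tokens, n_tokens / n_params


# Parameter and token counts as quoted in the text.
models = {
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (n_params, n_tokens) in models.items():
    flops, ratio = summarize(n_params, n_tokens)
    print(f"{name}: C ≈ {flops:.2e} FLOPs, tokens per parameter ≈ {ratio:.1f}")
# Gopher: C ≈ 5.04e+23 FLOPs, tokens per parameter ≈ 1.1
# Chinchilla: C ≈ 5.88e+23 FLOPs, tokens per parameter ≈ 20.0
```

The two budgets land within about 17% of each other, while the tokens-per-parameter ratio differs by a factor of roughly 19.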

Undertrained models under Chinchilla rule

The key message of scaling laws is not that bigger is always better, but that under a fixed budget, getting the proportion of parameters to tokens right matters more than maximizing either one.

Why the conclusions differ

Three reasons are highlighted:

Kaplan extrapolated from relatively small models; extrapolation error grows with distance.

Different parameter counting: Kaplan used non‑embedding parameters, while Chinchilla counted total parameters, a discrepancy that matters more for smaller models.

Fitting details matter: choices such as Huber‑loss aggregation, early‑stop criteria, and the numerical precision of the fitted exponents alpha and beta can shift the fitted curve.
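The parameter-counting point in the list above can be made concrete with a back-of-the-envelope estimate. The sketch assumes the common approximation of about 12 * n_layer * d_model^2 non-embedding parameters (feed-forward width of 4 * d_model) and counts only token embeddings as vocab_size * d_model. Both model configurations and the vocabulary size are hypothetical.

```python
def non_embedding_params(n_layer: int, d_model: int) -> int:
    """Approximate attention + MLP parameters: 12 * n_layer * d_model**2."""
    return 12 * n_layer * d_model ** 2


def embedding_params(vocab_size: int, d_model: int) -> int:
    """Token-embedding parameters only (positional embeddings ignored)."""
    return vocab_size * d_model


def embedding_share(n_layer: int, d_model: int, vocab_size: int) -> float:
    """Fraction of total parameters that sit in the token embeddings."""
    emb = embedding_params(vocab_size, d_model)
    return emb / (emb + non_embedding_params(n_layer, d_model))


VOCAB = 50_000  # hypothetical vocabulary size

# Hypothetical small and large configurations: (n_layer, d_model).
for label, (n_layer, d_model) in {"small": (6, 512), "large": (80, 8192)}.items():
    share = embedding_share(n_layer, d_model, VOCAB)
    print(f"{label}: embeddings are {share:.1%} of total parameters")
```

In this sketch the embeddings make up more than half of the small model but well under 1% of the large one, so the two counting conventions diverge most in exactly the small-model regime used for fitting.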

Thus, a scaling law may look like a smooth line, but it rests on a whole suite of measurement and fitting choices; unstable measurements can give a false sense of certainty.
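To make one of those fitting choices concrete, the sketch below compares a squared-error penalty with a Huber penalty on a small set of residuals containing one outlier. The residual values and the threshold `delta` are illustrative assumptions and do not reproduce any paper's fitting setup.

```python
import numpy as np


def huber(residual, delta: float = 1.0):
    """Huber penalty: quadratic for |r| <= delta, linear beyond it."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))


# Illustrative residuals (observed minus predicted log-loss); the last is an outlier.
residuals = np.array([0.01, -0.02, 0.5])
delta = 0.05  # illustrative threshold

squared = 0.5 * residuals ** 2
robust = huber(residuals, delta)

print("squared-error contributions:", squared)
print("Huber contributions:        ", robust)
```

The two penalties agree on the small residuals, but the outlier's contribution grows only linearly under Huber. Which points count as outliers depends on `delta`, so this one setting can move the fitted curve.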

Data scarcity and token quality

Classic scaling laws assume abundant “fresh” tokens, but high‑quality data are becoming scarce while duplication rises. A trillion unique tokens differ fundamentally from a trillion repeated tokens.

Muennighoff et al. separate total tokens into unique and repeated data, treating repeated tokens as having diminishing marginal value—much like rereading a book provides less new information each time.
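The diminishing value of repeated tokens can be sketched with a saturating-exponential discount, in the spirit of that data-constrained formulation. The exact functional form used here, the default `r_star` constant, and the function name are assumptions for illustration; they should not be read as the paper's fitted values.

```python
import math


def effective_tokens(unique_tokens: float, total_tokens: float,
                     r_star: float = 15.0) -> float:
    """Discount repeated tokens with a saturating exponential.

    r_star (illustrative) sets how quickly extra epochs stop adding value.
    """
    if total_tokens <= unique_tokens:
        return total_tokens  # nothing is repeated
    repeats = total_tokens / unique_tokens - 1.0  # passes beyond the first
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))


unique = 100e9  # hypothetical pool of unique tokens
for epochs in (1, 2, 4, 16, 64):
    eff = effective_tokens(unique, unique * epochs)
    print(f"{epochs:>2} epochs: {unique * epochs:.1e} seen, {eff:.2e} effective")
```

Under these assumptions the first few repeats are worth almost as much as fresh data, while the effective token count flattens out toward a ceiling of (1 + r_star) times the unique pool.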

Data-constrained scaling

Lovelace et al. link repetition to overfitting: when a model is much larger than the amount of unique data and sees many repeats, the overfitting penalty rises, meaning larger models are not necessarily safer in a high‑duplication regime.

Repetition overfit residuals

Modern large‑model training pipelines involve filtering, deduplication, quality scoring, safety processing, copyright handling, benchmark de‑contamination, and weighted data mixtures. Applying scaling laws without accounting for token quality and duplication leaves out some of the most critical variables.

Original source: Scaling Laws, Carefully
https://lilianweng.github.io/posts/2026-06-24-scaling-laws/