Compression as a Measure of Intelligence in Large Language Models
The article argues that a large language model's ability to compress data through next‑token prediction reflects its intelligence, reviews theoretical and empirical evidence linking compression efficiency to model scale, and proposes a circuit‑competition framework to explain emergent capabilities, in‑context learning, and fine‑tuning effects.
This article examines the debate over whether GPT‑4 (and similar LLMs) merely learn shallow word‑co‑occurrence statistics or possess genuine intelligence, using the metaphor of the "octopus test" to frame the two opposing viewpoints.
It explains that modern LLMs are trained with the Next Token Prediction (NTP) objective, which can be interpreted as a data‑compression task: the model predicts the next token based on preceding context, effectively learning a probabilistic representation of the data.
Through a detailed thought experiment, two agents ("Xiao Shuai" and "Xiao Mei") compress a dataset D using a GPT model. The encoder runs NTP to obtain a probability distribution P_i for each token x_i, then applies arithmetic coding (AC) to produce a lossless compressed code z_i; the decoder reconstructs x_i by reproducing P_i with an identical model and decoding z_i. The process is illustrated with the following pseudo‑code.
# Shared setup: both sides hold an identical GPT model f and initialization
# Encoder side
P_i  = NextTokenPrediction(f, x_{1..i-1})
z_i  = ArithmeticEncode(x_i, P_i)
# Decoder side
P_i' = NextTokenPrediction(f, x_{1..i-1})
x_i  = ArithmeticDecode(z_i, P_i')

The article shows that the number of bits required to encode each token is −log₂(P_i(x_i)), which is exactly the cross‑entropy loss for that token; thus, compression efficiency directly mirrors the model's predictive loss and can serve as a proxy for intelligence.
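The encode/decode loop can be sketched end to end. Below is a minimal, exact arithmetic coder over `Fraction` intervals, with a toy two-symbol bigram model standing in for the shared GPT `f` (the model and its probabilities are invented for illustration, not taken from the article); because encoder and decoder run the identical model, every token is recovered losslessly, and the final interval width `high - low` costs about −log₂(high − low) bits, i.e. the sum of −log₂ P_i(x_i).

```python
from fractions import Fraction

def model(context):
    """Toy stand-in for the shared GPT f: distribution over the vocabulary
    given the context so far, as (symbol, probability) pairs.
    Both encoder and decoder call this identically, mirroring P_i = P_i'."""
    if context and context[-1] == "a":
        return [("a", Fraction(1, 4)), ("b", Fraction(3, 4))]
    return [("a", Fraction(1, 2)), ("b", Fraction(1, 2))]

def arithmetic_encode(tokens):
    low, high = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        span, cum = high - low, Fraction(0)
        for sym, p in model(tokens[:i]):
            if sym == tok:
                # Narrow the interval to this token's slice of the distribution
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
    # Any rational inside [low, high) identifies the whole sequence
    return (low + high) / 2

def arithmetic_decode(code, n):
    tokens, low, high = [], Fraction(0), Fraction(1)
    for _ in range(n):
        span, cum = high - low, Fraction(0)
        for sym, p in model(tokens):
            lo, hi = low + span * cum, low + span * (cum + p)
            if lo <= code < hi:   # which token's subinterval contains the code?
                tokens.append(sym)
                low, high = lo, hi
                break
            cum += p
    return tokens

msg = ["a", "b", "b", "a", "a"]
z = arithmetic_encode(msg)
assert arithmetic_decode(z, len(msg)) == msg  # lossless round-trip
```

A real implementation would stream bits and renormalize instead of keeping exact fractions, but the interval-narrowing logic is the same.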
Empirical evidence is presented: scaling curves for LLaMA models of various sizes demonstrate that larger models achieve lower loss (higher compression) on the same data, surpassing traditional compression benchmarks such as the Hutter Prize. This supports the claim that stronger compression correlates with higher AGI‑like capability.
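The correspondence behind those scaling curves is a one-line conversion from predictive loss to compressed size. A minimal sketch (the loss values and tokens-per-byte figure are made up for illustration, not LLaMA's actual numbers):

```python
from math import log

def compression_ratio(nll_nats_per_token, tokens_per_byte):
    """Fraction of the original file size when coding with the model,
    versus the 8 bits/byte of the uncompressed text."""
    bits_per_token = nll_nats_per_token / log(2)   # nats -> bits
    bits_per_byte = bits_per_token * tokens_per_byte
    return bits_per_byte / 8

# Hypothetical losses: the larger model's lower loss compresses further.
small_model = compression_ratio(nll_nats_per_token=2.4, tokens_per_byte=0.25)
large_model = compression_ratio(nll_nats_per_token=1.8, tokens_per_byte=0.25)
assert large_model < small_model  # lower loss == stronger compression
```

This is why a scaling curve of loss versus model size can be read directly as a compression benchmark.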
The paper then surveys recent interpretability research on how Transformers store knowledge: monosemantic neurons encode single concepts, polysemantic neurons encode multiple concepts via superposition, and knowledge circuits (e.g., induction heads, greater‑than circuits) emerge through training. These findings illustrate a hierarchical knowledge structure from concrete low‑level features to abstract high‑level concepts.
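The superposition idea can be seen in a small numerical experiment: far more "features" than dimensions can coexist as nearly orthogonal random directions, so the same neuron basis can carry several concepts with limited interference. This is a standard illustration of the phenomenon, not code from the surveyed papers:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512                      # 64 dimensions, 512 features in superposition
F = rng.standard_normal((n, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)   # unit-length feature directions

# Interference: reading out feature i also picks up a little of feature j.
G = F @ F.T
cross_talk = np.abs(G - np.eye(n))  # off-diagonal overlaps
# Average overlap is roughly 1/sqrt(d), so many features fit with small noise.
assert cross_talk.mean() < 0.2
```

Polysemantic neurons are the price of this packing: each dimension participates in many feature directions at once.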
Building on these observations, the author proposes the "Circuit Competition Conjecture" (CCC), suggesting that during inference many sub‑circuits compete for activation; the winning circuit determines the model's output. This framework is used to explain emergent abilities, in‑context learning, chain‑of‑thought reasoning, and the trade‑offs introduced by domain fine‑tuning or instruction tuning.
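Under the conjecture, inference can be caricatured as a winner-take-all race between sub-circuits. A deliberately toy sketch of that framing (the circuits, scores, and re-weighting are invented for illustration; none of this code comes from the article):

```python
def induction_circuit(ctx):
    """Fires when the context looks like "A ... A", suggesting a repeat."""
    return 0.9 if len(ctx) >= 3 and ctx[-1] == ctx[0] else 0.1

def bigram_circuit(ctx):
    """Weak generic prior that is always somewhat active."""
    return 0.5

CIRCUITS = {"induction": induction_circuit, "bigram": bigram_circuit}

def winning_circuit(ctx, weights=None):
    """The most strongly activated circuit determines the output."""
    weights = weights or {name: 1.0 for name in CIRCUITS}
    scores = {name: weights[name] * c(ctx) for name, c in CIRCUITS.items()}
    return max(scores, key=scores.get)

# The repeat pattern activates the induction circuit most strongly.
assert winning_circuit(["A", "B", "C", "A"]) == "induction"
# Fine-tuning modeled as re-weighting: a boosted circuit can now out-compete it,
# sketching the trade-offs the conjecture attributes to domain tuning.
assert winning_circuit(["A", "B", "C", "A"],
                       {"induction": 0.4, "bigram": 2.0}) == "bigram"
```

The point of the caricature is only the competition dynamic: strengthening one circuit's weight can suppress behaviors carried by its rivals.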
Finally, the article reflects on LLMs as mirrors of the world, arguing that next‑token prediction enables models to reconstruct a compressed representation of human knowledge, and speculates on the philosophical implications of such models acting as generators of possible worlds.