Compression as a Measure of Intelligence in Large Language Models
The article argues that a large language model's ability to compress data through next‑token prediction reflects its intelligence, reviews theoretical and empirical evidence linking compression efficiency to model scale, and proposes a circuit‑competition framework to explain emergent capabilities, in‑context learning, and fine‑tuning effects.
This article examines the debate over whether GPT‑4 (and similar LLMs) merely learn shallow word‑co‑occurrence statistics or possess genuine intelligence, using the metaphor of the "octopus test" to frame the two opposing viewpoints.
It explains that modern LLMs are trained with the Next Token Prediction (NTP) objective, which can be interpreted as a data‑compression task: the model predicts the next token based on preceding context, effectively learning a probabilistic representation of the data.
Through a detailed thought experiment, two agents ("Xiao Shuai" and "Xiao Mei") compress a dataset D using a GPT model. The encoder runs NTP to obtain a probability distribution P_i for each token x_i, then applies arithmetic coding (AC) to produce a lossless compressed code z_i; the decoder reconstructs x_i by reproducing P_i with an identical model and decoding z_i. The process is illustrated with the following pseudo‑code.
# Shared setup: both sides hold an identical GPT model f and initialization
# Encoder side
P_i  = NextTokenPrediction(f, x_{1..i-1})
z_i  = ArithmeticEncode(x_i, P_i)
# Decoder side
P_i' = NextTokenPrediction(f, x_{1..i-1})
x_i  = ArithmeticDecode(z_i, P_i')

The article shows that the number of bits required to encode each token is −log₂(P_i(x_i)), which is exactly the cross‑entropy loss for that token; thus, compression efficiency directly mirrors the model's predictive loss and can serve as a proxy for intelligence.
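The encode/decode loop can be sketched end to end. Below is a minimal, exact arithmetic coder over `Fraction` intervals, with a toy two-symbol bigram model standing in for the shared GPT `f` (the model and its probabilities are invented for illustration, not taken from the article); because encoder and decoder run the identical model, every token is recovered losslessly, and the final interval width `high - low` costs about −log₂(high − low) bits, i.e. the sum of −log₂ P_i(x_i).

```python
from fractions import Fraction

def model(context):
    """Toy stand-in for the shared GPT f: distribution over the vocabulary
    given the context so far, as (symbol, probability) pairs.
    Both encoder and decoder call this identically, mirroring P_i = P_i'."""
    if context and context[-1] == "a":
        return [("a", Fraction(1, 4)), ("b", Fraction(3, 4))]
    return [("a", Fraction(1, 2)), ("b", Fraction(1, 2))]

def arithmetic_encode(tokens):
    low, high = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        span, cum = high - low, Fraction(0)
        for sym, p in model(tokens[:i]):
            if sym == tok:
                # Narrow the interval to this token's slice of the distribution
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
    # Any rational inside [low, high) identifies the whole sequence
    return (low + high) / 2

def arithmetic_decode(code, n):
    tokens, low, high = [], Fraction(0), Fraction(1)
    for _ in range(n):
        span, cum = high - low, Fraction(0)
        for sym, p in model(tokens):
            lo, hi = low + span * cum, low + span * (cum + p)
            if lo <= code < hi:   # which token's subinterval contains the code?
                tokens.append(sym)
                low, high = lo, hi
                break
            cum += p
    return tokens

msg = ["a", "b", "b", "a", "a"]
z = arithmetic_encode(msg)
assert arithmetic_decode(z, len(msg)) == msg  # lossless round-trip
```

A real implementation would stream bits and renormalize instead of keeping exact fractions, but the interval-narrowing logic is the same.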
Empirical evidence is presented: scaling curves for LLaMA models of various sizes demonstrate that larger models achieve lower loss (higher compression) on the same data, surpassing traditional compression benchmarks such as the Hutter Prize. This supports the claim that stronger compression correlates with higher AGI‑like capability.
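The correspondence behind those scaling curves is a one-line conversion from predictive loss to compressed size. A minimal sketch (the loss values and tokens-per-byte figure are made up for illustration, not LLaMA's actual numbers):

```python
from math import log

def compression_ratio(nll_nats_per_token, tokens_per_byte):
    """Fraction of the original file size when coding with the model,
    versus the 8 bits/byte of the uncompressed text."""
    bits_per_token = nll_nats_per_token / log(2)   # nats -> bits
    bits_per_byte = bits_per_token * tokens_per_byte
    return bits_per_byte / 8

# Hypothetical losses: the larger model's lower loss compresses further.
small_model = compression_ratio(nll_nats_per_token=2.4, tokens_per_byte=0.25)
large_model = compression_ratio(nll_nats_per_token=1.8, tokens_per_byte=0.25)
assert large_model < small_model  # lower loss == stronger compression
```

This is why a scaling curve of loss versus model size can be read directly as a compression benchmark.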
The paper then surveys recent interpretability research on how Transformers store knowledge: monosemantic neurons encode single concepts, polysemantic neurons encode multiple concepts via superposition, and knowledge circuits (e.g., induction heads, greater‑than circuits) emerge through training. These findings illustrate a hierarchical knowledge structure from concrete low‑level features to abstract high‑level concepts.
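The superposition idea can be seen in a small numerical experiment: far more "features" than dimensions can coexist as nearly orthogonal random directions, so the same neuron basis can carry several concepts with limited interference. This is a standard illustration of the phenomenon, not code from the surveyed papers:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512                      # 64 dimensions, 512 features in superposition
F = rng.standard_normal((n, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)   # unit-length feature directions

# Interference: reading out feature i also picks up a little of feature j.
G = F @ F.T
cross_talk = np.abs(G - np.eye(n))  # off-diagonal overlaps
# Average overlap is roughly 1/sqrt(d), so many features fit with small noise.
assert cross_talk.mean() < 0.2
```

Polysemantic neurons are the price of this packing: each dimension participates in many feature directions at once.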
Building on these observations, the author proposes the "Circuit Competition Conjecture" (CCC), suggesting that during inference many sub‑circuits compete for activation; the winning circuit determines the model's output. This framework is used to explain emergent abilities, in‑context learning, chain‑of‑thought reasoning, and the trade‑offs introduced by domain fine‑tuning or instruction tuning.
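Under the conjecture, inference can be caricatured as a winner-take-all race between sub-circuits. A deliberately toy sketch of that framing (the circuits, scores, and re-weighting are invented for illustration; none of this code comes from the article):

```python
def induction_circuit(ctx):
    """Fires when the context looks like "A ... A", suggesting a repeat."""
    return 0.9 if len(ctx) >= 3 and ctx[-1] == ctx[0] else 0.1

def bigram_circuit(ctx):
    """Weak generic prior that is always somewhat active."""
    return 0.5

CIRCUITS = {"induction": induction_circuit, "bigram": bigram_circuit}

def winning_circuit(ctx, weights=None):
    """The most strongly activated circuit determines the output."""
    weights = weights or {name: 1.0 for name in CIRCUITS}
    scores = {name: weights[name] * c(ctx) for name, c in CIRCUITS.items()}
    return max(scores, key=scores.get)

# The repeat pattern activates the induction circuit most strongly.
assert winning_circuit(["A", "B", "C", "A"]) == "induction"
# Fine-tuning modeled as re-weighting: a boosted circuit can now out-compete it,
# sketching the trade-offs the conjecture attributes to domain tuning.
assert winning_circuit(["A", "B", "C", "A"],
                       {"induction": 0.4, "bigram": 2.0}) == "bigram"
```

The point of the caricature is only the competition dynamic: strengthening one circuit's weight can suppress behaviors carried by its rivals.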
Finally, the article reflects on LLMs as mirrors of the world, arguing that next‑token prediction enables models to reconstruct a compressed representation of human knowledge, and speculates on the philosophical implications of such models acting as generators of possible worlds.