
How Do Large Language Models Compress Massive Data? Limits and Techniques

This article explains how large language models act like a super‑library by compressing vast amounts of text using information‑theoretic concepts, probability‑based coding, autoregressive neural networks, and arithmetic coding, while discussing accuracy, compression ratios, and theoretical limits.


During a recent AI lecture, the speaker likened large language models (LLMs) to a super‑library where asking the model is equivalent to querying a massive knowledge base. This raises questions about whether such a "library" can truly contain all world information and how accurate that information is.

Basic Concepts of Data Compression

Data compression means representing data with fewer bits. Lossless compression restores the original data exactly after decompression.

A common approach is to encode data based on its probability distribution; Huffman coding and arithmetic coding are classic examples.

Information Content and Entropy

Two fundamental notions are information content and entropy.

Information content measures the surprise of a single event. A high-probability event (e.g., the letter "E" in English text) carries little information; a rare event carries a lot.

The information content can be expressed as:

\(I = -\log_2(p)\) where \(p\) is the probability of the event.
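This formula is easy to evaluate directly. A minimal sketch in Python (the letter probabilities below are illustrative, not measured frequencies):

```python
from math import log2

def information_content(p):
    """Bits of information carried by an event with probability p."""
    return -log2(p)

# A common letter such as "E" (illustrative probability)
print(information_content(0.127))   # ≈ 2.98 bits
# A rare letter such as "Z" (illustrative probability)
print(information_content(0.001))   # ≈ 9.97 bits
```

Note how halving an event's probability adds exactly one bit of information.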

Entropy is the average information content of all possible events, indicating the average uncertainty of a system. Higher entropy means more complex, less predictable information.

For a discrete random variable, entropy is calculated as:

\(H = -\sum_{i} p_i \log_2(p_i)\) where \(p_i\) is the probability of each possible outcome.
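The entropy formula can likewise be sketched in a few lines of Python, using two coin distributions as a hedged illustration:

```python
from math import log2

def entropy(probs):
    """Shannon entropy H = -sum p_i * log2(p_i), in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: ≈ 0.47 bits
```

The biased coin is more predictable, so its average uncertainty, and thus the bits needed per outcome, is lower.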

Example of Data Compression

Consider a text with a vocabulary of 256 symbols (8‑bit encoding). If each symbol is equally likely, each requires 8 bits – the baseline transmission method.
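We can confirm that 8 bits is optimal for this baseline case: the entropy of 256 equally likely symbols is exactly log2(256) = 8 bits, so fixed-length codes cannot be beaten when the source has no structure.

```python
from math import log2

# 256 equally likely symbols: each has probability 1/256,
# and the entropy works out to exactly 8 bits per symbol.
p = 1 / 256
H = -sum(p * log2(p) for _ in range(256))
print(H)   # 8.0
```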

Autoregressive Neural Networks and Lossless Compression

Large language models such as GPT are autoregressive neural networks: they predict the next token from the tokens seen so far, which amounts to providing a probability distribution over the next symbol.

How Neural Networks Aid Compression

Traditional compression treats each data point independently with fixed‑length bits. Autoregressive networks learn the structure and regularities of data, allowing them to predict the next point and encode it more efficiently.

For example, if both parties share a trained network, they can use the predicted probability distribution to represent the next token with fewer bits.
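The saving comes directly from Shannon's formula: a symbol predicted with probability p needs only about -log2(p) bits. A minimal sketch, using hypothetical next-token distributions (the tokens and probabilities below are invented for illustration):

```python
from math import log2

def ideal_bits(p):
    """Ideal code length, in bits, for a symbol predicted with probability p."""
    return -log2(p)

# No learned structure: 256 equally likely byte values.
uniform = {tok: 1 / 256 for tok in range(256)}
# A hypothetical model that strongly predicts the next token.
confident = {"the": 0.5, "a": 0.25, "cat": 0.25}

print(ideal_bits(uniform[0]))        # 8.0 bits: the fixed-length baseline
print(ideal_bits(confident["the"]))  # 1.0 bit for a well-predicted token
```

The better the shared model predicts, the closer its probabilities get to 1 for the true next token, and the fewer bits each token costs.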

Arithmetic Coding

Arithmetic coding maps a sequence of symbols to a single sub-interval of [0, 1): each symbol narrows the current interval in proportion to its probability. Example probabilities:

- Character "0": 0.20
- Character "1": 0.25
- Character "2": 0.22
- Character "3": 0.175

The interval assigned to each character is proportional to its probability; repeated subdivision yields a binary code for the symbol sequence.
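The interval-narrowing procedure can be sketched in Python. The four probabilities above appear to come from a larger alphabet (they do not sum to 1), so this sketch uses an illustrative distribution in which character "3" absorbs the remaining probability mass:

```python
from math import ceil, log2

def arithmetic_encode(symbols, probs):
    """Encode a symbol sequence as a list of bits via interval narrowing."""
    # Assign each symbol a sub-interval of [0, 1) of width equal to its probability.
    cum, edge = {}, 0.0
    for s, p in probs.items():
        cum[s] = (edge, edge + p)
        edge += p
    # Narrow the working interval once per symbol.
    lo, hi = 0.0, 1.0
    for s in symbols:
        width = hi - lo
        s_lo, s_hi = cum[s]
        lo, hi = lo + width * s_lo, lo + width * s_hi
    # Enough bits to pin down a point inside the final interval.
    n_bits = ceil(-log2(hi - lo)) + 1
    code, x = [], (lo + hi) / 2
    for _ in range(n_bits):
        x *= 2
        bit = int(x)
        code.append(bit)
        x -= bit
    return code

probs = {"0": 0.20, "1": 0.25, "2": 0.22, "3": 0.33}  # illustrative distribution
print(arithmetic_encode(["3"], probs))  # a 3-bit code for one symbol
```

Rarer symbols leave a narrower final interval and therefore need more bits, matching the -log2(p) rule; a real implementation would also handle precision with integer arithmetic and emit a terminating symbol.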

Illustration from the book 'Revealing Large Models: From Principles to Practice'

For instance, character "3" (probability 0.175) can be encoded after several interval refinements as:

(1, 0, 1)

which uses only three bits.

Compression Ratio

Using autoregressive networks can dramatically improve compression rates. Compared to the baseline 8‑bit per symbol, the model may represent the same data with as few as 3 bits per symbol.

During training, the model minimizes the negative log-likelihood loss. Measured in bits, this loss is exactly the average code length arithmetic coding would achieve with the model's predictions, so lowering the loss is equivalent to learning a better lossless compressor of the data distribution.
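The equivalence is a one-liner: averaging -log2 of the probabilities the model assigned to the true tokens gives both the loss (in bits) and the achievable bits per symbol. A minimal sketch with made-up probabilities:

```python
from math import log2

def bits_per_symbol(true_token_probs):
    """Average -log2 p(token | context): the NLL loss in bits per token,
    which is also the lossless code length per token under arithmetic coding."""
    return sum(-log2(p) for p in true_token_probs) / len(true_token_probs)

# A poor model (near-uniform guesses) vs. a well-trained one.
print(bits_per_symbol([1 / 256, 1 / 256, 1 / 256]))  # 8.0 bits/symbol
print(bits_per_symbol([0.5, 0.9, 0.8]))              # ≈ 0.6 bits/symbol
```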

Limits of Compression

Compression has theoretical limits; as datasets grow, the achievable compression ratio approaches an asymptote. When the model predicts the next token with higher precision, compression improves, eventually reaching the theoretical maximum.

For example, the Llama model is reported to compress 5.6 TB of text to about 7.14 % of its original size: the model's code is roughly 1 MB, and the cumulative training loss corresponds to about 0.4 TB of compressed output, a substantial reduction in storage and transmission costs.
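As a quick arithmetic check on these figures (the 5.6 TB and 0.4 TB values are as reported in the article, not independently verified):

```python
# Reported figures: 5.6 TB of training text, whose cumulative training
# loss corresponds to ~0.4 TB of compressed output.
original_tb = 5.6
compressed_tb = 0.4
ratio = compressed_tb / original_tb
print(f"{ratio:.2%}")   # → 7.14% of the original size
```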

Through autoregressive networks, arithmetic coding, and continual model advances, LLMs can maintain information integrity while drastically reducing data size, and future developments are expected to push these limits even further.

Tags: AI, Large Language Models, data compression, information theory, arithmetic coding, autoregressive networks
Written by

Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
