How Does Quantization Shrink LLMs? A Deep Dive into GPTQ, GGUF, and Techniques
This article explains why large language models need quantization, describes the core concepts, classification schemes, symmetric and asymmetric methods, handling of outliers, and compares post‑training quantization (PTQ) with quantization‑aware training (QAT), while detailing popular techniques such as GPTQ, GGUF, and BitNet.
1. Qualitative Conclusions
Large language models (LLMs) often contain billions of parameters and require high‑VRAM GPUs for inference. Consequently, many research efforts focus on reducing model size through quantization, which compresses model weights from 32‑bit floating point to lower‑bit representations, cutting memory usage and inference latency.
Quantization reduces the number of bits used to represent each parameter while trying to keep the values distinguishable from one another. The process is lossy and irreversible: once values are rounded to lower precision, the discarded information cannot be recovered.
Two analogies illustrate the trade‑off: reducing image resolution (e.g., from 4K to 1080p) and weighing ingredients on a coarse kitchen scale both show how fewer units of precision mean less detail but lower cost.
2. Quantization Issues of Large LLMs
LLMs can contain tens or hundreds of billions of parameters, requiring hundreds of gigabytes of memory when stored as 32‑bit floats. For example, a 70‑billion‑parameter model stored in FP32 (4 bytes per weight) needs roughly 280 GB of memory.
Reducing the bit‑width of parameters (e.g., to INT8 or INT4) dramatically lowers storage and bandwidth requirements, but lower precision can degrade accuracy.
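As a quick sanity check on this arithmetic, a minimal sketch (the 70‑billion parameter count is illustrative):

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to store model weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# A 70-billion-parameter model stored as FP32 (32 bits per weight):
fp32_gb = model_memory_gb(70e9, 32)   # 280.0 GB
# The same model quantized to INT4:
int4_gb = model_memory_gb(70e9, 4)    # 35.0 GB
```

Moving from FP32 to INT4 divides the footprint by eight, before accounting for the small overhead of stored scale factors.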
3. What Is Quantization?
Quantization reduces the precision of model parameters from high‑bit formats (FP32) to lower‑bit formats (FP16, BF16, INT8, INT4, etc.). The goal is to keep the original model behavior as much as possible while using fewer bits.
Common data types:
FP16: 16‑bit floating point with a smaller exponent range than FP32.
BF16: 16‑bit format with the same exponent range as FP32, widely used in deep learning.
INT8: 8‑bit integer, reducing storage to one quarter of FP32.
4. From FP32 to INT8
Two main linear mapping methods are used: symmetric quantization (centered around zero) and asymmetric quantization (maps the minimum and maximum of the floating‑point range to the integer range).
Symmetric quantization computes a scale factor s = α / 127 where α is the maximum absolute value, then quantizes each value x as q = round(x / s). De‑quantization restores the original value with x̂ = q * s.
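This symmetric scheme can be sketched in a few lines of NumPy (per‑tensor scaling; the example values are illustrative):

```python
import numpy as np

def quantize_symmetric(x: np.ndarray):
    """Symmetric INT8 quantization: map [-alpha, alpha] onto [-127, 127]."""
    alpha = np.max(np.abs(x))            # largest absolute value in the tensor
    s = alpha / 127                      # scale factor
    q = np.round(x / s).astype(np.int8)  # quantized integer values
    return q, s

def dequantize_symmetric(q: np.ndarray, s: float) -> np.ndarray:
    """Restore an approximation of the original values: x_hat = q * s."""
    return q.astype(np.float32) * s

x = np.array([-1.54, 0.22, 0.97], dtype=np.float32)
q, s = quantize_symmetric(x)
x_hat = dequantize_symmetric(q, s)  # close to x, but rounding error remains
```

Note that the reconstruction is only approximate: each value can be off by up to half a quantization step (s / 2).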
Asymmetric quantization maps the observed minimum and maximum [β, α] onto the full integer range. For INT8 it computes the scale as s = (α − β) / 255 plus a zero‑point z that shifts the range, quantizing with q = round((x / s) + z) and de‑quantizing with x̂ = (q − z) · s. This lets the integer range cover data that is not centered at zero, at the cost of storing and applying the extra zero‑point.
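A minimal sketch of the asymmetric variant, following the standard INT8 convention of mapping [β, α] onto [−128, 127]:

```python
import numpy as np

def quantize_asymmetric(x: np.ndarray):
    """Asymmetric INT8 quantization: map [min(x), max(x)] onto [-128, 127]."""
    beta, alpha = float(np.min(x)), float(np.max(x))
    s = (alpha - beta) / 255            # scale covering the full observed range
    z = int(round(-128 - beta / s))     # zero-point shifts the integer range
    q = np.clip(np.round(x / s) + z, -128, 127).astype(np.int8)
    return q, s, z

def dequantize_asymmetric(q: np.ndarray, s: float, z: int) -> np.ndarray:
    """Restore an approximation of the original values: x_hat = (q - z) * s."""
    return (q.astype(np.float32) - z) * s

x = np.array([-0.6, 0.0, 1.2, 4.5], dtype=np.float32)
q, s, z = quantize_asymmetric(x)   # min maps to -128, max maps to 127
x_hat = dequantize_asymmetric(q, s, z)
```

Because the full range [β, α] is used, no integer codes are wasted on values that never occur, which matters for skewed distributions such as post‑ReLU activations.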
5. Handling Outliers
When a few values are far larger than the rest, mapping the entire range to a low‑bit representation can cause most small values to collapse into the same quantized bucket. A common remedy is to clamp the dynamic range (e.g., to [-5, 5]) so that extreme outliers are mapped to the extreme integer values, reducing error for the majority of values.
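The effect of clamping is easy to demonstrate: in the sketch below (clip value and data are illustrative), a single outlier of 50.0 would otherwise dominate the scale and collapse all the small values to zero.

```python
import numpy as np

def quantize_clipped(x: np.ndarray, clip: float = 5.0):
    """Symmetric INT8 quantization with the dynamic range clamped to [-clip, clip].
    Outliers beyond the clip value all map to -127 or 127, but the quantization
    step for the remaining values becomes much finer."""
    s = clip / 127
    q = np.clip(np.round(x / s), -127, 127).astype(np.int8)
    return q, s

# One extreme outlier (50.0) among small values:
x = np.array([0.1, -0.3, 0.7, 50.0], dtype=np.float32)

# Without clipping, the scale is dominated by the outlier (~0.39 per step),
# so 0.1 would round to 0. With clipping to [-5, 5], the step shrinks to
# ~0.039 and the small values survive with far less error:
q, s = quantize_clipped(x)
```

The outlier itself now carries a large error (it saturates at 127 · s ≈ 5), but the bulk of the distribution is represented far more accurately.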
6. PTQ with GPTQ and GGUF
GPTQ is a popular 4‑bit post‑training quantization method that quantizes the model layer by layer, using asymmetric quantization. For each layer it uses the inverse Hessian of the layer's reconstruction error to estimate how sensitive the output is to each weight, then quantizes the weights while minimizing the resulting weighted error.
GGUF enables off‑loading parts of a model to CPU memory. It groups weights into “super‑blocks” and “sub‑blocks”, quantizes each sub‑block with an associated scale factor, and stores the scale of the super‑block for higher‑precision reconstruction.
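The super‑block/sub‑block idea can be sketched as follows. The block sizes and the 6‑bit scale codes here are illustrative assumptions; real GGUF "K‑quant" formats use specific layouts (e.g., 256‑value super‑blocks) that vary by quantization type.

```python
import numpy as np

SUB_BLOCK = 16        # values per sub-block (illustrative)
SUBS_PER_SUPER = 4    # sub-blocks per super-block (illustrative)

def quantize_superblock(x: np.ndarray):
    """Block-wise sketch: each sub-block gets its own INT4-style scale, and the
    sub-block scales are stored as small codes relative to one super-block scale."""
    subs = x.reshape(SUBS_PER_SUPER, SUB_BLOCK)
    sub_scales = np.max(np.abs(subs), axis=1) / 7      # 4-bit signed range [-7, 7]
    super_scale = np.max(sub_scales)                   # stored at higher precision
    # Sub-block scales stored as 6-bit codes relative to the super scale:
    scale_codes = np.round(sub_scales / super_scale * 63).astype(np.uint8)
    q = np.round(subs / sub_scales[:, None]).astype(np.int8)
    return q, scale_codes, super_scale

def dequantize_superblock(q, scale_codes, super_scale):
    """Rebuild each sub-block's scale from its code, then rescale the integers."""
    sub_scales = scale_codes.astype(np.float32) / 63 * super_scale
    return (q * sub_scales[:, None]).ravel()
```

Per‑block scales keep the quantization error local: an outlier in one sub‑block inflates only that sub‑block's scale instead of degrading the whole tensor.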
7. QAT in Training
Quantization‑aware training (QAT) inserts fake quantization nodes during training: weights are quantized (e.g., to INT4) and immediately de‑quantized back to FP32, allowing the optimizer to see the quantization error and adjust weights accordingly.
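A minimal sketch of such a fake‑quantization node (INT4, per‑tensor scale; in a real framework a straight‑through estimator lets gradients flow through the rounding as if it were the identity):

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Fake-quantization node used in QAT: quantize weights to `bits` and
    immediately de-quantize back to FP32, so the forward pass sees the
    rounding error while the stored weights remain full precision."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for INT4
    s = np.max(np.abs(w)) / qmax          # per-tensor scale
    q = np.clip(np.round(w / s), -qmax, qmax)
    return (q * s).astype(np.float32)     # de-quantized weights seen by the loss

w = np.array([0.8, -0.31, 0.05, -0.7], dtype=np.float32)
w_hat = fake_quantize(w)       # FP32 values snapped onto the INT4 grid
quant_error = w - w_hat        # the error the optimizer learns to work around
```

Because the loss is computed on `w_hat`, gradient descent is pushed toward weight configurations that remain accurate after rounding.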
QAT typically yields higher accuracy than PTQ because the model learns to operate within the quantized space.
8. Extreme 1‑Bit Quantization – BitNet
BitNet represents weights with a single bit (‑1 or +1) and stores activations in INT8. It replaces standard linear layers with “BitLinear” layers that perform the same matrix multiplication but with binary weights, dramatically reducing memory while preserving performance for very large models (>30 B parameters).
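The forward pass of such a layer can be sketched as below. This is a simplified illustration of the idea (binarize centered weights to ±1 with a mean‑magnitude scale, quantize activations to INT8), not the exact BitLinear implementation, which also includes normalization:

```python
import numpy as np

def bitlinear_forward(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Sketch of a BitLinear-style layer: weights become {-1, +1} (sign of the
    zero-centered weights, rescaled by their mean magnitude), activations are
    quantized to INT8, and the matmul runs on the low-precision values."""
    alpha = np.mean(np.abs(w))                        # preserves average weight magnitude
    w_bin = np.where(w - np.mean(w) >= 0, 1.0, -1.0)  # 1-bit weights
    s = np.max(np.abs(x)) / 127                       # symmetric INT8 activation scale
    x_q = np.round(x / s)
    # Low-precision matmul, then rescale back to floating point:
    return (x_q @ w_bin) * s * alpha

x = np.array([[1.0, 3.0]], dtype=np.float32)          # shape (batch, in_features)
w = np.array([[0.9, -0.2], [0.4, -0.5]], dtype=np.float32)
y = bitlinear_forward(x, w)                           # shape (batch, out_features)
```

Since each weight needs only one bit plus a shared scale, the weight matrix shrinks by roughly 16x versus FP16, and the matmul reduces to additions and subtractions.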
9. Summary
In summary, quantization maps high‑precision parameters to lower‑precision representations, trading a small amount of accuracy for large savings in memory and latency. The practical toolbox spans post‑training methods (GPTQ, GGUF), quantization‑aware training, and extreme 1‑bit schemes such as BitNet.
