Overview of Recent Large Language Model Quantization Techniques
The article surveys modern post‑training quantization approaches for large language models, detailing weight‑only and activation‑aware methods such as GPTQ, AWQ, HQQ, SmoothQuant, QuIP, QuaRot, SpinQuant, QQQ, QoQ, and FP8, and compares their precision levels, algorithmic steps, accuracy‑throughput trade‑offs, and implementation considerations for efficient inference.
This article provides a comprehensive overview of post‑training quantization methods for large language models (LLMs). It begins with an introduction to model quantization, explaining why low‑precision representations (e.g., int8, int4, FP8) are essential for reducing memory footprint and inference cost of massive transformer models.
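The memory argument can be made concrete with a minimal sketch of symmetric per‑tensor int8 quantization (an illustration of the general idea, not code from the article): each 32‑bit float weight becomes one int8 value plus a single shared scale, roughly a 4× reduction in storage.

```python
# Illustrative sketch of symmetric per-tensor int8 quantization.
# Function names are my own, not from any library discussed in the article.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.0, -0.97]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Each restored value differs from the original by at most half a scale step.
```

Post‑training methods such as those surveyed below differ mainly in how they choose these scales (per tensor, per channel, or per group) and how they compensate for the rounding error.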
The main body systematically describes a series of recent quantization algorithms:
GPTQ – weight‑only, offline quantization that compensates rounding error using approximate second‑order (Hessian) information; supports 4‑8‑bit precision and quantizes quickly.
AWQ – weight‑only, offline method that uses activation statistics to protect salient weight channels; faster to apply than GPTQ with comparable accuracy.
HQQ – solves a half‑quadratic optimization for the quantization parameters; requires no calibration data and is the fastest to apply among the weight‑only methods.
SmoothQuant – activation‑aware scaling that migrates quantization difficulty from activations into weights, smoothing activation outliers and enabling joint weight‑and‑activation quantization (typically 8‑bit) with good throughput.
QuIP – weight‑only method based on orthogonal rotations and LDL decomposition, effective at 2‑bit precision.
QuaRot – combines offline rotation (Hadamard) and online rotation to eliminate activation outliers, supporting 4‑bit weight/activation quantization.
SpinQuant – learns rotation matrices via Cayley optimization rather than sampling them at random, reducing the variance in quantization results; available in offline‑only and offline‑plus‑online‑Hadamard variants.
QQQ – a two‑stage scheme that integrates adaptive smoothing and Hessian‑based weight quantization, targeting W4A8 precision.
QoQ – end‑to‑end W4A8KV4 quantization with a dedicated serving system (Qserve) that fuses kernels and manages KV‑cache efficiently.
FP8 – hardware‑supported 8‑bit floating‑point formats (E4M3, E5M2) proposed by NVIDIA together with Arm and Intel, usable for both training and inference on recent NVIDIA GPUs.
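The rotation trick shared by QuIP, QuaRot, and SpinQuant rests on a simple identity: for any orthogonal matrix Q, x·W = (x·Q)·(Qᵀ·W), so activations can be rotated to spread outliers without changing the layer's output. A minimal sketch with a 2×2 Hadamard matrix (my own illustration; the helper functions are not from any of the papers):

```python
# Computational invariance behind rotation-based quantization:
# x @ W == (x @ Q) @ (Q.T @ W) for orthogonal Q.
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

# 2x2 normalized Hadamard matrix: orthogonal, Q @ Q.T = I.
h = 1 / math.sqrt(2)
Q = [[h, h], [h, -h]]

x = [[3.0, -1.0]]               # activation row with an "outlier" entry
W = [[0.5, 2.0], [1.0, 0.0]]    # toy weight matrix

direct = matmul(x, W)
rotated = matmul(matmul(x, Q), matmul(transpose(Q), W))
# The two results agree up to floating-point error, while the rotated
# activation x @ Q has a smaller maximum magnitude than x itself.
```

In practice Qᵀ·W is folded into the weights offline, and only the cheap activation rotation (often a fast Hadamard transform) runs online.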
For each method, the article discusses quantization targets (weights, activations, KV‑cache), algorithmic steps (offline preprocessing, rotation matrices, per‑channel/group scaling, progressive group quantization), and performance metrics such as perplexity (PPL), zero‑shot accuracy, and inference throughput. Comparative tables and figures illustrate the trade‑offs between accuracy and speed across models like Llama‑2, OPT, Mistral, and others.
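As a reminder of how the headline PPL metric works (a generic illustration, not tied to any table in the article): perplexity is the exponential of the mean per‑token negative log‑likelihood, so a quantized model is usually judged by how little its PPL rises over the full‑precision baseline.

```python
# Perplexity from per-token negative log-likelihoods (in nats).
import math

def perplexity(token_nlls):
    """PPL = exp(mean NLL); lower is better."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning every token probability 1/4 has NLL ln(4) per token,
# hence perplexity exactly 4.
nlls = [math.log(4)] * 10
```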
The article also covers practical implementation details, including:
Use of orthogonal and Hadamard matrices to preserve computation invariance.
Progressive group quantization that first applies per‑channel INT8 quantization followed by per‑group INT4.
Adaptive smoothing factors for activation scaling.
Integration of custom GEMM kernels (e.g., W4A8 per‑channel and per‑group) into serving stacks such as vLLM and Qserve.
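The progressive scheme in the second bullet can be sketched as follows (function names, the group size, and the toy data are mine; this shows only the numerical idea, not the QoQ/Qserve kernels): a weight channel is first quantized to INT8 with one per‑channel scale, then the INT8 values are re‑quantized per group to INT4, so dequantizing back to INT8 at inference time needs only the cheap per‑group step.

```python
# Hedged sketch of two-level progressive group quantization:
# stage 1 per-channel INT8, stage 2 per-group INT4 on the INT8 values.

def quant(values, levels):
    """Symmetric quantization to integers in [-levels, levels]."""
    scale = max(abs(v) for v in values) / levels or 1.0
    return [max(-levels, min(levels, round(v / scale))) for v in values], scale

def progressive_quantize(channel, group_size=4):
    q8, s8 = quant(channel, 127)                # stage 1: per-channel INT8
    groups = []
    for i in range(0, len(q8), group_size):     # stage 2: per-group INT4
        block = [float(v) for v in q8[i:i + group_size]]
        groups.append(quant(block, 7))
    return groups, s8

def dequantize(groups, s8):
    out = []
    for q4, s4 in groups:
        out.extend(v * s4 * s8 for v in q4)     # INT4 -> INT8 -> float
    return out
```

The real scheme also quantizes the per‑group scales themselves, which this simplified sketch omits.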
Finally, a concise summary table lists each algorithm, its quantization objects, and key characteristics, followed by a bibliography of recent papers and technical reports.
DeWu Technology