Quantization Across Signal Processing, AI Inference, and RAG Vector Search

This article explains how quantization—originating from signal processing—reduces precision to save resources, details its application to neural network weights and activations via PTQ, QAT, GPTQ, AWQ, and SmoothQuant, and shows how vector quantization enables fast, memory‑efficient retrieval in large‑scale RAG systems.

What Quantization Is

Quantization maps continuous high‑precision values to a limited discrete range, a concept borrowed from signal processing. For example, reducing a RAW photo from 16 bits per channel to JPEG's 8 bits per channel halves the bit depth with negligible visible loss. Linear quantization uses a scale to set the grid width and a zero_point to align the real value 0.0 with an integer; for values within the representable range, the rounding error ε satisfies |ε| ≤ scale/2 and is statistically predictable when the data distribution is roughly uniform.
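
As a minimal sketch of linear quantization (plain numpy, not tied to any particular framework; the function names are illustrative), the scale and zero_point map a floating‑point range onto the INT8 grid and back:

```python
import numpy as np

def linear_quantize(x, num_bits=8):
    """Asymmetric linear quantization of a float array onto a signed integer grid."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (x.max() - x.min()) / (qmax - qmin)        # grid width
    zero_point = int(round(qmin - x.min() / scale))    # aligns real 0.0 with an integer
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(1000).astype(np.float32)
q, scale, zp = linear_quantize(x)
err = np.abs(dequantize(q, scale, zp) - x)
print(err.max(), scale / 2)                            # max error is on the order of scale/2
```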

Common numeric formats:

FP32 – 32 bit, ~4.3 billion representable values, range ±3.4×10³⁸ – training standard.

BF16 – 16 bit, 65 536 values, same exponent range as FP32 – training acceleration.

FP16 – 16 bit, range ±65 504 – inference / mixed‑precision training.

INT8 – 8 bit, 256 values, range −128 to 127 – common inference quantization.

INT4 – 4 bit, 16 values, range −8 to 7 – extreme LLM compression.

FP4 / NF4 – 4 bit, 16 values, non‑linear distribution – QLoRA fine‑tuning.

Quantization in Neural Networks

Weights are highly concentrated around zero; most of the 4.3 billion FP32 values are unnecessary. Networks are trained with noise (SGD, dropout, data augmentation) and therefore tolerate reduced precision. Converting FP32 to INT8 typically incurs less than 1 % accuracy loss.

Granularity

Per‑tensor: one scale and zero_point for the whole layer – lowest overhead, highest error.

Per‑channel: an independent scale per output channel – reduces error, widely used (see the sketch after this list).

Per‑group: groups of channels share a scale (group size 64 or 128) – balances accuracy and efficiency; the core of GPTQ and AWQ.
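
A minimal per‑channel sketch in numpy, assuming a weight matrix laid out as (output channels × input features); the helper names are illustrative:

```python
import numpy as np

def quantize_per_channel(W, num_bits=8):
    """Symmetric per-channel quantization: one scale per output channel (row of W)."""
    qmax = 2 ** (num_bits - 1) - 1
    scales = np.abs(W).max(axis=1, keepdims=True) / qmax   # shape (out_channels, 1)
    scales = np.where(scales == 0, 1.0, scales)            # guard against all-zero rows
    Wq = np.clip(np.round(W / scales), -qmax - 1, qmax).astype(np.int8)
    return Wq, scales

W = np.random.randn(64, 128).astype(np.float32)            # (out_channels, in_features)
Wq, scales = quantize_per_channel(W)
W_hat = Wq.astype(np.float32) * scales                     # de-quantized approximation
print(np.abs(W - W_hat).max(), scales.max() / 2)           # per-channel error stays near scale/2
```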

Post‑Training Quantization (PTQ)

PTQ assumes a trained model and applies quantization in four steps:

1. Prepare a small calibration dataset (hundreds to thousands of representative samples).

2. Run forward passes and record per‑layer activation minima and maxima (or use KL‑divergence to find better clipping thresholds).

3. Compute scale and zero_point from these statistics.

4. Convert the model: weights are stored quantized; activations are quantized dynamically at inference.

PTQ is fast (minutes to hours) and requires no GPU cluster, but accuracy loss can be larger for layers with complex activation distributions.
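
A hypothetical sketch of the calibration step, using a simple min/max observer rather than any specific framework's API:

```python
import numpy as np

class MinMaxObserver:
    """Tracks the running min/max of activations seen during calibration."""
    def __init__(self):
        self.lo, self.hi = np.inf, -np.inf

    def observe(self, x):
        self.lo = min(self.lo, float(x.min()))
        self.hi = max(self.hi, float(x.max()))

    def qparams(self, num_bits=8):
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
        scale = (self.hi - self.lo) / (qmax - qmin)
        zero_point = int(round(qmin - self.lo / scale))
        return scale, zero_point

# Calibration: run a few hundred representative samples through the layer
# and record the activation range (random ReLU outputs stand in for real data here).
observer = MinMaxObserver()
for _ in range(200):
    activations = np.maximum(np.random.randn(32, 512), 0)
    observer.observe(activations)

scale, zero_point = observer.qparams()
print(scale, zero_point)
```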

Quantization‑Aware Training (QAT)

QAT inserts fake‑quant nodes during training so that weights and activations are quantized and de‑quantized in the forward pass. Because rounding is non‑differentiable, QAT uses a Straight‑Through Estimator (STE) that treats the gradient of the rounding operation as 1, allowing gradients to flow through the quantization step. In practice this lets the network adapt its weights to the quantization constraints.

QAT typically keeps accuracy within about 0.1 % of the FP32 baseline, at the cost of 10–30 % extra training time.
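
A minimal fake‑quant node with a straight‑through estimator, sketched against PyTorch autograd (the class name and bit‑width choices are illustrative):

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Quantize-dequantize in the forward pass; pass gradients straight through."""

    @staticmethod
    def forward(ctx, x, scale, qmin, qmax):
        q = torch.clamp(torch.round(x / scale), qmin, qmax)
        return q * scale                                 # de-quantized value used downstream

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-Through Estimator: treat the gradient of round() as 1.
        return grad_output, None, None, None

x = torch.randn(4, 8, requires_grad=True)
scale = x.detach().abs().max() / 127
y = FakeQuantSTE.apply(x, scale, -128, 127)
y.sum().backward()
print(torch.allclose(x.grad, torch.ones_like(x)))        # gradients flow through unchanged
```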

Choosing PTQ vs QAT

Speed: PTQ is fast; QAT requires retraining.

Accuracy: PTQ INT8 loss is usually < 1 % (INT4 may be larger); QAT INT8 is almost lossless, and QAT INT4 is often acceptable.

Scenario: PTQ for quick deployment under resource constraints; QAT for high‑precision requirements.

Data requirements: PTQ needs only a few calibration samples; QAT needs the full training dataset.

Activation Quantization

Activations vary with each input, making static quantization difficult. PTQ’s calibration may misrepresent out‑of‑distribution inputs. Practitioners often combine INT8 weight quantization with INT8 or FP16 activations, or use dynamic quantization that recomputes activation ranges at inference time.
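
A sketch of dynamic activation quantization combined with pre‑quantized INT8 weights, in plain numpy; the per‑batch activation scale is recomputed on the fly and the accumulation happens in INT32:

```python
import numpy as np

def dynamic_quant_matmul(x, Wq, w_scales):
    """INT8 weights (quantized ahead of time) times dynamically quantized activations."""
    x_scale = np.abs(x).max() / 127.0                        # recomputed per batch, no calibration set
    xq = np.clip(np.round(x / x_scale), -128, 127).astype(np.int8)
    acc = xq.astype(np.int32) @ Wq.astype(np.int32).T        # integer matmul, INT32 accumulation
    return acc.astype(np.float32) * x_scale * w_scales.T     # de-quantize back to FP32

# Weights: symmetric per-output-channel INT8.
W = np.random.randn(256, 512).astype(np.float32)
w_scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
Wq = np.round(W / w_scales).astype(np.int8)

x = np.random.randn(8, 512).astype(np.float32)
print(np.abs(dynamic_quant_matmul(x, Wq, w_scales) - x @ W.T).max())
```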

LLM Quantization Techniques

Large language models contain outlier activation channels whose magnitude can be dozens to hundreds of times larger than typical values, often linked to special tokens such as [SEP] and [CLS]. Using a global scale based on these outliers compresses normal values into a few quantization bins, degrading precision.

GPTQ

GPTQ treats quantization error as a budget to be actively compensated. After quantizing a column of weights, it adjusts the remaining unquantized weights in the same layer so that their quantization offsets cancel the introduced error. Error impact is measured with a Hessian‑based metric, and a Cholesky decomposition propagates the compensation efficiently. The result is a layer‑wise error far smaller than independent per‑weight quantization.
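
A heavily simplified sketch of that compensation step in numpy; it omits the column reordering, lazy batched updates, and Cholesky‑based propagation of the actual GPTQ implementation, and the scale and shapes are illustrative:

```python
import numpy as np

def quant_sym(w, scale):
    """Symmetric INT4 quantize/de-quantize onto the grid {-8, ..., 7} * scale."""
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq_like_quantize(W, X, scale, damp=0.01):
    """Column-by-column quantization with error compensation.

    W: (out_features, in_features) weights, X: (in_features, n_samples)
    calibration activations for this layer.
    """
    W = W.copy()
    H = 2.0 * X @ X.T                                        # layer-wise Hessian of the squared error
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])     # damping for numerical stability
    Hinv = np.linalg.inv(H)

    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = quant_sym(W[:, j], scale)
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]               # error weighted by local curvature
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])       # compensate on unquantized columns
    return Q

W = np.random.randn(16, 64) * 0.1
X = np.random.randn(64, 256)
Q = gptq_like_quantize(W, X, scale=np.abs(W).max() / 7)
print(np.linalg.norm((W - Q) @ X))                           # layer output error after compensation
```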

AWQ

AWQ identifies roughly 1 % of "important" weight channels, those whose associated activations have large magnitude. Keeping these channels at higher precision would preserve accuracy but is hardware‑unfriendly, so the implementation instead scales the important channels up before quantization and scales the corresponding activations down by the same factor, preserving their relative precision without changing the network's output, while all weights are aggressively quantized to INT4.
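
A toy version of the scaling trick in numpy; the real AWQ searches for per‑channel scales that minimize output error, whereas this sketch simply applies a fixed scale to the top 1 % of channels by activation magnitude:

```python
import numpy as np

def quant_int4(W):
    """Symmetric per-output-channel INT4 quantize/de-quantize."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(W / scale), -8, 7) * scale

def awq_like(W, X, s=2.0, top_frac=0.01):
    """Scale the most activation-salient input channels of W before INT4 quantization.

    W: (out_features, in_features), X: (batch, in_features).
    """
    importance = np.abs(X).mean(axis=0)                  # per-input-channel activation magnitude
    k = max(1, int(top_frac * W.shape[1]))
    salient = np.argsort(importance)[-k:]                # indices of the ~1% most important channels

    scales = np.ones(W.shape[1], dtype=W.dtype)
    scales[salient] = s                                  # important channels get a larger share of the grid
    Wq = quant_int4(W * scales)
    return Wq, scales

W = np.random.randn(128, 512).astype(np.float32) * 0.05
X = np.random.randn(64, 512).astype(np.float32)
Wq, scales = awq_like(W, X)
# At inference the activations are divided by the same scales, so the product is unchanged.
out_q = (X / scales) @ Wq.T
print(np.abs(out_q - X @ W.T).max())
```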

SmoothQuant

SmoothQuant moves part of the activation quantization difficulty into the weights. It inserts a diagonal scaling matrix diag(s) into the matrix multiplication Y = XW:

Y = (X · diag(s)⁻¹) · (diag(s) · W)

Choosing s per input channel to balance activation and weight magnitudes (in the paper, sⱼ = max|Xⱼ|^α / max|Wⱼ|^(1−α) with α ≈ 0.5) compresses activation outliers via diag(s)⁻¹ while keeping the scaled weights diag(s)·W within a range suitable for per‑channel quantization. As a result, both activations and weights become easier to quantize statically.
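
A sketch of the smoothing transform itself (the per‑channel s and the algebraic identity), leaving out the actual INT8 kernels; shapes and α are illustrative:

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Migrate activation outliers into the weights: Y = (X / s) @ (s[:, None] * W).

    X: (tokens, in_features), W: (in_features, out_features),
    s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) per input channel j.
    """
    act_max = np.abs(X).max(axis=0)                          # per-channel activation range
    w_max = np.clip(np.abs(W).max(axis=1), 1e-8, None)       # per-channel weight range
    s = np.clip(act_max ** alpha / w_max ** (1 - alpha), 1e-5, None)
    return X / s, W * s[:, None], s

X = np.random.randn(128, 768).astype(np.float32)
X[:, 7] *= 50.0                                              # simulate one outlier channel
W = np.random.randn(768, 768).astype(np.float32) * 0.02

X_s, W_s, s = smooth(X, W)
print(np.allclose(X_s @ W_s, X @ W, rtol=1e-3, atol=1e-3))   # the product is unchanged
print(np.abs(X).max(), np.abs(X_s).max())                    # activation outliers are compressed
```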

Vector Quantization for Retrieval‑Augmented Generation

The objective is to search billions of 1536‑dim vectors in milliseconds while preserving recall.

Problem Scale

10 M documents × 1536‑dim FP32 vectors ≈ 58 GB storage.

Exact cosine similarity over 10 M such vectors costs tens of billions of FLOPs per query; a brute‑force scan takes on the order of seconds per query on a CPU, far too slow for interactive retrieval.

Scalar Quantization (SQ)

Quantize each dimension from FP32 to INT8, achieving roughly 4× memory reduction (58 GB → ≈ 14.5 GB) and 2‑4× faster SIMD integer dot‑products. Retrieval accuracy loss is negligible.
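
A sketch of per‑vector INT8 scalar quantization and integer dot products in numpy; production systems store the scales alongside the codes and usually normalize vectors first:

```python
import numpy as np

def sq8_encode(vecs):
    """Symmetric INT8 scalar quantization with one scale per vector."""
    scales = np.abs(vecs).max(axis=1, keepdims=True) / 127.0
    codes = np.round(vecs / scales).astype(np.int8)
    return codes, scales.astype(np.float32)

def sq8_dot(query, codes, scales):
    """Approximate dot products: integer arithmetic plus one float rescale per vector."""
    q_scale = np.abs(query).max() / 127.0
    q_codes = np.round(query / q_scale).astype(np.int8)
    int_dots = codes.astype(np.int32) @ q_codes.astype(np.int32)   # maps to SIMD integer kernels
    return int_dots.astype(np.float32) * scales[:, 0] * q_scale

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 1536)).astype(np.float32)            # stand-in corpus of embeddings
codes, scales = sq8_encode(db)                                     # ~4x smaller than FP32
query = rng.normal(size=1536).astype(np.float32)
approx = sq8_dot(query, codes, scales)
print(np.corrcoef(approx, db @ query)[0, 1])                       # correlation close to 1.0
```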

Binary Quantization (BQ)

Store only the sign of each dimension (0/1). A 1536‑dim vector becomes 192 bytes (32× reduction). Similarity is computed as Hamming distance using XOR followed by popcount instructions, which is extremely fast but incurs a larger accuracy loss; it is best suited to a coarse first pass.
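
A sketch of sign‑based binary codes and Hamming distance in numpy, where packbits/unpackbits stand in for the hardware XOR + popcount path:

```python
import numpy as np

def bq_encode(vecs):
    """Keep only the sign of each dimension, packed 8 dims per byte."""
    return np.packbits((vecs > 0).astype(np.uint8), axis=1)   # 1536 dims -> 192 bytes

def hamming(query_code, codes):
    """Hamming distance = popcount(XOR); unpackbits stands in for hardware popcount."""
    xor = np.bitwise_xor(codes, query_code)
    return np.unpackbits(xor, axis=1).sum(axis=1)

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 1536)).astype(np.float32)
codes = bq_encode(db)                                          # 32x smaller than FP32
query = rng.normal(size=(1, 1536)).astype(np.float32)

dist = hamming(bq_encode(query), codes)
candidates = np.argsort(dist)[:500]                            # coarse first pass; re-rank these in FP32
```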

Product Quantization (PQ)

Divide each vector into M sub‑spaces, run k‑means (typically k=256) in each sub‑space, and store only the codeword indices. A 1536‑dim vector is represented by M integers (commonly M=32), i.e., 32 bytes (≈192× compression). At query time, a distance table dist_table[m][k] is pre‑computed, and the approximate distance is:

dist(q, x) ≈ Σₘ dist_table[m][codeword_m(x)]

This reduces per‑vector computation to table look‑ups and additions, eliminating floating‑point multiplications.
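
A compact sketch of PQ encoding and table‑lookup search in numpy; real PQ trains the codebooks with k‑means (k = 256 per sub‑space), whereas here they are simply sampled from the database to keep the sketch short:

```python
import numpy as np

def pq_encode(vecs, codebooks):
    """Encode each vector as M codeword indices, one per sub-space."""
    M, K, d = codebooks.shape
    codes = np.empty((len(vecs), M), dtype=np.uint8)
    for m in range(M):
        sub = vecs[:, m * d:(m + 1) * d]
        codes[:, m] = np.argmin(((sub[:, None, :] - codebooks[m][None]) ** 2).sum(-1), axis=1)
    return codes

def pq_search(query, codes, codebooks):
    """Asymmetric distance computation: precompute dist_table, then only look up and add."""
    M, K, d = codebooks.shape
    dist_table = np.empty((M, K), dtype=np.float32)
    for m in range(M):
        dist_table[m] = ((codebooks[m] - query[m * d:(m + 1) * d]) ** 2).sum(-1)
    # dist(q, x) ~= sum_m dist_table[m][codeword_m(x)]
    return dist_table[np.arange(M), codes].sum(axis=1)

rng = np.random.default_rng(0)
M, K, dim = 32, 256, 1536                                  # 32 sub-spaces of 48 dims, 1 byte each
db = rng.normal(size=(2_000, dim)).astype(np.float32)
d = dim // M
# Real PQ trains each codebook with k-means; sampling database rows keeps the sketch short.
codebooks = np.stack([db[rng.choice(len(db), K, replace=False), m * d:(m + 1) * d]
                      for m in range(M)])

codes = pq_encode(db, codebooks)                           # 2_000 x 32 bytes instead of 2_000 x 6144 bytes
query = rng.normal(size=dim).astype(np.float32)
approx = pq_search(query, codes, codebooks)
exact = ((db - query) ** 2).sum(axis=1)
print(np.corrcoef(approx, exact)[0, 1])                    # approximate vs. exact squared distances
```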

Two‑Stage Search

Typical pipelines first use HNSW + PQ or BQ to retrieve ~500 candidate vectors from a billion in milliseconds, then compute exact FP32 cosine similarity on those candidates to produce the final Top‑K results.
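
A sketch of the two‑stage pattern, using binary codes as the coarse stage; a production system would use an HNSW + PQ index there instead of a linear scan:

```python
import numpy as np

def two_stage_search(query, db, db_codes, top_k=10, n_candidates=500):
    """Stage 1: coarse binary filter; stage 2: exact FP32 cosine re-ranking."""
    q_code = np.packbits((query > 0).astype(np.uint8))
    hamming = np.unpackbits(np.bitwise_xor(db_codes, q_code), axis=1).sum(axis=1)
    candidates = np.argpartition(hamming, n_candidates)[:n_candidates]

    cand = db[candidates]                                  # exact cosine only on the survivors
    sims = (cand @ query) / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query))
    return candidates[np.argsort(-sims)[:top_k]]

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 1536)).astype(np.float32)
db_codes = np.packbits((db > 0).astype(np.uint8), axis=1)  # coarse index kept in memory
query = rng.normal(size=1536).astype(np.float32)
print(two_stage_search(query, db, db_codes))
```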

Vector Database Quantization Support

FAISS – supports SQ8, BQ, IVFPQ.

Qdrant – supports SQ8, BQ, IVFPQ.

Weaviate – supports SQ8, BQ, IVFPQ.

Pinecone – supports SQ8 and IVFPQ (no BQ).

pgvector – supports SQ8 only.

Milvus – supports SQ8, BQ, IVFPQ.

Conclusion

Quantization across signal processing, neural networks, and vector retrieval all address the same fundamental question: how much precision is truly needed? PTQ, QAT, GPTQ, AWQ, SmoothQuant, SQ, BQ, and PQ are concrete strategies that allocate a limited precision budget under various resource constraints.

References:

Frantar et al., 2022. GPTQ: Accurate Post‑Training Quantization for Generative Pre‑trained Transformers.

Lin et al., 2023. AWQ: Activation‑aware Weight Quantization for LLM Compression and Acceleration.

Xiao et al., 2022. SmoothQuant: Accurate and Efficient Post‑Training Quantization for Large Language Models.

Jégou et al., 2011. Product Quantization for Nearest Neighbor Search.

Jacob et al., 2018. Quantization and Training of Neural Networks for Efficient Integer‑Arithmetic‑Only Inference.
