How INT8 Quantization Supercharges Baidu's Search Models: Techniques and Insights
This article explores the rapid evolution of Baidu's semantic search models and the large GPU consumption they entail, and shows how INT8 quantization, combined with sensitivity analysis, calibration data augmentation, hyper‑parameter auto‑tuning, and advanced methods such as Quantization‑Aware Training and SmoothQuant, dramatically improves inference performance while preserving business metrics.
In recent years, Baidu's semantic models such as ERNIE have been widely deployed in search scenarios, consuming massive GPU resources and prompting intensive research on model compression.
Current Status of Search Semantic Models
ERNIE (Enhanced Representation through Knowledge Integration) was released in April 2019, achieving state‑of‑the‑art results on Chinese NLP tasks. Versions 1.0/2.0/3.0 are now used across relevance, ranking, and other sub‑domains, with hundreds of models serving full‑traffic online and undergoing near‑daily iterations.
Model Quantization Overview
Quantization reduces high‑precision storage and computation to low‑precision formats, offering smaller model size, lower bandwidth, and faster integer operations (e.g., INT8 Tensor Cores on NVIDIA Ampere GPUs).
Linear vs. non‑linear quantization: most research focuses on linear quantization, defined as Q = clip(round(R/S) + Z, Q_min, Q_max), where R is the high‑precision value, Q the quantized integer, S the scale, Z the zero‑point, and [Q_min, Q_max] the representable integer range; dequantization recovers R ≈ S · (Q − Z).
Symmetric vs. asymmetric quantization: symmetric quantization fixes the zero‑point Z at 0, making it easier to implement and faster at inference (dequantization is a single multiply), while asymmetric quantization uses a nonzero zero‑point to cover skewed value ranges more accurately.
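The two schemes can be sketched in a few lines of NumPy (a simplified illustration with made‑up function names, not Baidu's production code; it assumes the input tensor has a nonzero range):

```python
import numpy as np

def quantize_asymmetric(r, num_bits=8):
    # Asymmetric linear quantization: Q = clip(round(R/S) + Z, qmin, qmax).
    qmin, qmax = 0, 2**num_bits - 1                # e.g. UINT8 range [0, 255]
    rmin, rmax = float(r.min()), float(r.max())
    scale = (rmax - rmin) / (qmax - qmin)          # assumes rmax > rmin
    zero_point = int(round(qmin - rmin / scale))
    q = np.clip(np.round(r / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def quantize_symmetric(r, num_bits=8):
    # Symmetric quantization fixes Z = 0; only a scale must be stored.
    qmax = 2**(num_bits - 1) - 1                   # INT8 range [-127, 127]
    scale = float(np.abs(r).max()) / qmax
    q = np.clip(np.round(r / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale, zero_point=0):
    # R ≈ S * (Q - Z); with Z = 0 this is a single multiply.
    return (q.astype(np.float32) - zero_point) * scale
```

For either scheme, the round‑trip error of any in‑range value is bounded by one quantization step (the scale S).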
Quantization Granularity
Per‑layer, per‑group, and per‑channel quantization: per‑layer (one scale for the whole tensor) is the simplest and fastest choice for inputs, while per‑channel (one scale per output channel) offers higher accuracy for weights.
Saturation vs. non‑saturation mapping: weights, whose distributions are roughly symmetric, typically use the non‑saturating mapping (the absolute maximum maps to 127), while inputs/outputs with uneven, outlier‑heavy distributions use a saturating mapping that clips values beyond a calibrated threshold.
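The granularity and mapping choices above can be illustrated as follows (a sketch: per‑channel non‑saturating scales for weights, and percentile clipping as a simple stand‑in for the entropy/KL‑style calibration real toolkits use):

```python
import numpy as np

def per_channel_scales(weight, num_bits=8):
    # Per-channel symmetric scales for a weight matrix of shape
    # [out_channels, in_channels]; non-saturating: each channel's
    # absolute maximum maps to 127.
    qmax = 2**(num_bits - 1) - 1
    return np.abs(weight).max(axis=1) / qmax

def saturating_threshold(activations, percentile=99.99):
    # Saturating calibration for activations: clip at a high percentile
    # of |x| instead of the absolute max, discarding rare outliers that
    # would otherwise stretch the scale and waste INT8 resolution.
    return np.percentile(np.abs(activations), percentile)
```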
Post‑Training Quantization (PTQ) vs. Quantization‑Aware Training (QAT): PTQ provides the best cost‑performance trade‑off, while QAT is applied when PTQ loss exceeds acceptable limits.
Sensitivity Analysis
Quantization errors accumulate across layers; deeper or wider models suffer larger losses. By measuring the impact of each fully‑connected (FC) operator on end‑to‑end metrics, less sensitive operators can be skipped, achieving >30% speedup with minimal accuracy loss.
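A minimal sketch of the per‑FC sensitivity measurement (the `eval_fn` and `quantize_fn` helpers are hypothetical placeholders for the real evaluation and quantization pipeline):

```python
def rank_fc_sensitivity(model, fc_names, eval_fn, quantize_fn):
    """Rank FC operators by how much quantizing each one alone hurts
    the end-to-end metric.

    eval_fn(model) -> metric score; quantize_fn(model, only=[name])
    returns a copy with only that FC quantized. Both are assumed
    helpers wrapping the real inference engine.
    """
    baseline = eval_fn(model)
    drops = {}
    for name in fc_names:
        q_model = quantize_fn(model, only=[name])   # quantize one FC at a time
        drops[name] = baseline - eval_fn(q_model)   # metric loss from this FC
    # Most sensitive FCs first; keep the top-k in FP16/FP32 and
    # quantize the rest.
    return sorted(drops, key=drops.get, reverse=True)
```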
Case Studies
Case 1: Skipping the eight most sensitive FCs reduced offline metric loss to near‑zero while retaining >30% acceleration. Case 2: Skipping only the most sensitive FC restored metric loss from ~2% to acceptable levels.
Calibration Data Augmentation
Calibration data quality heavily influences quantization loss. Mixing training data from multiple sub‑tasks improves calibration for multi‑head ERNIE models, leading to balanced offline performance across tasks.
Hyper‑Parameter Auto‑Tuning
Automated search over calibration algorithms, batch sizes, and bias‑correction settings, scored by the Earth Mover’s Distance (EMD) between FP32 and INT8 output distributions, builds a random tree of parameter sets and iteratively narrows it toward the optimal configuration, outperforming manual tuning.
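The scoring loop at the heart of such a search can be sketched as follows. This is a simplified exhaustive search rather than the random‑tree strategy described above, and `run_quantized` is an assumed helper that calibrates and runs inference for one configuration:

```python
import itertools
import numpy as np

def emd_1d(a, b):
    # 1-D Earth Mover's Distance between equal-sized score samples:
    # the average gap between the two sorted samples.
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def search_quant_config(fp32_scores, run_quantized, calib_algos, batch_sizes):
    """Pick the (calibration algorithm, batch size) pair whose INT8
    score distribution is closest (lowest EMD) to the FP32 baseline."""
    best_cfg, best_emd = None, float("inf")
    for algo, bs in itertools.product(calib_algos, batch_sizes):
        int8_scores = run_quantized(algo, bs)      # assumed helper
        d = emd_1d(fp32_scores, int8_scores)
        if d < best_emd:
            best_cfg, best_emd = (algo, bs), d
    return best_cfg, best_emd
```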
Evaluation
Beyond traditional offline metrics, distribution‑based measures (e.g., EMD, score bucket histograms) provide a more comprehensive view of quantization impact.
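A score‑bucket comparison of this kind is straightforward to compute; the sketch below (an illustration, assuming scores normalized to [0, 1]) reports the fraction of traffic whose score moved to a different bucket after quantization:

```python
import numpy as np

def score_bucket_shift(fp32_scores, int8_scores, n_buckets=10):
    # Histogram both score distributions into fixed buckets and
    # measure the total variation distance between the histograms:
    # the share of samples that changed buckets.
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    h_fp32, _ = np.histogram(fp32_scores, bins=edges)
    h_int8, _ = np.histogram(int8_scores, bins=edges)
    n = len(fp32_scores)
    return np.abs(h_fp32 - h_int8).sum() / (2 * n)
```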
Quantization‑Aware Training (QAT)
Non‑intrusive QAT operates on the inference model directly: it restores gradient computation, inserts fake‑quant ops, and uses block‑wise loss supervision to reduce quantization error without full retraining.
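The fake‑quant op and its usual straight‑through‑estimator (STE) gradient can be sketched in NumPy (an illustration of the standard technique, not Baidu's implementation):

```python
import numpy as np

def fake_quant(x, scale, qmax=127):
    # Fake-quant op: quantize then immediately dequantize, so downstream
    # layers see the INT8 rounding error during training while all
    # tensors stay in floating point.
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def fake_quant_grad(x, scale, qmax=127):
    # Straight-through estimator: round() has zero gradient almost
    # everywhere, so QAT passes the gradient through unchanged for
    # values inside the clip range and zeroes it outside.
    return ((x / scale >= -qmax) & (x / scale <= qmax)).astype(np.float32)
```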
SmoothQuant
SmoothQuant is a training‑free, accuracy‑preserving technique that shifts quantization difficulty from activations to weights via mathematical equivalence, enabling effective INT8 quantization for large models.
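The core equivalence is a per‑channel rescaling: activations are divided by a smoothing factor s and the matching weight rows are multiplied by s, so the matrix product is unchanged while activation outliers are flattened. A minimal sketch, following the SmoothQuant formulation with migration strength α:

```python
import numpy as np

def smooth(activation_absmax, weight, alpha=0.5):
    """SmoothQuant-style smoothing.

    activation_absmax: per-input-channel |max| collected on calibration data.
    weight: [in_channels, out_channels].
    Returns the per-channel factor s and the rescaled weight; at runtime
    the activations are divided by s, so X @ W == (X / s) @ (W * s).
    """
    w_absmax = np.abs(weight).max(axis=1)                  # per input channel
    s = activation_absmax**alpha / w_absmax**(1 - alpha)   # migration strength
    return s, weight * s[:, None]
```

With α = 0.5 the quantization difficulty is split evenly between activations and weights; larger α shifts more of it onto the weights.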
Outlook
INT8 quantization is now deployed at scale, improving GPU utilization and supporting more complex models. Future work includes lower‑bit quantization (e.g., INT4), combined INT8 + token pruning, and platform‑wide pipelines to accelerate model lifecycle management, with ongoing research on PTQ‑Weight‑Only, LLM.int8(), and SmoothQuant for large language models.