How INT8 Quantization Supercharges Baidu's Search Models: Techniques and Insights
This article explores the rapid evolution of Baidu's semantic search models and the large GPU consumption they entail, and shows how INT8 quantization, combined with sensitivity analysis, calibration data augmentation, hyper‑parameter auto‑tuning, and advanced methods such as Quantization‑Aware Training and SmoothQuant, dramatically improves inference performance while preserving business metrics.
In recent years, Baidu's semantic models such as ERNIE have been widely deployed in search scenarios, consuming massive GPU resources and prompting intensive research on model compression.
Current Status of Search Semantic Models
ERNIE (Enhanced Representation through Knowledge Integration) was released in April 2019, achieving state‑of‑the‑art results on Chinese NLP tasks. Versions 1.0/2.0/3.0 are now used across relevance, ranking, and other sub‑domains, with hundreds of models serving full‑traffic online and undergoing near‑daily iterations.
Model Quantization Overview
Quantization reduces high‑precision storage and computation to low‑precision formats, offering smaller model size, lower bandwidth, and faster integer operations (e.g., INT8 Tensor Cores on NVIDIA Ampere GPUs).
Linear vs. non‑linear quantization: most research focuses on linear quantization, defined as Q = clip(round(R/S) + Z, Q_min, Q_max), where R is the high‑precision value, Q the quantized integer, S the scale, Z the zero‑point, and [Q_min, Q_max] the representable integer range; dequantization recovers R ≈ S · (Q − Z).
Symmetric vs. asymmetric quantization: symmetric quantization fixes the zero‑point Z at 0, making it easier to implement and faster at inference (dequantization is a single multiply), while asymmetric quantization uses a nonzero zero‑point to cover skewed value ranges more accurately.
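The two schemes can be sketched in a few lines of NumPy (a simplified illustration with made‑up function names, not Baidu's production code; it assumes the input tensor has a nonzero range):

```python
import numpy as np

def quantize_asymmetric(r, num_bits=8):
    # Asymmetric linear quantization: Q = clip(round(R/S) + Z, qmin, qmax).
    qmin, qmax = 0, 2**num_bits - 1                # e.g. UINT8 range [0, 255]
    rmin, rmax = float(r.min()), float(r.max())
    scale = (rmax - rmin) / (qmax - qmin)          # assumes rmax > rmin
    zero_point = int(round(qmin - rmin / scale))
    q = np.clip(np.round(r / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def quantize_symmetric(r, num_bits=8):
    # Symmetric quantization fixes Z = 0; only a scale must be stored.
    qmax = 2**(num_bits - 1) - 1                   # INT8 range [-127, 127]
    scale = float(np.abs(r).max()) / qmax
    q = np.clip(np.round(r / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale, zero_point=0):
    # R ≈ S * (Q - Z); with Z = 0 this is a single multiply.
    return (q.astype(np.float32) - zero_point) * scale
```

For either scheme, the round‑trip error of any in‑range value is bounded by one quantization step (the scale S).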
Quantization Granularity
Per‑layer, per‑group, and per‑channel quantization: per‑layer (one scale for the whole tensor) is the simplest and fastest choice for inputs, while per‑channel (one scale per output channel) offers higher accuracy for weights.
Saturation vs. non‑saturation mapping: weights, whose distributions are roughly symmetric, typically use the non‑saturating mapping (the absolute maximum maps to 127), while inputs/outputs with uneven, outlier‑heavy distributions use a saturating mapping that clips values beyond a calibrated threshold.
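The granularity and mapping choices above can be illustrated as follows (a sketch: per‑channel non‑saturating scales for weights, and percentile clipping as a simple stand‑in for the entropy/KL‑style calibration real toolkits use):

```python
import numpy as np

def per_channel_scales(weight, num_bits=8):
    # Per-channel symmetric scales for a weight matrix of shape
    # [out_channels, in_channels]; non-saturating: each channel's
    # absolute maximum maps to 127.
    qmax = 2**(num_bits - 1) - 1
    return np.abs(weight).max(axis=1) / qmax

def saturating_threshold(activations, percentile=99.99):
    # Saturating calibration for activations: clip at a high percentile
    # of |x| instead of the absolute max, discarding rare outliers that
    # would otherwise stretch the scale and waste INT8 resolution.
    return np.percentile(np.abs(activations), percentile)
```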
Post‑Training Quantization (PTQ) vs. Quantization‑Aware Training (QAT): PTQ provides the best cost‑performance trade‑off, while QAT is applied when PTQ loss exceeds acceptable limits.
Sensitivity Analysis
Quantization errors accumulate across layers; deeper or wider models suffer larger losses. By measuring the impact of each fully‑connected (FC) operator on end‑to‑end metrics, less sensitive operators can be skipped, achieving >30% speedup with minimal accuracy loss.
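A minimal sketch of the per‑FC sensitivity measurement (the `eval_fn` and `quantize_fn` helpers are hypothetical placeholders for the real evaluation and quantization pipeline):

```python
def rank_fc_sensitivity(model, fc_names, eval_fn, quantize_fn):
    """Rank FC operators by how much quantizing each one alone hurts
    the end-to-end metric.

    eval_fn(model) -> metric score; quantize_fn(model, only=[name])
    returns a copy with only that FC quantized. Both are assumed
    helpers wrapping the real inference engine.
    """
    baseline = eval_fn(model)
    drops = {}
    for name in fc_names:
        q_model = quantize_fn(model, only=[name])   # quantize one FC at a time
        drops[name] = baseline - eval_fn(q_model)   # metric loss from this FC
    # Most sensitive FCs first; keep the top-k in FP16/FP32 and
    # quantize the rest.
    return sorted(drops, key=drops.get, reverse=True)
```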
Case Studies
Case 1: Skipping the eight most sensitive FCs reduced offline metric loss to near‑zero while retaining >30% acceleration. Case 2: Skipping only the most sensitive FC restored metric loss from ~2% to acceptable levels.
Calibration Data Augmentation
Calibration data quality heavily influences quantization loss. Mixing training data from multiple sub‑tasks improves calibration for multi‑head ERNIE models, leading to balanced offline performance across tasks.
Hyper‑Parameter Auto‑Tuning
Automated search over calibration algorithms, batch sizes, and bias‑correction settings, scored by the Earth Mover’s Distance (EMD) between FP32 and INT8 output distributions, builds a random tree of parameter sets and iteratively narrows it toward the optimal configuration, outperforming manual tuning.
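The scoring loop at the heart of such a search can be sketched as follows. This is a simplified exhaustive search rather than the random‑tree strategy described above, and `run_quantized` is an assumed helper that calibrates and runs inference for one configuration:

```python
import itertools
import numpy as np

def emd_1d(a, b):
    # 1-D Earth Mover's Distance between equal-sized score samples:
    # the average gap between the two sorted samples.
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def search_quant_config(fp32_scores, run_quantized, calib_algos, batch_sizes):
    """Pick the (calibration algorithm, batch size) pair whose INT8
    score distribution is closest (lowest EMD) to the FP32 baseline."""
    best_cfg, best_emd = None, float("inf")
    for algo, bs in itertools.product(calib_algos, batch_sizes):
        int8_scores = run_quantized(algo, bs)      # assumed helper
        d = emd_1d(fp32_scores, int8_scores)
        if d < best_emd:
            best_cfg, best_emd = (algo, bs), d
    return best_cfg, best_emd
```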
Evaluation
Beyond traditional offline metrics, distribution‑based measures (e.g., EMD, score bucket histograms) provide a more comprehensive view of quantization impact.
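A score‑bucket comparison of this kind is straightforward to compute; the sketch below (an illustration, assuming scores normalized to [0, 1]) reports the fraction of traffic whose score moved to a different bucket after quantization:

```python
import numpy as np

def score_bucket_shift(fp32_scores, int8_scores, n_buckets=10):
    # Histogram both score distributions into fixed buckets and
    # measure the total variation distance between the histograms:
    # the share of samples that changed buckets.
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    h_fp32, _ = np.histogram(fp32_scores, bins=edges)
    h_int8, _ = np.histogram(int8_scores, bins=edges)
    n = len(fp32_scores)
    return np.abs(h_fp32 - h_int8).sum() / (2 * n)
```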
Quantization‑Aware Training (QAT)
Non‑intrusive QAT operates on the inference model directly: it restores gradient computation, inserts fake‑quant ops, and uses block‑wise loss supervision to reduce quantization error without full retraining.
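The fake‑quant op and its usual straight‑through‑estimator (STE) gradient can be sketched in NumPy (an illustration of the standard technique, not Baidu's implementation):

```python
import numpy as np

def fake_quant(x, scale, qmax=127):
    # Fake-quant op: quantize then immediately dequantize, so downstream
    # layers see the INT8 rounding error during training while all
    # tensors stay in floating point.
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def fake_quant_grad(x, scale, qmax=127):
    # Straight-through estimator: round() has zero gradient almost
    # everywhere, so QAT passes the gradient through unchanged for
    # values inside the clip range and zeroes it outside.
    return ((x / scale >= -qmax) & (x / scale <= qmax)).astype(np.float32)
```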
SmoothQuant
SmoothQuant is a training‑free, accuracy‑preserving technique that shifts quantization difficulty from activations to weights via mathematical equivalence, enabling effective INT8 quantization for large models.
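The core equivalence is a per‑channel rescaling: activations are divided by a smoothing factor s and the matching weight rows are multiplied by s, so the matrix product is unchanged while activation outliers are flattened. A minimal sketch, following the SmoothQuant formulation with migration strength α:

```python
import numpy as np

def smooth(activation_absmax, weight, alpha=0.5):
    """SmoothQuant-style smoothing.

    activation_absmax: per-input-channel |max| collected on calibration data.
    weight: [in_channels, out_channels].
    Returns the per-channel factor s and the rescaled weight; at runtime
    the activations are divided by s, so X @ W == (X / s) @ (W * s).
    """
    w_absmax = np.abs(weight).max(axis=1)                  # per input channel
    s = activation_absmax**alpha / w_absmax**(1 - alpha)   # migration strength
    return s, weight * s[:, None]
```

With α = 0.5 the quantization difficulty is split evenly between activations and weights; larger α shifts more of it onto the weights.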
Outlook
INT8 quantization is now deployed at scale, improving GPU utilization and supporting more complex models. Future work includes lower‑bit quantization (e.g., INT4), combined INT8 + token pruning, and platform‑wide pipelines to accelerate model lifecycle management, with ongoing research on PTQ‑Weight‑Only, LLM.int8(), and SmoothQuant for large language models.