Artificial Intelligence · 15 min read

INT8 Quantization for Baidu Search Semantic Models (ERNIE)

Baidu applied large‑scale INT8 quantization to its ERNIE search semantic models, achieving over 25% inference speedup with less than 1% degradation in relevance metrics. The gains came from selectively quantizing less‑sensitive fully‑connected layers, combined with automated calibration, hyper‑parameter tuning, and techniques such as QAT and SmoothQuant, paving the way for even lower‑bit quantization and token pruning.

Baidu Geek Talk

In recent years, semantic models such as ERNIE have been widely deployed in Baidu's search scenarios, consuming massive GPU resources. To reduce resource consumption while maintaining inference performance, business effectiveness, and iteration efficiency, large‑scale INT8 quantization has been applied to these models.

Key achievements include:

Average inference speedup >25% using INT8 quantization.

Business metric impact limited to a <1% difference on critical relevance tasks.

Offline quantization pipelines can produce models within hours, supporting rapid model iteration.

The quantization process is essentially a mapping from high‑precision (float) values to low‑precision (int) representations. Its main advantages are reduced storage and memory bandwidth, as well as faster integer arithmetic, especially on NVIDIA Ampere GPUs with dedicated INT8 Tensor Cores.
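The mapping described above can be sketched in a few lines of numpy. This is a minimal illustration of linear quantize/dequantize round-trips, not Baidu's production kernel; the symmetric max-calibrated scale is one common choice of calibration among several.

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Linear quantization: Q = clip(round(R/S) + Z, qmin, qmax)."""
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return q.astype(np.int8)

def dequantize(q, scale, zero_point):
    """Approximate recovery of the float values: R ≈ S * (Q - Z)."""
    return scale * (q.astype(np.float32) - zero_point)

# Symmetric per-tensor scale calibrated from the max absolute value
x = np.array([0.5, -1.2, 3.3, -3.1], dtype=np.float32)
scale = np.abs(x).max() / 127.0
q = quantize(x, scale, zero_point=0)
x_hat = dequantize(q, scale, zero_point=0)
```

The round-trip error of any in-range value is bounded by half the scale, which is why a tight calibration range matters so much for accuracy.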

Quantization techniques are categorized by:

Mapping function linearity: linear vs. non‑linear (most research focuses on linear quantization, defined as Q = clip(round(R/S) + Z, Q_min, Q_max), where R is the high‑precision value, S the scale, Z the zero‑point, and [Q_min, Q_max] the representable integer range).

Symmetry: symmetric vs. asymmetric quantization.

Granularity: per‑layer, per‑group, and per‑channel quantization.

Saturation: saturated vs. non‑saturated quantization.

Training involvement: Post‑Training Quantization (PTQ) vs. Quantization‑Aware Training (QAT).
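The granularity axis in particular is easy to demonstrate. Below is a toy numpy comparison (the weight matrix and its channel ranges are invented for illustration) of per‑tensor versus per‑channel scales: when channels have very different dynamic ranges, one shared scale wastes precision on the small-range channels.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weight matrix whose rows (output channels) differ widely in range
W = rng.normal(size=(4, 8)).astype(np.float32)
W *= np.array([[0.1], [1.0], [5.0], [0.5]], dtype=np.float32)

def quant_dequant(w, scale):
    """Symmetric INT8 round-trip with the given scale(s)."""
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

# Per-tensor: a single scale for the whole matrix
s_tensor = np.abs(W).max() / 127.0
err_tensor = np.abs(W - quant_dequant(W, s_tensor)).mean()

# Per-channel: one scale per output channel (row)
s_channel = np.abs(W).max(axis=1, keepdims=True) / 127.0
err_channel = np.abs(W - quant_dequant(W, s_channel)).mean()
```

Here `err_channel` comes out lower than `err_tensor`: finer granularity tracks each channel's range at the cost of storing more scales.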

For ERNIE models, a detailed sensitivity analysis was performed on fully‑connected (FC) operators. FCs were split into four groups: QKV‑FC, multi‑head‑attention FC, FFN FC, and business‑specific FC. By measuring Earth Mover's Distance (EMD) between original and quantized outputs, the most sensitive FCs were identified and excluded from quantization, achieving >30% speedup with <1.4% accuracy loss (Case 1) or even smaller loss when only the most sensitive FC was skipped (Case 2).
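The sensitivity ranking can be sketched as follows. For equal-size 1‑D samples, the Earth Mover's Distance reduces to the mean absolute difference of the sorted values, so no special library is needed; the layer names and noise levels below are hypothetical stand-ins for real FC outputs.

```python
import numpy as np

def emd_1d(a, b):
    """1-D Earth Mover's Distance between equal-size samples:
    mean absolute difference of the sorted values."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rng = np.random.default_rng(1)
layers = {}
# Hypothetical FC groups; noise stands in for quantization error
for name, noise in [("qkv_fc", 0.01), ("attn_out_fc", 0.02), ("ffn_fc", 0.15)]:
    ref = rng.normal(size=1024)           # FP32 output of this FC
    quantized = ref + rng.normal(scale=noise, size=1024)
    layers[name] = emd_1d(ref, quantized)

# Keep the most sensitive FC(s) in higher precision, quantize the rest
skip = sorted(layers, key=layers.get, reverse=True)[:1]
```

Ranking layers by EMD and excluding the worst offenders is what trades a little speedup for most of the accuracy, matching the Case 1 / Case 2 trade-off above.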

Calibration data quality also heavily influences quantization loss. For multi‑head output models, mixing calibration data from different tasks improved downstream performance.

Hyper‑parameter auto‑tuning was introduced to search optimal quantization configurations. The method samples parameter sets, builds a random tree based on EMD, and iteratively refines until convergence, outperforming manual tuning.
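A drastically simplified sketch of such a tuner: plain random search that keeps the best-scoring configuration (the article's method additionally builds and refines a random tree over the EMD scores). The search space, parameter names, and toy loss below are all invented for illustration.

```python
import random

def auto_tune(score, sample_config, n_iter=200, seed=0):
    """Random search over quantization configs, keeping the lowest score."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(n_iter):
        cfg = sample_config(rng)
        s = score(cfg)
        if s < best_score:
            best_cfg, best_score = cfg, s
    return best_cfg, best_score

# Hypothetical search space: calibration set size and activation clipping
space = {"calib_batches": [8, 16, 32], "percentile": [99.9, 99.99, 100.0]}
sample = lambda rng: {k: rng.choice(v) for k, v in space.items()}

# Toy loss: distance to a pretend-optimal config (stands in for measured EMD)
target = {"calib_batches": 32, "percentile": 99.99}
loss = lambda cfg: sum(cfg[k] != target[k] for k in cfg)

best, best_loss = auto_tune(loss, sample)
```

In practice the score would be the EMD between FP32 and quantized model outputs on calibration data, so the tuner needs no labels and can run unattended.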

Beyond PTQ, Quantization‑Aware Training (QAT) and the SmoothQuant technique were explored. SmoothQuant shifts quantization difficulty from activations to weights, enabling INT8 quantization of very large models without retraining.
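The core SmoothQuant transform is a mathematically exact rescaling. The sketch below follows the paper's per-channel smoothing factor s_j = max|X_j|^α / max|W_j|^(1−α); the shapes and the outlier channel are invented for illustration, and a real deployment would fold the factors into the preceding LayerNorm rather than rescale at runtime.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """SmoothQuant-style rescaling: Y = X @ W is unchanged, but activation
    outliers are migrated into the weights, which quantize more gracefully."""
    act_max = np.abs(X).max(axis=0)            # per-input-channel activation range
    w_max = np.abs(W).max(axis=1)              # per-input-channel weight range
    s = act_max**alpha / w_max**(1 - alpha)    # smoothing factors
    return X / s, W * s[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64)).astype(np.float32)
X[:, 0] *= 50.0                                # one outlier activation channel
W = rng.normal(size=(64, 32)).astype(np.float32)
Xs, Ws = smooth(X, W)
# Xs @ Ws equals X @ W, but Xs has a much smaller dynamic range than X
```

Because the matmul output is identical, the model needs no retraining; only the quantization difficulty moves from activations to weights.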

Future directions include lower‑bit quantization (e.g., INT4), token pruning, and platform‑level pipelines to automate the entire model lifecycle.

Tags: performance optimization · model compression · Semantic Search · INT8 quantization · ERNIE · Quantization Aware Training · SmoothQuant