How Baidu’s End‑to‑End Quantization Stack Supercharges Large‑Model Inference on Kunlun XPU
Baidu Baige built a full‑stack quantization pipeline that integrates model‑level, framework‑level, and hardware‑level optimizations on the Kunlun XPU platform, enabling FP16/BF16 large models to be compressed to 25‑50% of their original size while boosting inference speed by 30‑50% and dramatically reducing memory consumption for enterprise deployments.
End‑to‑End Quantization System
Large‑scale LLM inference requires hundreds of GB of VRAM for a 100‑billion‑parameter model in FP16, often needing 8‑16 accelerators. Quantization reduces model size to 25‑50% of the original, cuts memory usage, and improves throughput by 30‑50%, enabling the same hardware to serve 2‑4× more models or concurrent users (a worked sizing sketch follows the list below). The system addresses three challenges: (1) accuracy loss from coarse quantization, (2) difficulty of deploying diverse quantized weight formats, and (3) performance loss without hardware support. A three‑layer collaborative optimization, covering the model, framework, and hardware, was built on the Kunlun XPU platform:
Model‑level: automatic selection of the most suitable quantization algorithm based on model size, activation distribution, and weight sensitivity.
Framework‑level: adaptive inference engine that supports multiple quantized formats without code changes.
Hardware‑level: native XPU kernels that execute quantized data directly, avoiding costly de‑quantization.
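To make the sizing claims above concrete, here is a back‑of‑the‑envelope weight‑memory estimate for a 100‑billion‑parameter model at different precisions; the per‑card VRAM figure is an illustrative assumption, not a Kunlun specification, and KV cache and activation overhead are ignored.

```python
# Illustrative sizing only: card VRAM is a placeholder, and KV cache,
# activations, and framework overhead are ignored.
BYTES_PER_PARAM = {"FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

params = 100e9            # 100-billion-parameter model
vram_per_card_gb = 64     # hypothetical accelerator memory

for fmt, nbytes in BYTES_PER_PARAM.items():
    weights_gb = params * nbytes / 1e9
    cards = -(-weights_gb // vram_per_card_gb)   # ceiling division
    print(f"{fmt:>9}: {weights_gb:6.0f} GB weights -> >= {int(cards)} card(s)")
```

In practice, KV cache and serving overhead push the FP16 figure well above this floor, which is why full‑precision deployments land in the 8‑16 accelerator range cited above.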
Model Quantization Toolchain
The toolchain integrates state‑of‑the‑art algorithms (SmoothQuant, AWQ, GPTQ, RTN) and supports symmetric/asymmetric, static/dynamic, and various granularity configurations. An internal decision engine matches each model to the most suitable algorithm, as sketched after the list below:
RTN for ultra‑large models (e.g., DeepSeek) – training‑free, minutes‑level compression.
SmoothQuant for models with long‑tailed activation distributions (e.g., GLM) – smooths activations to stabilize quantization.
AWQ/GPTQ for weight‑sensitive models (e.g., Qwen) – protects critical weights for near‑lossless compression.
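A minimal sketch of how such a selection rule might look; the model traits, thresholds, and function names are illustrative assumptions, not the actual Baige decision engine.

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    """Illustrative traits a decision engine might consider."""
    num_params_b: float              # model size in billions of parameters
    activation_outlier_ratio: float  # share of long-tailed activation channels
    weight_sensitivity: float        # layer-wise sensitivity score (0-1)

def pick_quant_algorithm(p: ModelProfile) -> str:
    # Ultra-large models: RTN is training-free and finishes in minutes.
    if p.num_params_b >= 300:
        return "RTN"
    # Long-tailed activations: SmoothQuant migrates difficulty to the weights.
    if p.activation_outlier_ratio > 0.05:
        return "SmoothQuant"
    # Weight-sensitive models: AWQ/GPTQ protect the critical weights.
    if p.weight_sensitivity > 0.5:
        return "AWQ"
    return "GPTQ"

# Example: a hypothetical 32B model with mild outliers but sensitive weights.
print(pick_quant_algorithm(ModelProfile(32, 0.01, 0.7)))  # -> "AWQ"
```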
If basic quantization does not meet the target accuracy, the toolchain performs layer‑wise error analysis, keeps sensitive layers (e.g., the gate_up projection in MiMo) in FP16/BF16, and quantizes the rest to INT8, achieving mixed‑precision quantization with a negligible performance penalty.
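One way such a layer‑wise fallback could be expressed; the error metric, tolerance, and layer names are hypothetical and only illustrate the mixed‑precision idea, not the toolchain's actual interface.

```python
# Hypothetical layer-wise fallback: layers whose quantization error exceeds a
# tolerance stay in BF16, everything else is quantized to INT8.
def build_precision_plan(layer_errors: dict[str, float], tol: float = 0.02) -> dict[str, str]:
    return {name: ("bfloat16" if err > tol else "int8")
            for name, err in layer_errors.items()}

# Example errors from a (made-up) layer-wise analysis pass.
errors = {
    "layers.0.self_attn.qkv_proj": 0.004,
    "layers.0.mlp.gate_up_proj":   0.031,   # sensitive layer, falls back to BF16
    "layers.0.mlp.down_proj":      0.007,
}
print(build_precision_plan(errors))
# {'layers.0.self_attn.qkv_proj': 'int8', 'layers.0.mlp.gate_up_proj': 'bfloat16', ...}
```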
For INT4 deployment, a Kunlun‑specific packed storage format aligns weight layout with the XPU’s warp/SIMD access granularity, eliminating runtime unpacking and reshaping.
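The storage principle, packing eight 4‑bit weights into one 32‑bit word, can be sketched as follows; the actual Kunlun layout additionally reorders elements to match warp/SIMD access granularity and is not reproduced here.

```python
import numpy as np

def pack_int4(weights: np.ndarray) -> np.ndarray:
    """Pack signed INT4 values (range [-8, 7]) into 32-bit words, 8 per word.

    Storage principle only: the Kunlun-specific format also interleaves
    elements so a warp/SIMD lane reads its operands contiguously.
    """
    assert weights.size % 8 == 0
    nibbles = (weights.astype(np.int64) & 0xF).reshape(-1, 8)  # two's-complement nibbles
    words = np.zeros(nibbles.shape[0], dtype=np.int64)
    for i in range(8):
        words |= nibbles[:, i] << (4 * i)   # nibble i occupies bits [4i, 4i+4)
    return words.astype(np.uint32)

w = np.array([-8, -1, 0, 1, 2, 3, 4, 7], dtype=np.int8)
print(hex(int(pack_int4(w)[0])))  # -> 0x743210f8
```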
Framework‑Side Quantized Inference
The vLLM‑Kunlun Plugin provides out‑of‑the‑box support for INT8 and INT4 models, automatically handling various quantization formats and model structures (dense, MoE, multimodal). Users load a quantized model without modifying code and obtain both memory savings and speed gains.
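Loading such a checkpoint is intended to look like ordinary vLLM usage; the model path below is a placeholder, and it is assumed the plugin reads the quantization format from the checkpoint's config.

```python
from vllm import LLM, SamplingParams

# Placeholder path for an INT8/INT4 checkpoint produced by the toolchain;
# the quantization format is picked up from the checkpoint config.
llm = LLM(model="/models/qwen3-235b-a22b-int8")

outputs = llm.generate(
    ["Explain weight-only INT4 quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```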
INT8 inference on the Qwen3‑235B‑A22B model with a 16k context length yields roughly 1.5× the FP16 throughput across all concurrency levels. INT4 inference loads packed weights and de‑quantizes them to FP16/BF16 for computation, cutting weight memory to about one quarter of FP16 and freeing room for larger KV caches.
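A rough illustration of the KV‑cache headroom gained when weights shrink to a quarter; the card memory, weight sizes, and per‑token cache footprint below are placeholders, not measured Kunlun numbers.

```python
def max_kv_tokens(card_mem_gb: float, weights_gb: float, kv_kb_per_token: float) -> int:
    """Tokens of KV cache that fit on one card once weights are resident (illustrative)."""
    free_bytes = (card_mem_gb - weights_gb) * 1e9
    return int(free_bytes // (kv_kb_per_token * 1e3))

CARD_GB = 64           # hypothetical per-card memory
KV_KB_PER_TOKEN = 96   # hypothetical per-token K+V footprint on this card

for label, w_gb in [("BF16 weights", 40.0), ("INT4 weights", 10.0)]:
    print(f"{label}: ~{max_kv_tokens(CARD_GB, w_gb, KV_KB_PER_TOKEN):,} cacheable tokens")
```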
Operator‑Level Quantization Acceleration
A native operator suite (static/dynamic INT8 kernels, cutlass_scaled_mm, awq_dequantize, gptq_shuffle, and others) operates on quantized data directly on the hardware. Fused kernels such as awq_gemm, gptq_gemm, and wna16_gemm combine de‑quantization, matrix multiplication, and post‑processing into a single operation, drastically reducing memory traffic and scheduling overhead and delivering higher throughput and lower latency for large‑scale serving.
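To show what such fusion buys, here is a NumPy reference of the math an awq_gemm‑style kernel performs in one pass (de‑quantize, multiply, add bias); it mirrors the arithmetic only, not the XPU kernel, which never materializes the full float weight matrix.

```python
import numpy as np

def dequant(qweight, scales, zeros):
    """Reference de-quantization: int4-range integers -> float weights."""
    return (qweight.astype(np.float32) - zeros) * scales

def fused_w4a16_gemm(x, qweight, scales, zeros, bias):
    """What awq_gemm-style kernels fuse into one pass: dequant + matmul + bias.

    A real fused kernel de-quantizes tiles in registers/shared memory while
    multiplying, so the float weight matrix never hits global memory.
    """
    return x @ dequant(qweight, scales, zeros) + bias

# Toy shapes with per-output-channel scale/zero-point (single group per column).
x = np.random.randn(4, 16).astype(np.float32)
qw = np.random.randint(-8, 8, size=(16, 8))
scales = np.full((1, 8), 0.05, dtype=np.float32)
zeros = np.zeros((1, 8), dtype=np.float32)
bias = np.zeros(8, dtype=np.float32)
print(fused_w4a16_gemm(x, qw, scales, zeros, bias).shape)  # (4, 8)
```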
Future extensions include support for sparse models, multimodal LLMs, KV‑cache‑aware quantization, LoRA, multi‑token prediction (MTP), and prefill/decode (PD) separation, further integrating quantization into the core inference path and reducing migration costs across frameworks.