Tagged articles
3 articles
Baidu Intelligent Cloud Tech Hub
Mar 6, 2026 · Artificial Intelligence

How Baidu’s End‑to‑End Quantization Stack Supercharges Large‑Model Inference on Kunlun XPU

Baidu Baige built a full‑stack quantization pipeline that integrates model‑level, framework‑level, and hardware‑level optimizations on the Kunlun XPU platform, enabling FP16/BF16 large models to be compressed to 25‑50% of their original size while boosting inference speed by 30‑50% and dramatically reducing memory consumption for enterprise deployments.

AI inference · Hardware acceleration · INT4
16 min read
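
The 25‑50% size figure in the summary above is consistent with simple bit-width arithmetic. The sketch below assumes the range corresponds to storing 16-bit weights in 4 or 8 bits (the summary does not state this explicitly), ignores the overhead of scales and zero points, and uses a hypothetical 7-billion-parameter model. It is an illustration, not Baidu's pipeline.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight // 8


def size_ratio(src_bits: int, dst_bits: int) -> float:
    """Quantized size as a fraction of the original size."""
    return dst_bits / src_bits


# Hypothetical parameter count, chosen only for illustration.
NUM_PARAMS = 7_000_000_000

fp16_bytes = weight_bytes(NUM_PARAMS, 16)
for bits in (8, 4):
    q_bytes = weight_bytes(NUM_PARAMS, bits)
    print(f"INT{bits}: {q_bytes / 2**30:.1f} GiB, "
          f"{size_ratio(16, bits):.0%} of the FP16/BF16 weights "
          f"({fp16_bytes / 2**30:.1f} GiB)")
```

Real deployments land somewhat above these ratios, because per-group scales are stored alongside the quantized weights and some layers are often kept in higher precision.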
DaTaobao Tech
Oct 16, 2024 · Artificial Intelligence

Dynamic Quantization and Matrix Multiplication Optimization in MNN CPU Backend

The article details MNN’s CPU backend dynamic quantization for Transformer‑type models, describing runtime int8 conversion, block‑wise matrix‑multiply optimizations using ARM SMMLA/SDOT and AVX‑512 VNNI, weight‑group and batch‑wise quantization techniques, and reports up to three‑fold speed‑ups on Snapdragon 8 Gen 3.

CPU optimization · Dynamic Quantization · INT8
19 min read
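
The runtime int8 conversion mentioned in the summary above can be illustrated with a minimal sketch. It assumes symmetric per-row quantization of activations and per-output-channel quantization of weights, which is a common scheme but is not confirmed as MNN's exact method. The function names are hypothetical, and the plain Python loops stand in for the block-wise SMMLA/SDOT/VNNI kernels the article describes.

```python
def quantize_row(row):
    """Symmetric int8 quantization of one row: returns (int values, scale)."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale


def dynamic_quant_matmul(x, w_q, w_scales):
    """Compute x @ W^T with int8 operands and integer accumulation.

    x        : float activations, shape [M][K], quantized at runtime
    w_q      : pre-quantized int8 weights, shape [N][K]
    w_scales : one float scale per output channel, length N
    """
    out = []
    for row in x:
        x_q, x_scale = quantize_row(row)  # dynamic (runtime) quantization
        out_row = []
        for w_row, w_scale in zip(w_q, w_scales):
            acc = sum(a * b for a, b in zip(x_q, w_row))  # integer accumulate
            out_row.append(acc * x_scale * w_scale)       # dequantize
        out.append(out_row)
    return out


# Weights are quantized once, offline; activations on every call.
w = [[1.0, 0.5, -0.5, 0.25], [-2.0, 1.0, 0.0, 0.5]]
w_q, w_scales = zip(*(quantize_row(r) for r in w))
x = [[0.5, -1.0, 0.25, 2.0]]
y = dynamic_quant_matmul(x, w_q, w_scales)
```

For this input the float reference result is `[[0.375, -1.0]]`; the quantized result differs from it by a small rounding error, which is the accuracy cost traded for integer arithmetic.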
DaTaobao Tech
Nov 24, 2023 · Artificial Intelligence

Performance Optimization of Depthwise Conv Int8 on ARM CPUs

By converting the input to a C16 layout and exploiting the ARMv8.2 SDOT instruction, the Int8 depthwise‑convolution operator on ARM CPUs is accelerated from 4.46 ms to 1.75 ms (about a 2.5× speedup), though the required data‑rearrangement overhead keeps it from overtaking FP16 performance.

ARM · DepthwiseConvolution · INT8
10 min read
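
To make the SDOT reference in the summary above concrete, the sketch below emulates the vector form of the instruction (`SDOT Vd.4S, Vn.16B, Vm.16B`): each of the four 32-bit accumulator lanes adds the dot product of four adjacent signed 8-bit pairs. The Python function is a hypothetical emulation that ignores 32-bit wraparound. How the article maps the operator's inputs onto these four-element groups is not described in the summary and is not shown here.

```python
def sdot(acc, a, b):
    """Emulate SDOT Vd.4S, Vn.16B, Vm.16B.

    acc : four int32 accumulator lanes
    a,b : sixteen signed 8-bit values each
    Lane i receives acc[i] + sum of a[4i+j] * b[4i+j] for j in 0..3.
    """
    assert len(acc) == 4 and len(a) == 16 and len(b) == 16
    assert all(-128 <= v <= 127 for v in a + b)
    return [
        acc[i] + sum(a[4 * i + j] * b[4 * i + j] for j in range(4))
        for i in range(4)
    ]


lanes = sdot([0, 0, 0, 0], list(range(16)), [1] * 16)
```

Because each lane reduces over four consecutive bytes, operands have to be arranged so that the values to be summed sit next to each other, which is the kind of data rearrangement the summary identifies as overhead.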