Tagged articles

INT8

4 articles · Page 1 of 1

Jun 17, 2026 · Artificial Intelligence

Model Quantization: INT8, INT4, and AWQ/GPTQ – Choosing the Right Compression for Production

This article explains how INT8 and INT4 quantization, applied through bitsandbytes, GPTQ, and AWQ, can cut memory usage, speed up inference, and lower serving costs for large language models. It covers the trade-offs, practical workflows, benchmark results, and common pitfalls to help engineers decide which technique best fits their production scenario.

AWQ · GPTQ · INT4

22 min read

Baidu Intelligent Cloud Tech Hub

Mar 6, 2026 · Artificial Intelligence

How Baidu’s End‑to‑End Quantization Stack Supercharges Large‑Model Inference on Kunlun XPU

Baidu Baige built a full‑stack quantization pipeline that integrates model‑level, framework‑level, and hardware‑level optimizations on the Kunlun XPU platform, enabling FP16/BF16 large models to be compressed to 25‑50% of their original size while boosting inference speed by 30‑50% and dramatically reducing memory consumption for enterprise deployments.

AI inference · INT4 · INT8

16 min read

DaTaobao Tech

Oct 16, 2024 · Artificial Intelligence

Dynamic Quantization and Matrix Multiplication Optimization in MNN CPU Backend

The article details dynamic quantization in MNN's CPU backend for Transformer-type models: runtime INT8 conversion, block-wise matrix-multiply optimizations using ARM SMMLA/SDOT and AVX-512 VNNI, and weight-group and batch-wise quantization techniques. It reports speed-ups of up to 3× on Snapdragon 8 Gen 3.

CPU optimization · Dynamic Quantization · INT8

19 min read

DaTaobao Tech

Nov 24, 2023 · Artificial Intelligence

Performance Optimization of Depthwise Conv Int8 on ARM CPUs

By converting the input to a C16 layout and using the Armv8.2 SDOT instruction, the INT8 depthwise-convolution operator on ARM CPUs drops from 4.46 ms to 1.75 ms, roughly a 2.5× speedup, though the data-rearrangement overhead still keeps it from overtaking FP16 performance.

Arm · DepthwiseConvolution · INT8

10 min read