
Dynamic Quantization and Matrix Multiplication Optimization in MNN CPU Backend

The article details MNN’s CPU backend dynamic quantization for Transformer‑type models, describing runtime int8 conversion, block‑wise matrix‑multiply optimizations using ARM SMMLA/SDOT and AVX‑512 VNNI, weight‑group and batch‑wise quantization techniques, and reports up to three‑fold speed‑ups on Snapdragon 8 Gen 3.

DaTaobao Tech

This article introduces MNN's high‑performance inference engine, focusing on dynamic quantization for Transformer‑type models on mobile devices and its application to other scenarios such as speech recognition.

Dynamic quantization converts floating-point feature maps to int8 at runtime, multiplies them with int8 or int4 weights, and de-quantizes the result back to float. The article describes the mathematical formulation, the symbol definitions (x, x0, sx, y, y0, sy, c), and the de-quantization process in detail.
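The runtime quantize/de-quantize round trip can be sketched as follows. This is a minimal illustration of the scheme described above (asymmetric int8 with a per-tensor scale and zero point), not MNN's actual implementation; the names `quantizeDynamic` and `dequantize` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch: derive scale/zero-point from the tensor's runtime min/max,
// quantize to int8, and de-quantize back to float.
struct Quantized {
    std::vector<int8_t> data;
    float scale;      // sx in the article's notation
    int zeroPoint;    // x0 in the article's notation
};

Quantized quantizeDynamic(const std::vector<float>& x) {
    float mn = *std::min_element(x.begin(), x.end());
    float mx = *std::max_element(x.begin(), x.end());
    float scale = (mx - mn) / 255.0f;        // map the range onto 256 int8 codes
    if (scale == 0.0f) scale = 1.0f;         // degenerate all-equal tensor
    int zero = static_cast<int>(std::lround(-128.0f - mn / scale));
    Quantized q{{}, scale, zero};
    q.data.reserve(x.size());
    for (float v : x) {
        int code = static_cast<int>(std::lround(v / scale)) + zero;
        q.data.push_back(static_cast<int8_t>(std::clamp(code, -128, 127)));
    }
    return q;
}

float dequantize(int8_t v, const Quantized& q) {
    return (static_cast<int>(v) - q.zeroPoint) * q.scale;
}
```

The round-trip error is bounded by one quantization step (the scale), which is why the per-tensor runtime scale matters for Transformer activations with wide dynamic range.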

The computation is split into three parts: (1) offline‑precomputable terms, (2) online terms, and (3) terms computed inside the matrix‑multiply kernel. The article explains how these terms are arranged and combined.
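The three-way split follows from expanding the de-quantized dot product algebraically. Assuming the standard asymmetric int8 formulation (this derivation is inferred from the struct fields `srcKernelSum` and `weightQuanBias` quoted later, not taken verbatim from the article), one dot product decomposes as:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch: split the de-quantized dot product
//   c = sx*sw * sum_l (xq[l] - x0) * (wq[l] - w0)
// into a kernel term (sum xq*wq), an online input sum (sum xq),
// and offline-precomputable weight terms (sum wq and L*x0*w0).
float dotDequantSplit(const std::vector<int8_t>& xq, int x0, float sx,
                      const std::vector<int8_t>& wq, int w0, float sw) {
    int64_t acc = 0, xSum = 0, wSum = 0;
    for (size_t l = 0; l < xq.size(); ++l) {
        acc  += int32_t(xq[l]) * int32_t(wq[l]); // computed inside the GEMM kernel
        xSum += xq[l];                           // online term (cf. srcKernelSum)
        wSum += wq[l];                           // offline-precomputable per channel
    }
    int64_t L = static_cast<int64_t>(xq.size());
    // c = sx*sw * (sum xq*wq - w0*sum xq - x0*sum wq + L*x0*w0)
    return sx * sw * float(acc - w0 * xSum - x0 * wSum + L * x0 * w0);
}
```

Because the weight sum and the L·x0·w0 term depend only on the (static) weights and zero points, they can be folded in offline; only the input sum must be recomputed per inference.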

Matrix multiplication on the MNN CPU backend uses block‑wise computation to reduce cache misses. Input and weight tensors are partitioned into blocks (EP, LP, HP) and stored in memory in a specific order. The smallest compute unit is (EP, 4·LP) × (HP, 4·LP) producing an (EP, HP) output sub‑block.
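The smallest compute unit above can be sketched as a scalar micro-kernel: an (EP, 4·LP) input tile against an (HP, 4·LP) weight tile, accumulating an (EP, HP) int32 sub-block. The tile sizes and packing here are illustrative, not MNN's actual layout code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int EP = 4, HP = 4, LP = 4;
constexpr int K  = 4 * LP;  // reduction length of the smallest compute unit

// Multiply an (EP, K) input tile by an (HP, K) weight tile,
// both packed contiguously so the inner loop walks sequential memory,
// producing an (EP, HP) int32 accumulator sub-block.
void microTileGemm(const int8_t* a,   // (EP, K) input tile
                   const int8_t* b,   // (HP, K) weight tile
                   int32_t* c) {      // (EP, HP) output sub-block
    for (int e = 0; e < EP; ++e) {
        for (int h = 0; h < HP; ++h) {
            int32_t acc = 0;
            for (int k = 0; k < K; ++k) {
                acc += int32_t(a[e * K + k]) * int32_t(b[h * K + k]);
            }
            c[e * HP + h] = acc;
        }
    }
}
```

Packing both operands so the reduction axis is contiguous is what keeps the inner loop cache-friendly; the real kernels replace the scalar loop with SIMD instructions.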

CPU‑specific optimizations are discussed: Armv8.6 supports SMMLA, Armv8.2 supports SDOT, and x86_64 can use AVX‑512 VNNI. Corresponding block sizes (e.g., EP=10, LP=8, HP=8 for Armv8.6) are chosen to maximize register utilization and cache efficiency.
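For intuition, a single SDOT lane (the Armv8.2 dot-product instruction mentioned above) computes four int8×int8 products accumulated into one int32 lane; SMMLA extends this to a small int8 matrix multiply-accumulate. The following is a scalar behavioral sketch of one SDOT lane, not the intrinsic itself:

```cpp
#include <cassert>
#include <cstdint>

// Behavioral sketch of one SDOT lane: four int8 x int8 products
// accumulated into a single int32 lane. A real kernel issues this
// across all vector lanes at once via the SDOT instruction.
int32_t sdotLane(int32_t acc, const int8_t a[4], const int8_t b[4]) {
    for (int i = 0; i < 4; ++i) {
        acc += int32_t(a[i]) * int32_t(b[i]);
    }
    return acc;
}
```

This is why LP is tied to the hardware: the 4-wide int8 accumulation step dictates how the reduction axis must be packed for each instruction set.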

Weight group quantization is introduced to improve accuracy. By splitting the reduction axis L into m groups, the number of quantization parameters grows to OC·m·2. These parameters are carried to the kernel in the following struct:

```cpp
struct QuanPostTreatParameters {
    const float* scale;
    const float* biasFloat;
    int32_t maxValue;
    int32_t minValue;
    int32_t useInt8 = 1;
    float roundValuePos = 0.5f;
    float roundValueNeg = -0.5f;
    float* srcKernelSum;
    float* weightQuanBias;
    float* fp32minmax;
    ssize_t blockNum = 1;
    const int32_t* bias;
    const float* extraScale = nullptr;
    const float* extraBias = nullptr;
};
```
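How the per-group parameters enter the reduction can be sketched as follows: the weight for group g is de-quantized as scale[g]·wq + bias[g], so each group contributes its own scaled partial sum. Names and layout here are illustrative, not MNN's internal code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch of group-wise (block-wise) weight de-quantization: the reduction
// axis L is split into m groups, each with its own weight scale and bias,
// so each output channel carries m*2 parameters instead of 2.
float dotGroupQuant(const std::vector<int8_t>& x,    // quantized input, length L
                    const std::vector<int8_t>& w,    // quantized weights, length L
                    const std::vector<float>& scale, // per-group weight scale, size m
                    const std::vector<float>& bias,  // per-group weight bias, size m
                    float inputScale) {
    const size_t m = scale.size();
    const size_t groupLen = x.size() / m;
    float out = 0.0f;
    for (size_t g = 0; g < m; ++g) {
        int32_t acc = 0, xSum = 0;
        for (size_t i = g * groupLen; i < (g + 1) * groupLen; ++i) {
            acc  += int32_t(x[i]) * int32_t(w[i]);
            xSum += x[i];
        }
        // w_float = scale[g]*wq + bias[g], x_float = inputScale*xq, hence:
        out += inputScale * (scale[g] * float(acc) + bias[g] * float(xSum));
    }
    return out;
}
```

Smaller groups track local weight ranges more closely (better accuracy), at the cost of more scale/bias lookups per dot product.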

During LLM prefill, batch-wise quantization is applied: each sequence in the batch gets its own de-quantization scale, stored in extraScale. The kernel multiplies the weight scale by the batch scale whenever extraScale is non-null.
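The per-batch scale fold-in described above can be sketched as a post-processing pass over the int32 accumulators. The function name and layout are hypothetical; only the extraScale null-check behavior mirrors the article's description.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Sketch: convert int32 accumulators to float outputs, folding the
// per-batch input scale (extraScale) into the per-channel weight scale
// when the pointer is non-null, as the kernel does during prefill.
void applyScales(const int* acc, float* out, size_t batch, size_t oc,
                 const float* weightScale,            // per output channel
                 const float* extraScale /*nullable, per batch*/) {
    for (size_t b = 0; b < batch; ++b) {
        for (size_t c = 0; c < oc; ++c) {
            float s = weightScale[c];
            if (extraScale != nullptr) {
                s *= extraScale[b];   // per-sequence de-quantization scale
            }
            out[b * oc + c] = s * float(acc[b * oc + c]);
        }
    }
}
```

Keeping extraScale nullable lets the same kernel serve decode (single shared scale) and prefill (one scale per sequence) without a separate code path.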

Code snippets show how the CPU backend retrieves the int8 function pointers and block sizes:

```cpp
auto core = static_cast<CPUBackend*>(backend())->int8Functions();
core->MNNGetGemmUnit(&HP, &LP, &EP);
```

and how the dynamic-quantization executor selects the appropriate kernel (Int8GemmKernel, Int8GemmKernel_W4, etc.).

Performance evaluations on a Snapdragon 8 Gen 3 device demonstrate significant speed‑ups for both LLM (Qwen2‑1.5B) and CV models (MobileNetV3/V2, ResNet‑50, YOLOv4) when using dynamic quantization, with acceleration ratios ranging from ~1.1× to >3× compared to full‑precision inference.

Tags: performance, CPU optimization, MNN, dynamic quantization, Int8, LLM inference, matrix multiplication
Written by DaTaobao Tech, the official account of DaTaobao Technology.