How MNN’s Sparse Computing Boosts Mobile AI Inference Performance
This article details the design and implementation of sparse computation in Alibaba’s MNN inference engine, covering weight sparsity techniques, block‑sparse layouts, performance benchmarks on MobileNet models versus XNNPack, and real‑world deployment cases that demonstrate significant speedups and memory savings on mobile CPUs.
1. Sparse Layout and Acceleration Principle
In the inference engine field, Alibaba’s Mobile Neural Network (MNN) has become a leading solution. To push performance further, MNN combines deep‑learning model design with sparse‑computing techniques from scientific and high‑performance computing. Sparsity means that a share of a matrix’s elements are zero and can be skipped. After sparsification, however, the non‑zero elements are no longer contiguous in memory, so dense GEMM kernels cannot be reused directly and cache hit rates fall. Sparse computation therefore needs its own methods to run faster than the dense equivalent.
1.1 Adjustable Block Sparse Weights
Sparsity in deep neural networks can occur in inputs, outputs, and weights. MNN focuses on general weight sparsity, which for deep‑learning models typically falls between 30% and 90%; the extreme sparsity levels found in scientific computing are out of scope. Rather than fully structured channel pruning, the design combines random sparsity with block sparsity.
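As a rough illustration of block sparsity, the sketch below zeroes the lowest‑magnitude 1×4 groups of a weight matrix until a target sparsity is reached. The function name `prune_blocks`, the L1‑norm scoring rule, and the orientation of the block along a row are assumptions made for this example; they are not taken from mnncompress.

```python
def prune_blocks(weights, block=4, sparsity=0.5):
    """Zero the lowest-magnitude 1 x `block` groups (hypothetical helper)."""
    rows, cols = len(weights), len(weights[0])
    assert cols % block == 0, "column count must be a multiple of the block size"

    # Score every block by the sum of absolute values (an assumed criterion).
    scored = []
    for r in range(rows):
        for c in range(0, cols, block):
            score = sum(abs(v) for v in weights[r][c:c + block])
            scored.append((score, r, c))
    scored.sort()  # weakest blocks first

    # Zero the weakest blocks in a copy, leaving the input untouched.
    n_prune = int(len(scored) * sparsity)
    pruned = [row[:] for row in weights]
    for _, r, c in scored[:n_prune]:
        for k in range(block):
            pruned[r][c + k] = 0.0
    return pruned


weights = [
    [0.9, -0.8, 0.7, 0.6, 0.01, 0.02, -0.01, 0.0],
    [0.05, 0.0, -0.02, 0.01, 1.0, -1.2, 0.8, 0.5],
]
pruned = prune_blocks(weights, block=4, sparsity=0.5)
```

Setting `block=1` in this sketch degenerates to unstructured (random) sparsity, which is why a single adjustable block size can cover both cases.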
1.2 Weight Matrix Compression Format
MNN supports multiple weight‑matrix layouts. The original dense weight matrix is shown in Fig 3. For random sparsity, the layout in Fig 4 is used; for block sparsity, Fig 5 is applied. Non‑zero data are stored together with row‑start and column‑index arrays, while zero elements are omitted, reducing memory usage.
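The row‑start and column‑index arrays described above resemble the classic compressed sparse row (CSR) scheme. The following sketch builds such arrays for a small matrix; it is a generic CSR‑style illustration, and the name `compress_block_csr` and the exact array contents are assumptions, since MNN’s real layout (Fig 4 and Fig 5) may differ in detail.

```python
def compress_block_csr(dense, block=1):
    """Keep only non-zero 1 x `block` groups of each row.

    Returns (values, col_index, row_start):
      values     - kept weights, `block` entries per kept group
      col_index  - starting column of each kept group
      row_start  - row_start[r]..row_start[r+1] spans the groups of row r
    """
    values, col_index, row_start = [], [], [0]
    for row in dense:
        for c in range(0, len(row), block):
            group = row[c:c + block]
            if any(v != 0 for v in group):
                values.extend(group)
                col_index.append(c)
        row_start.append(len(col_index))
    return values, col_index, row_start


dense = [
    [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 6.0, 7.0],
]
random_layout = compress_block_csr(dense, block=1)  # element-wise sparsity
block_layout = compress_block_csr(dense, block=4)   # 1x4 block sparsity
```

In the block layout a zero inside a kept group is still stored, but one column index now covers four weights, so index overhead drops to a quarter of the element‑wise case.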
2. MNN Sparse Computing Scheme Design
2.1 Inference Framework Layer Design
The sparse solution is integrated into the existing MNN architecture, with design considerations at five stages: (1) the algorithm‑model stage, (2) the sparse‑training stage using the mnncompress tool, (3) the model‑conversion stage, (4) the MNN engine inference stage, and (5) testing across dense and sparse cases with varied block sizes, sparsity levels, and backends.
2.2 MNN Sparse Computing Architecture
The overall pipeline consists of four loosely coupled stages: sparse training, parameter conversion, operator framework, and backend kernels. UML diagrams (container layer and component layer) illustrate the modular structure, enabling easy extension and integration.
2.3 Operator and Backend Implementation
Operator registration does not introduce a new “sparse convolution” op; the existing convolution op is reused, allowing flexible selection between sparse and dense execution.
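One way to picture this reuse is a single convolution entry point that inspects the weights and picks a path. The sketch below is purely illustrative: the names `weight_sparsity` and `select_conv_path` and the 0.3 default threshold are assumptions for this example, not MNN’s actual selection logic.

```python
def weight_sparsity(weights):
    """Fraction of weight entries that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for v in row if v == 0)
    return zeros / total


def select_conv_path(weights, threshold=0.3):
    """Pick an execution path for one convolution (hypothetical rule).

    `threshold` is an assumed break-even sparsity; a real engine would
    tune it per backend and block size.
    """
    return "sparse" if weight_sparsity(weights) >= threshold else "dense"


sparse_choice = select_conv_path([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]])
dense_choice = select_conv_path([[1.0, 2.0, 3.0, 4.0]])
```

Keeping the decision inside the existing op means a model file needs no new operator type; a dense model simply never takes the sparse branch.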
At the operator level, dense convolution is split into two layers to separate reusable and extensible parts, enabling sparse‑compute compression.
Quantized sparse operators are built on top of ConvInt8Tiled with similar methodology.
Backend kernels are implemented in assembly for six platform and precision combinations: ARM32 fp32 and int8, ARM64 fp32 and int8, x86 AVX2 fp32, and x86 AVX‑512 fp32.
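The production kernels are assembly, but the arithmetic they perform can be written as a plain reference loop. The sketch below multiplies a block‑compressed weight matrix by a dense input, assuming a CSR‑style layout of `values`, `col_index`, and `row_start` arrays with 1×`block` groups; the function name `sparse_matmul` and the layout are assumptions for illustration, not MNN’s kernel interface.

```python
def sparse_matmul(values, col_index, row_start, block, x):
    """Compute W @ x where W is stored as non-zero 1 x `block` groups.

    x is a dense matrix given as a list of rows; only the weight groups
    that survived pruning are visited, which is where the time is saved.
    """
    n_cols = len(x[0])
    out = []
    for r in range(len(row_start) - 1):
        acc = [0.0] * n_cols
        for b in range(row_start[r], row_start[r + 1]):
            c0 = col_index[b]                 # first input row used by this group
            for k in range(block):
                w = values[b * block + k]
                x_row = x[c0 + k]
                for n in range(n_cols):
                    acc[n] += w * x_row[n]
        out.append(acc)
    return out


# A 3x8 weight matrix with two kept 1x4 groups (rows 0 and 2).
values = [1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 6.0, 7.0]
col_index = [0, 4]
row_start = [0, 1, 1, 2]
x = [[float(i)] for i in range(1, 9)]  # 8x1 input column: 1..8
result = sparse_matmul(values, col_index, row_start, 4, x)
```

With a fixed block size the inner loop over `k` has a constant trip count, which is what lets a SIMD kernel load one index and process a whole group at once.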
Testing covers all dense cases plus sparse block dimensions, verifying correctness across backends.
3. Sparse Computing Performance Evaluation
3.1 Typical Model Sparse Acceleration
Four dimensions are evaluated: sparsity, block size, CPU model, and model type. Results (Fig 8) show inference latency decreasing linearly as sparsity increases. On a Qualcomm Snapdragon 835 device (Xiaomi Mi 6), a 1×4 block at 90% sparsity yields up to 3.71× speedup for MobileNet V2 and up to 4.13× for other models. MNN also achieves higher acceleration ratios than XNNPack (e.g., MobileNet V1: 2.57‑3.96× for MNN versus 2.35× for XNNPack).
On the Xiaomi Mi 6, the 1×4 block layout already reaches the acceleration threshold, the point at which sparse execution starts to beat dense, at a sparsity of 0.1; at 0.9 sparsity, the speedup can reach 4.13×.
Inference time scales linearly with sparsity across models and CPUs.
Memory usage drops proportionally with sparsity (see Fig 9).
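A back‑of‑the‑envelope estimate shows why the memory saving tracks sparsity. The sketch below counts bytes for a CSR‑style block layout; the 4‑byte index width, the fp32 values, and the function names are assumptions for this example, not measured MNN figures.

```python
def dense_weight_bytes(rows, cols, value_bytes=4):
    """Bytes needed to store every weight of a dense matrix."""
    return rows * cols * value_bytes


def sparse_weight_bytes(rows, cols, sparsity, block=4,
                        value_bytes=4, index_bytes=4):
    """Estimated bytes for kept 1 x `block` groups plus index arrays.

    Assumes one column index per kept group and one row-start entry
    per row (plus one terminator).
    """
    total_blocks = rows * cols // block
    kept = round(total_blocks * (1 - sparsity))
    data = kept * block * value_bytes
    indices = kept * index_bytes + (rows + 1) * index_bytes
    return data + indices


dense_size = dense_weight_bytes(256, 256)
sparse_size = sparse_weight_bytes(256, 256, sparsity=0.75)
unpruned_size = sparse_weight_bytes(256, 256, sparsity=0.0)
```

Under these assumptions the index arrays add a fixed overhead per kept group, so the sparse format is larger than dense when nothing is pruned and shrinks roughly linearly as sparsity grows.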
4. Business Model Practice
4.1 Image Super‑Resolution Service
The workflow includes: (1) training the super‑resolution model, (2) applying mnncompress with sparsity parameters, (3) converting the model with the MNN converter, and (4) deploying via the MNN Workbench. At 45% sparsity, inference is about 1.2× faster, while PSNR drops only from 34.728 dB to 34.502 dB, an acceptable loss.
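For readers unfamiliar with the quality metric, PSNR is derived from the mean squared error between a reference image and the model output. The sketch below uses the standard definition on flat pixel lists; the function name `psnr` and the 8‑bit peak value of 255 are assumptions for this example, and the evaluation pipeline behind the reported numbers is not described in the source.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


# Quality cost of the 45% sparse model, using the figures reported above.
psnr_drop_db = round(34.728 - 34.502, 3)
```

The reported drop works out to 0.226 dB, which is the loss being traded for the roughly 1.2× speedup.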
4.2 Speech Model
For a speech model evaluated on AVX‑512, a sparsity of 0.75 speeds up encoder 1 by 2.82× and encoder 2 by 3.02×, in line with the expected performance gains.
5. Summary and Outlook
We have designed a generic sparse convolution scheme for MNN that outperforms XNNPack, achieving 3.16‑4.13× speedup on ARM devices at 90% sparsity across various models and CPUs. Accuracy loss is limited and acceptable in real‑world scenarios. Inference latency decreases linearly with sparsity, and memory consumption reduces proportionally. Ongoing work continues to refine kernel assembly, data layouts, and SIMD optimizations, further advancing MNN’s sparse inference capabilities for both mobile and server environments.
