MNN Stable Diffusion: On‑Device Deployment and Performance Optimizations
The article presents Alibaba’s open‑source MNN inference engine, demonstrating how operator fusion (fused multi‑head attention, GroupNorm/SplitGeLU, Winograd convolutions), optimized GEMM/BatchGemm kernels, and chunked (paged) attention for memory savings enable on‑device Stable Diffusion at roughly 1–2 seconds per denoising step on Snapdragon 8 Gen3 and Apple M3 GPUs, and outlines future speed‑up directions.
With the release of ChatGPT and Stable Diffusion, generative AI has become a major trend worldwide. Stable Diffusion, a 1‑billion‑parameter model, is traditionally run on GPUs.
Advances in quantization, pruning, and the growing compute, bandwidth, and memory of mobile devices now make it feasible to deploy such large models on terminals, protecting user privacy and enabling on‑the‑fly content generation.
This article introduces the MNN deep‑learning inference engine’s open‑source Stable Diffusion application, providing the source code (https://github.com/alibaba/MNN) and usage guide (https://mnn-docs.readthedocs.io/en/latest/transformers/diffusion.html).
Accelerating Stable Diffusion can follow two directions: algorithmic improvements (network redesign, fewer inference steps) and engineering optimizations (quantization, efficient operators). MNN focuses on the latter and shares its GPU‑side performance and memory enhancements.
Self‑Attention optimization: A typical Attention block contains three Linear layers (for Q, K, and V) plus many shape‑changing ops. MNN concatenates the three Linear weights into a single matrix, enlarging the GEMM, and merges the whole Attention computation into a single Fused‑MultiHead‑Attention kernel, reducing 19 ops to 2 and cutting memory traffic.
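A minimal numpy sketch of the QKV‑fusion idea (illustrative only, not MNN’s GPU code): three separate projections become one larger GEMM against the concatenated weight, followed by a cheap split.

```python
import numpy as np

def qkv_separate(x, wq, wk, wv):
    # Baseline: three GEMMs, one per projection.
    return x @ wq, x @ wk, x @ wv

def qkv_fused(x, w_qkv):
    # Fused: a single, larger GEMM, then a split into Q/K/V.
    qkv = x @ w_qkv                      # (seq, 3*dim)
    return np.split(qkv, 3, axis=-1)

rng = np.random.default_rng(0)
seq, dim = 8, 16
x = rng.standard_normal((seq, dim))
wq, wk, wv = (rng.standard_normal((dim, dim)) for _ in range(3))

q0, k0, v0 = qkv_separate(x, wq, wk, wv)
q1, k1, v1 = qkv_fused(x, np.concatenate([wq, wk, wv], axis=1))
assert np.allclose(q0, q1) and np.allclose(k0, k1) and np.allclose(v0, v1)
```

The fused path does the same arithmetic, but as one kernel launch over a larger matrix, which is what improves GPU utilization.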
GroupNorm/SplitGeLU fusion: In ResNet blocks, GroupNorm is implemented as InstanceNorm followed by separate mul/add ops for γ/β, then SiLU. MNN fuses the Broadcast‑Binary, GroupNorm, and SiLU ops into one kernel, and similarly fuses the GEGLU feed‑forward sub‑graph into a single SplitGeLU kernel, eliminating numerous small kernel launches.
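The two fused patterns can be written as numpy references (a sketch of the math, not MNN’s kernels): `group_norm_silu` does normalization, the γ/β affine, and SiLU in one pass; `split_gelu` implements the GEGLU pattern of splitting channels in half and computing `x * GELU(gate)`.

```python
import numpy as np

def group_norm_silu(x, gamma, beta, groups, eps=1e-5):
    # x: (N, C, H*W); normalize per group, apply affine, then SiLU.
    n, c, hw = x.shape
    g = x.reshape(n, groups, c // groups * hw)
    mean = g.mean(axis=-1, keepdims=True)
    var = g.var(axis=-1, keepdims=True)
    y = ((g - mean) / np.sqrt(var + eps)).reshape(n, c, hw)
    y = y * gamma[None, :, None] + beta[None, :, None]  # γ/β affine
    return y / (1.0 + np.exp(-y))  # SiLU: y * sigmoid(y)

def split_gelu(x):
    # GEGLU: split channels in half, gate one half with GELU of the other.
    a, gate = np.split(x, 2, axis=-1)
    g = 0.5 * gate * (1.0 + np.tanh(0.7978845608 * (gate + 0.044715 * gate**3)))
    return a * g  # tanh-approximation GELU, common on GPU

x = np.random.default_rng(3).standard_normal((2, 8, 16))  # (N, C, H*W)
y = group_norm_silu(x, np.ones(8), np.zeros(8), groups=4)
z = split_gelu(np.random.default_rng(4).standard_normal((2, 8)))
```

In MNN each of these is one GPU kernel, so every intermediate (mean, variance, normalized value, gated half) lives in registers instead of round‑tripping through global memory.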
Conv‑Winograd implementation: Stable Diffusion contains many 3×3 convolutions. MNN adopts the F(2,3) Winograd algorithm, which replaces the 36 multiplications of a 3×3 convolution over a 2×2 output tile with 16 (a 2.25× reduction) at moderate memory overhead, as illustrated in the accompanying table.
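For intuition, here is the 1‑D analogue F(2,3) in numpy: 4 multiplications instead of 6 produce 2 outputs of a 3‑tap convolution (the 2‑D F(2×2,3×3) case MNN uses needs 16 instead of 36). The transform matrices are the standard Winograd ones; this is a reference sketch, not MNN code.

```python
import numpy as np

# Standard F(2,3) transform matrices.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)   # input transform
G  = np.array([[1.0, 0.0, 0.0],
               [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0, 0.0, 1.0]])                # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)    # output transform

def winograd_f23(d, g):
    """d: input tile of 4 samples, g: 3-tap filter -> 2 outputs."""
    return AT @ ((G @ g) * (BT @ d))  # only 4 elementwise multiplies

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 1.0, -1.0])
direct = np.array([np.dot(d[i:i + 3], g) for i in range(2)])  # naive conv
assert np.allclose(winograd_f23(d, g), direct)
```

The multiplication savings come at the cost of extra adds and transformed-tensor storage, which is the memory overhead trade‑off mentioned above.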
High‑performance GEMM/BatchGemm: Matrix multiplication is the core bottleneck. MNN applies block‑wise partitioning and auto‑tuning of parameters (OPWM, OPWN, OPTM, OPTN, VEC_M, VEC_N) to maximize cache reuse and computational intensity.
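The block‑wise partitioning idea can be sketched as follows. This is a hypothetical numpy illustration: the tile sizes `tm`/`tn`/`tk` stand in for the workgroup/thread tile tunables (OPWM, OPWN, etc.) that MNN auto‑tunes per device; on a GPU each block’s operands would be staged in local/shared memory.

```python
import numpy as np

def blocked_gemm(a, b, tm=4, tn=4, tk=8):
    # Tile C into (tm x tn) output blocks and accumulate over K in
    # chunks of tk, so each block's inputs fit in fast memory.
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n))
    for i0 in range(0, m, tm):           # output row blocks
        for j0 in range(0, n, tn):       # output column blocks
            for p0 in range(0, k, tk):   # accumulate over K blocks
                c[i0:i0 + tm, j0:j0 + tn] += (
                    a[i0:i0 + tm, p0:p0 + tk] @ b[p0:p0 + tk, j0:j0 + tn]
                )
    return c

rng = np.random.default_rng(1)
a, b = rng.standard_normal((12, 16)), rng.standard_normal((16, 8))
assert np.allclose(blocked_gemm(a, b), a @ b)
```

Auto‑tuning simply benchmarks candidate tile sizes on the target GPU and keeps the fastest, since the optimal blocking varies with cache/register budgets across devices.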
Memory‑usage optimization: The QK intermediate tensor in Attention can reach ~1 GB for batch 2, 16 heads, and seq‑len 4096 (fp16). MNN splits the Attention computation into multiple chunks (paged attention), reducing peak memory by the number of splits.
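The chunking idea in numpy form (a reference sketch; MNN’s GPU kernels differ): attention is computed over query blocks, so the QKᵀ intermediate only ever holds a `(chunk, seq)` slice of scores rather than the full `(seq, seq)` matrix.

```python
import numpy as np

def chunked_attention(q, k, v, chunk=512):
    seq, dim = q.shape
    out = np.empty_like(q)
    for s in range(0, seq, chunk):
        scores = q[s:s + chunk] @ k.T / np.sqrt(dim)   # (chunk, seq) only
        scores -= scores.max(axis=-1, keepdims=True)   # stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[s:s + chunk] = w @ v
    return out

rng = np.random.default_rng(2)
q, k, v = (rng.standard_normal((64, 8)) for _ in range(3))
full = chunked_attention(q, k, v, chunk=64)  # one chunk = unchunked result
assert np.allclose(chunked_attention(q, k, v, chunk=16), full)
```

Since each query row’s softmax still sees all keys, the result is bit‑for‑bit equivalent to unchunked attention; only the peak size of the score buffer shrinks, by the number of splits.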
Performance evaluation: On a Snapdragon 8 Gen3 GPU (float16), MNN achieves 2 s per iteration (20 steps → 40 s per 512×512 image). On an Apple Mac M3 GPU (float32) it reaches 1.1 s per iteration (≈22 s per image), outperforming stable‑diffusion.cpp and Android ONNX runtimes.
Future research: Explore larger Winograd tiles, image‑based GEMM memory layouts, Flash‑Attention, low‑bit weight quantization (int8/int4), and dynamic memory reuse for further speed and memory gains.
DaTaobao Tech
Official account of DaTaobao Technology