MNN-Transformer: Efficient On‑Device Large Language and Diffusion Model Deployment
MNN‑Transformer is an end‑to‑end framework for running large language and diffusion models efficiently on modern smartphones. It covers model export, quantization (including dynamic int4/int8 quantization and KV cache compression), and execution through a plugin‑and‑engine runtime, achieving up to 35 tokens/s of decoding speed and 2‑3× faster image generation than existing on‑device solutions.
With the continuous growth of compute, memory, and storage on mobile devices, deploying large models on‑device has become feasible. Running models locally eliminates network latency, reduces server‑side compute cost, and protects user privacy.
Overview
MNN‑Transformer (MNN‑LLM / MNN‑Diffusion) is an end‑to‑end framework that supports large language models (LLM) and text‑to‑image diffusion models on mobile. It consists of three parts: an export tool, a quantization tool, and a plugin‑engine runtime.
Key Features
Supports a wide range of LLM and diffusion models, multi‑LoRA loading, and runs on any post‑2020 smartphone without requiring a vendor‑specific NPU.
Provides int4/int8 quantization and can spill excess model memory to disk to avoid out‑of‑memory errors.
Leverages recent ARM CPU instructions (sdot/smmla) and GPU features (recordable queues, SIMD‑group operations, GMemory) to achieve more than 35 tokens/s of decoding speed for a 1.8B‑parameter model on Snapdragon 8 Gen 1.
Offline Tools
The export tool converts PyTorch/TensorFlow models to the MNN format, including custom ONNX export scripts for large models. The quantization tool reduces model size with symmetric or asymmetric schemes at channel‑wise or block‑wise granularity, and supports GPTQ weight quantization.
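As an illustration of the channel‑wise symmetric scheme the quantization tool offers, here is a minimal NumPy sketch (the function names are illustrative, not MNN's actual API):

```python
import numpy as np

def quantize_weights_channelwise(w):
    """Symmetric, channel-wise int8 quantization (illustrative).

    w: float32 weights of shape (out_channels, in_channels).
    Returns (int8 weights, one float scale per output channel).
    """
    # One scale per output channel: map max |w| onto the int8 range [-127, 127].
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero channels
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 16)).astype(np.float32)
q, s = quantize_weights_channelwise(w)
max_err = np.abs(dequantize(q, s) - w).max()  # bounded by half a quantization step
```

A block‑wise variant would compute one scale per fixed‑size group of weights within a channel instead of one per channel, trading a little extra metadata for finer‑grained accuracy.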
Plugins & Engine
Attention: optimized Cross‑Attention/Self‑Attention operators.
KV Manager: manages KV cache for LLM, offering allocation, expansion, quantization and pre‑loading.
LoRA: enables multiple task‑specific adapters with minimal memory overhead.
Tokenizer: SentencePiece and Tiktoken support.
Embedding, Sampler, Engine: complete inference pipeline for LLM and diffusion.
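To illustrate why multiple LoRA adapters add little memory overhead: each adapter contributes only two small low‑rank matrices on top of the shared, frozen base weight. A hedged NumPy sketch (the names and scaling convention are illustrative, not MNN's plugin API):

```python
import numpy as np

def lora_forward(x, w, a, b, alpha=16.0):
    """y = W x + (alpha / r) * B (A x): frozen base weight plus one adapter.

    w: (out, in) frozen base weight; a: (r, in) and b: (out, r) adapter
    matrices. Switching tasks swaps only the small a/b pair; w stays shared.
    """
    r = a.shape[0]
    return w @ x + (alpha / r) * (b @ (a @ x))

rng = np.random.default_rng(0)
in_dim, out_dim, rank = 8, 4, 2
x = rng.standard_normal(in_dim)
w = rng.standard_normal((out_dim, in_dim))
a = rng.standard_normal((rank, in_dim))
b = np.zeros((out_dim, rank))  # B starts at zero, so a fresh adapter is a no-op
y = lora_forward(x, w, a, b)
```

Here the adapter stores 2×8 + 4×2 = 24 values against 32 in the base weight; at real model sizes the ratio is far smaller, which is why many task‑specific adapters can be resident at once.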
Dynamic Quantization
For models where offline (pre‑computed) activation quantization is impractical, MNN gathers per‑batch input statistics to compute scale and bias on the fly, enabling int4/int8 weight computation with negligible accuracy loss.
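The idea can be sketched as follows: derive an activation scale from each batch's statistics at run time, run the matrix multiply in int8, then rescale the result. A minimal NumPy illustration of the technique, not MNN's kernel code:

```python
import numpy as np

def dynamic_quant_matmul(x, q_w, w_scale):
    """Dynamically quantize activations, multiply in int8, rescale to float.

    x: (batch, in) float32 activations; q_w: (out, in) int8 weights;
    w_scale: (out, 1) per-channel weight scales.
    """
    # Per-row activation statistics gathered at run time (the "dynamic" part).
    x_scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    x_scale = np.where(x_scale == 0, 1.0, x_scale)
    q_x = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)
    # Accumulate in int32, then fold both scales back in.
    acc = q_x.astype(np.int32) @ q_w.astype(np.int32).T
    return acc.astype(np.float32) * x_scale * w_scale.T

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 64)).astype(np.float32)
w = rng.standard_normal((16, 64)).astype(np.float32)
w_scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
q_w = np.clip(np.round(w / w_scale), -127, 127).astype(np.int8)
y = dynamic_quant_matmul(x, q_w, w_scale)
ref = x @ w.T  # float reference; the int8 result should track it closely
```

On device, the int32 accumulation step is where instructions such as sdot/smmla apply.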
Memory‑Mapping (mmap)
When the model’s memory footprint collides with that of other modules, MNN can map the model’s memory to disk, freeing RAM while keeping execution speed stable, since the model and the other modules rarely run at the same time.
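The underlying OS mechanism can be shown with a memory‑mapped weight file: the data stays on disk and is paged in on demand, so resident RAM can be reclaimed under pressure. A small Python sketch of the general technique (not MNN's internal implementation):

```python
import numpy as np, os, tempfile

# Write "model weights" to disk once, then map them instead of loading into RAM.
weights = np.arange(1024, dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
weights.tofile(path)

# np.memmap keeps the data on disk; the OS pages it in on demand and may
# evict the pages under memory pressure -- the effect an mmap mode relies on.
mapped = np.memmap(path, dtype=np.float32, mode="r")
total = float(mapped.sum())  # computation works as if the array were in RAM
```

Because pages are re‑read from disk only when touched again, speed stays stable as long as the mapped model and the memory‑hungry modules do not run concurrently.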
KV Quantization
Growth of the KV cache is mitigated by quantizing keys to int8 (with scaling aligned to the int8 matrix‑multiply path) and values to FP8 on CPU back‑ends.
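A minimal sketch of int8 key quantization, assuming one symmetric scale per cache block (MNN's exact cache layout and the FP8 value path are not shown):

```python
import numpy as np

def quantize_keys(k):
    """Symmetric int8 quantization of one key-cache block (illustrative).

    k: (num_tokens, head_dim) float32 keys. The single scale can be folded
    into an int8 Q.K^T matrix multiply, which is the alignment the int8
    key path exploits.
    """
    scale = float(max(np.abs(k).max() / 127.0, 1e-8))  # guard: non-zero scale
    q = np.clip(np.round(k / scale), -127, 127).astype(np.int8)
    return q, scale

k = np.random.default_rng(2).standard_normal((128, 64)).astype(np.float32)
q_k, k_scale = quantize_keys(k)
k_restored = q_k.astype(np.float32) * np.float32(k_scale)

# int8 storage is 4x smaller than the fp32 cache it replaces.
bytes_fp32, bytes_int8 = k.nbytes, q_k.nbytes
```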
Performance Evaluation
Benchmarks on a Xiaomi Mi 14 (Android) and on iOS devices show MNN‑LLM decoding 20‑50% faster and prefilling more than 2× faster than competing solutions. On GPU, MNN‑LLM exceeds other frameworks by more than 30% on small models.
Sample benchmark table (excerpt; each cell shows prefill / decode speed in tokens per second):

Model | Backend | Prompt 64 | Prompt 256 | Prompt 1024
MNN‑LLM | CPU (4 threads) | 225.19 / 53.65 | 298.89 / 52.45 | 236.76 / 45.70
MNN‑LLM | GPU (OpenCL) | 153.42 / 26.92 | 166.82 / 26.33 | 237.31 / 20.87

Comparison with Other On‑Device LLMs
Compared against llama.cpp, MLC‑LLM and FastLLM on Qwen2‑1.5B, Qwen2‑7B and Llama3‑8B, MNN‑LLM consistently shows higher prefill speed and more stable GPU output.
MNN‑Diffusion
For on‑device diffusion, MNN‑Diffusion outperforms stable‑diffusion.cpp and ONNX Runtime by up to 3× on both Android and macOS, generating a 512×512 image in ~2 s on GPU versus more than a minute with stable‑diffusion.cpp.
Examples
Interactive LLM chat and multimodal image‑generation demos are provided (see images in the original article).
Conclusion
MNN‑Transformer demonstrates that high‑performance, memory‑efficient large‑model inference is achievable on mainstream mobile hardware through dynamic quantization, KV cache optimization, and disk‑mapping techniques. Ongoing work will explore lower‑precision arithmetic and broader model support to further advance on‑device AI.
For more details, refer to the open‑source repository https://github.com/alibaba/MNN/ and the documentation https://mnn-docs.readthedocs.io/en/latest/transformers/llm.html .
DaTaobao Tech