MNN-Transformer: Efficient On‑Device Large Language and Diffusion Model Deployment
MNN‑Transformer is an end‑to‑end framework for running large language and diffusion models efficiently on modern smartphones. It covers model export, quantization (including dynamic int4/int8 quantization and KV cache compression), and execution through a plugin‑and‑engine runtime, achieving up to 35 tokens/s of decoding speed and 2‑3× faster image generation than existing on‑device solutions.
With the continuous growth of compute, memory, and storage on mobile devices, deploying large models on‑device has become feasible. Running models locally eliminates network latency, reduces server‑side compute cost, and protects user privacy.
Overview
MNN‑Transformer (MNN‑LLM / MNN‑Diffusion) is an end‑to‑end framework that supports large language models (LLM) and text‑to‑image diffusion models on mobile. It consists of three parts: an export tool, a quantization tool, and a plugin‑engine runtime.
Key Features
Supports a wide range of LLM and diffusion models, multi‑LoRA loading, and runs on any post‑2020 smartphone without requiring a vendor‑specific NPU.
Provides int4/int8 quantization and can spill excess model memory to disk to avoid out‑of‑memory errors.
Leverages recent ARM CPU instructions (sdot/smmla) and GPU features (recordable queues, SIMD‑group operations, GMemory) to achieve more than 35 tokens/s of decoding speed for a 1.8B‑parameter model on Snapdragon 8 Gen 1.
Offline Tools
The export tool converts PyTorch/TensorFlow models to the MNN format, including custom ONNX export scripts for large models. The quantization tool reduces model size with symmetric or asymmetric schemes at channel‑wise or block‑wise granularity, and supports GPTQ weight quantization.
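As an illustration of the channel‑wise symmetric scheme the quantization tool offers, here is a minimal NumPy sketch (the function names are illustrative, not MNN's actual API):

```python
import numpy as np

def quantize_weights_channelwise(w):
    """Symmetric, channel-wise int8 quantization (illustrative).

    w: float32 weights of shape (out_channels, in_channels).
    Returns (int8 weights, one float scale per output channel).
    """
    # One scale per output channel: map max |w| onto the int8 range [-127, 127].
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero channels
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 16)).astype(np.float32)
q, s = quantize_weights_channelwise(w)
max_err = np.abs(dequantize(q, s) - w).max()  # bounded by half a quantization step
```

A block‑wise variant would compute one scale per fixed‑size group of weights within a channel instead of one per channel, trading a little extra metadata for finer‑grained accuracy.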
Plugins & Engine
Attention: optimized Cross‑Attention/Self‑Attention operators.
KV Manager: manages KV cache for LLM, offering allocation, expansion, quantization and pre‑loading.
LoRA: enables multiple task‑specific adapters with minimal memory overhead.
Tokenizer: SentencePiece and Tiktoken support.
Embedding, Sampler, Engine: complete inference pipeline for LLM and diffusion.
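To illustrate why multiple LoRA adapters add little memory overhead: each adapter contributes only two small low‑rank matrices on top of the shared, frozen base weight. A hedged NumPy sketch (the names and scaling convention are illustrative, not MNN's plugin API):

```python
import numpy as np

def lora_forward(x, w, a, b, alpha=16.0):
    """y = W x + (alpha / r) * B (A x): frozen base weight plus one adapter.

    w: (out, in) frozen base weight; a: (r, in) and b: (out, r) adapter
    matrices. Switching tasks swaps only the small a/b pair; w stays shared.
    """
    r = a.shape[0]
    return w @ x + (alpha / r) * (b @ (a @ x))

rng = np.random.default_rng(0)
in_dim, out_dim, rank = 8, 4, 2
x = rng.standard_normal(in_dim)
w = rng.standard_normal((out_dim, in_dim))
a = rng.standard_normal((rank, in_dim))
b = np.zeros((out_dim, rank))  # B starts at zero, so a fresh adapter is a no-op
y = lora_forward(x, w, a, b)
```

Here the adapter stores 2×8 + 4×2 = 24 values against 32 in the base weight; at real model sizes the ratio is far smaller, which is why many task‑specific adapters can be resident at once.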
Dynamic Quantization
For models where offline (pre‑computed) activation quantization is impractical, MNN gathers per‑batch input statistics to compute scale and bias on the fly, enabling int4/int8 weight computation with negligible accuracy loss.
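The idea can be sketched as follows: derive an activation scale from each batch's statistics at run time, run the matrix multiply in int8, then rescale the result. A minimal NumPy illustration of the technique, not MNN's kernel code:

```python
import numpy as np

def dynamic_quant_matmul(x, q_w, w_scale):
    """Dynamically quantize activations, multiply in int8, rescale to float.

    x: (batch, in) float32 activations; q_w: (out, in) int8 weights;
    w_scale: (out, 1) per-channel weight scales.
    """
    # Per-row activation statistics gathered at run time (the "dynamic" part).
    x_scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    x_scale = np.where(x_scale == 0, 1.0, x_scale)
    q_x = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)
    # Accumulate in int32, then fold both scales back in.
    acc = q_x.astype(np.int32) @ q_w.astype(np.int32).T
    return acc.astype(np.float32) * x_scale * w_scale.T

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 64)).astype(np.float32)
w = rng.standard_normal((16, 64)).astype(np.float32)
w_scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
q_w = np.clip(np.round(w / w_scale), -127, 127).astype(np.int8)
y = dynamic_quant_matmul(x, q_w, w_scale)
ref = x @ w.T  # float reference; the int8 result should track it closely
```

On device, the int32 accumulation step is where instructions such as sdot/smmla apply.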
Memory‑Mapping (mmap)
When the model’s memory footprint collides with that of other modules, MNN can map the model’s memory to disk, freeing RAM while keeping execution speed stable, since the model and the other modules rarely run at the same time.
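The underlying OS mechanism can be shown with a memory‑mapped weight file: the data stays on disk and is paged in on demand, so resident RAM can be reclaimed under pressure. A small Python sketch of the general technique (not MNN's internal implementation):

```python
import numpy as np, os, tempfile

# Write "model weights" to disk once, then map them instead of loading into RAM.
weights = np.arange(1024, dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
weights.tofile(path)

# np.memmap keeps the data on disk; the OS pages it in on demand and may
# evict the pages under memory pressure -- the effect an mmap mode relies on.
mapped = np.memmap(path, dtype=np.float32, mode="r")
total = float(mapped.sum())  # computation works as if the array were in RAM
```

Because pages are re‑read from disk only when touched again, speed stays stable as long as the mapped model and the memory‑hungry modules do not run concurrently.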
KV Quantization
Growth of the KV cache is mitigated by quantizing keys to int8 (with scaling aligned to the int8 matrix‑multiply path) and values to FP8 on CPU back‑ends.
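A minimal sketch of int8 key quantization, assuming one symmetric scale per cache block (MNN's exact cache layout and the FP8 value path are not shown):

```python
import numpy as np

def quantize_keys(k):
    """Symmetric int8 quantization of one key-cache block (illustrative).

    k: (num_tokens, head_dim) float32 keys. The single scale can be folded
    into an int8 Q.K^T matrix multiply, which is the alignment the int8
    key path exploits.
    """
    scale = float(max(np.abs(k).max() / 127.0, 1e-8))  # guard: non-zero scale
    q = np.clip(np.round(k / scale), -127, 127).astype(np.int8)
    return q, scale

k = np.random.default_rng(2).standard_normal((128, 64)).astype(np.float32)
q_k, k_scale = quantize_keys(k)
k_restored = q_k.astype(np.float32) * np.float32(k_scale)

# int8 storage is 4x smaller than the fp32 cache it replaces.
bytes_fp32, bytes_int8 = k.nbytes, q_k.nbytes
```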
Performance Evaluation
Benchmarks on a Xiaomi Mi 14 (Android) and on iOS devices show MNN‑LLM decoding 20‑50% faster and prefilling more than 2× faster than competing solutions. On GPU, MNN‑LLM exceeds other frameworks by more than 30% on small models.
Sample benchmark table (excerpt; each cell shows prefill / decode speed in tokens per second):

Model | Backend | Prompt 64 | Prompt 256 | Prompt 1024
MNN‑LLM | CPU (4 threads) | 225.19 / 53.65 | 298.89 / 52.45 | 236.76 / 45.70
MNN‑LLM | GPU (OpenCL) | 153.42 / 26.92 | 166.82 / 26.33 | 237.31 / 20.87

Comparison with Other On‑Device LLMs
Compared against llama.cpp, MLC‑LLM and FastLLM on Qwen2‑1.5B, Qwen2‑7B and Llama3‑8B, MNN‑LLM consistently shows higher prefill speed and more stable GPU output.
MNN‑Diffusion
For on‑device diffusion, MNN‑Diffusion outperforms stable‑diffusion.cpp and ONNX Runtime by up to 3× on both Android and macOS, generating a 512×512 image in ~2 s on GPU versus more than a minute with stable‑diffusion.cpp.
Examples
Interactive LLM chat and multimodal image‑generation demos are provided (see images in the original article).
Conclusion
MNN‑Transformer demonstrates that high‑performance, memory‑efficient large‑model inference is achievable on mainstream mobile hardware through dynamic quantization, KV cache optimization, and disk‑mapping techniques. Ongoing work will explore lower‑precision arithmetic and broader model support to further advance on‑device AI.
For more details, refer to the open‑source repository https://github.com/alibaba/MNN/ and the documentation https://mnn-docs.readthedocs.io/en/latest/transformers/llm.html .
DaTaobao Tech