How MNN‑Sana‑Edit‑V2 Brings Comic‑Style Image Editing to Your Phone in 15 Seconds
MNN‑Sana‑Edit‑V2 is a collaborative effort between Taobao’s Meta team and Hangzhou University. It combines a frozen Qwen3‑0.6B LLM, a learnable query, a connector module, a linear DiT, and a deep compression autoencoder, all quantized to 4/8‑bit, to run fully on mobile devices, delivering 512×512 comic‑style conversions in about 15 seconds, roughly 2.5× faster than cloud alternatives. The project ships open‑source code, a detailed account of its training stages, and performance benchmarks.
Overview
MNN‑Sana‑Edit‑V2 is an on‑device image‑to‑image editing model that converts arbitrary images to a comic‑style rendering. It builds on the SANA high‑resolution diffusion architecture and incorporates a frozen Qwen3‑0.6B large language model (LLM) for prompt understanding. The model is fully open‑sourced (GitHub, HuggingFace, ModelScope) and runs locally on mobile devices.
Network Architecture
The backbone follows the SANA design. A frozen LLM is bridged to the diffusion network via a learnable query vector and a connector module that aligns the LLM’s semantic output to the DiT latent space.
Frozen LLM: Qwen3‑0.6B is kept frozen to provide robust textual encoding without additional parameter updates.
Learnable Query: A 256‑dimensional trainable vector is concatenated with the text embedding and fed to the LLM; the last N hidden states become the conditioning vector for diffusion.
Connector Module: A transformer‑based network extracts high‑level semantics from the LLM output, followed by a linear projector that maps these features to the DiT dimension.
Reference Image Encoder: A VAE encoder converts the input image into a latent representation that guides the diffusion process.
Noise Injection: Standard Gaussian noise is added to the diffusion pipeline.
DiT Module: Jointly denoises the noisy latent and the reference latent to produce the edited output.
Learnable Query Details
The query vector is initialized from a normal distribution and trained jointly with the connector. During inference it is concatenated with the text embedding, passed through the frozen LLM, and the resulting hidden states (typically N=256) serve as the conditioning for the diffusion model.
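As a concrete illustration, here is a minimal PyTorch sketch of this conditioning path. It is not the released code: the `QueryConditioner` class, the HuggingFace‑style `inputs_embeds`/`last_hidden_state` interface, and the 0.02 initialization scale are assumptions; only the frozen LLM, the 256 queries (matching Qwen3‑0.6B’s 1024‑wide hidden states), and the “last N hidden states” behavior come from the description above.

```python
import torch
import torch.nn as nn

class QueryConditioner(nn.Module):
    """Illustrative sketch, not the released code: a trainable query bank is
    appended to the text embedding and run through a frozen LLM."""

    def __init__(self, llm, num_queries=256, hidden_dim=1024):
        super().__init__()
        self.llm = llm                       # frozen Qwen3-0.6B backbone
        for p in self.llm.parameters():
            p.requires_grad = False
        # Query vectors initialized from a normal distribution (per the article);
        # the 0.02 scale is an assumption.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)

    def forward(self, text_embeds):
        # text_embeds: (batch, seq_len, hidden_dim)
        batch = text_embeds.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Concatenate the trainable queries after the text tokens.
        inputs = torch.cat([text_embeds, queries], dim=1)
        hidden = self.llm(inputs_embeds=inputs).last_hidden_state
        # The last N hidden states (N = number of queries) condition the DiT.
        return hidden[:, -queries.size(1):, :]
```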
Connector Module Details
The connector consists of a transformer that extracts semantic features from the LLM output and a linear projector that aligns these features to the DiT input space, enabling cross‑modal conditioning.
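A minimal sketch of such a connector, assuming a standard transformer encoder; the layer count, head count, and DiT width are placeholders, not the released configuration.

```python
import torch.nn as nn

class Connector(nn.Module):
    """Illustrative sketch: a small transformer refines the LLM's hidden
    states, then a linear projector maps them into the DiT width."""

    def __init__(self, llm_dim=1024, dit_dim=2240, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.projector = nn.Linear(llm_dim, dit_dim)

    def forward(self, llm_hidden):
        # llm_hidden: (batch, num_queries, llm_dim) from the frozen LLM
        return self.projector(self.transformer(llm_hidden))
```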
Deep Compression Autoencoder
Instead of the typical 8× compression, the model uses a 32× compression autoencoder (DC‑AE‑F32C32), drastically reducing the number of latent tokens, which speeds up training and inference and is essential for edge deployment.
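To make the saving concrete, the snippet below compares latent grid sizes at 512×512, assuming one token per latent position:

```python
# Latent token count for a 512x512 input, one token per latent position:
#   8x autoencoder:  (512 / 8)^2  = 4096 tokens
#   32x autoencoder: (512 / 32)^2 =  256 tokens  -> 16x fewer
for factor in (8, 32):
    side = 512 // factor
    print(f"F{factor}: {side}x{side} latent grid = {side * side} tokens")
```

A 32× autoencoder leaves 256 tokens instead of 4096, a 16× reduction that every subsequent DiT layer benefits from.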
Linear DiT and Mix‑FFN
Following SANA, attention layers are replaced with linear attention, reducing computational complexity from O(N²) to O(N). The Mix‑FFN augments the feed‑forward network with a depthwise 3×3 convolution and a Gated Linear Unit, and removes positional encoding (NoPE) to improve locality handling.
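The following sketch shows the idea behind both components, assuming a ReLU feature map for the linear attention (as in the SANA paper) and placeholder widths for the Mix‑FFN; it illustrates the technique rather than reproducing the released kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """ReLU-kernel linear attention: computing K^T V first makes the cost
    linear in sequence length N instead of quadratic."""
    q, k = F.relu(q), F.relu(k)                  # non-negative feature maps
    kv = torch.einsum("bnd,bne->bde", k, v)      # (d, e) summary, O(N * d * e)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

class MixFFN(nn.Module):
    """Gated feed-forward with a depthwise 3x3 convolution for locality,
    as described in the text; exact widths are assumptions."""

    def __init__(self, dim=2240, hidden=5600):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden * 2)   # GLU: value + gate
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.out_proj = nn.Linear(hidden, dim)

    def forward(self, x, h, w):
        # x: (batch, h*w, dim) tokens laid out on an h x w latent grid
        v, g = self.gate_proj(x).chunk(2, dim=-1)
        v = v.transpose(1, 2).reshape(x.size(0), -1, h, w)
        v = self.dwconv(v).flatten(2).transpose(1, 2)  # local 3x3 mixing
        return self.out_proj(v * F.silu(g))            # gated linear unit
```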
Training Strategy
A three‑stage pipeline is used (a code sketch of the freezing schedule follows the list):
Stage 1 – Pre‑training: Freeze all modules except the learnable query and connector; train on 2 M text‑image pairs for ~100 K steps.
Stage 2 – Image Generation Fine‑tuning: Unfreeze the DiT module; train on 60 K internal text‑image pairs for ~10 K steps.
Stage 3 – Image Editing Fine‑tuning: Add the reference image input; train on 10 K image‑editing pairs for ~100 K steps.
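A minimal sketch of that freezing schedule, with illustrative attribute names (`model.query`, `model.connector`, `model.dit`) rather than the repository’s actual ones:

```python
def set_trainable(model, stage):
    """Sketch of the three-stage schedule described above."""
    for p in model.parameters():
        p.requires_grad = False                 # start fully frozen
    model.query.requires_grad = True            # learnable query (nn.Parameter)
    for p in model.connector.parameters():      # connector trains in all stages
        p.requires_grad = True
    if stage >= 2:                              # stages 2-3 unfreeze the DiT
        for p in model.dit.parameters():
            p.requires_grad = True
    # The LLM remains frozen throughout; stage 3 adds the reference-image
    # latent as an extra input rather than new trainable weights.
```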
To stabilize training with a decoder‑only LLM, an RMSNorm layer and a small learnable scaling factor are inserted on the text embeddings, as recommended in the SANA paper.
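A sketch of that stabilization step; the 0.01 initial scale is an assumption, and `nn.RMSNorm` requires PyTorch ≥ 2.4:

```python
import torch
import torch.nn as nn

class StabilizedTextEmbedding(nn.Module):
    """RMS-normalize the text embeddings, then multiply by a small
    learnable scalar, per the stabilization trick described above."""

    def __init__(self, dim=1024):
        super().__init__()
        self.norm = nn.RMSNorm(dim)                    # PyTorch >= 2.4
        self.scale = nn.Parameter(torch.tensor(0.01))  # init value assumed

    def forward(self, text_embeds):
        return self.norm(text_embeds) * self.scale
```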
On‑Device Deployment Optimizations
After PyTorch training, the model is exported to ONNX and then to MNN, which supports the required operators. Quantization settings are:
LLM weights: 4‑bit asymmetric quantization.
All other sub‑models (VAE encoder/decoder, DiT, etc.): 8‑bit asymmetric quantization.
This yields a memory footprint of ≈5.5 GB and preserves visual quality.
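For readers unfamiliar with asymmetric quantization, the sketch below shows the basic scale/zero‑point computation; the actual conversion is performed by MNN’s tooling (e.g. MNNConvert), not by this snippet.

```python
import torch

def quantize_asymmetric(w, bits):
    """Illustrative per-tensor asymmetric quantization: map the weight
    range [min, max] onto [0, 2^bits - 1] via a scale and zero point."""
    qmin, qmax = 0, 2 ** bits - 1
    scale = (w.max() - w.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-w.min() / scale).clamp(qmin, qmax)
    q = torch.clamp(torch.round(w / scale) + zero_point, qmin, qmax)
    return q.to(torch.uint8), scale, zero_point  # dequant: (q - zp) * scale

# Per the article: LLM weights use bits=4, all other sub-models use bits=8.
```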
Speed Benchmarks
Latency for 512×512 editing on various devices (seconds):
iPhone 17 Pro (A19 Pro, iOS) – 14.7 s
iPhone 16 Pro (A18 Pro, iOS) – 18 s
iPhone 15 Pro (A17 Pro, iOS) – 20 s
OnePlus 13 (Snapdragon 8 Elite, Android) – 45 s
Xiaomi 12 Pro (Snapdragon 8 Gen 1, Android) – 62 s
Compared with OpenAI’s Ghibli‑style generation (38‑45 s), MNN‑Sana‑Edit‑V2 on iPhone 17 Pro achieves a 2.5× speedup.
Runtime Requirements & Hyper‑Parameters
Memory usage: ~5.5 GB.
Devices: iPhones with an A16 or newer chip (iOS); Android phones with a Snapdragon 8‑series or newer chip.
Input size: fixed 512×512 (square images recommended).
Prompt: fixed inside the model; altering it may degrade quality.
Generation steps: 10 steps (more steps increase latency without noticeable quality gain).
Source Code and Model Downloads
Repository: https://github.com/alibaba/MNN
Documentation: https://github.com/alibaba/MNN/blob/master/apps/sana/README.md
Model weights (HuggingFace): https://huggingface.co/taobao-mnn/MNN-Sana-Edit-V2
Model weights (ModelScope): https://modelscope.cn/models/MNN/MNN-Sana-Edit-V2
References
SANA: Efficient High‑Resolution Image Synthesis with Linear Diffusion Transformers.
Transfer between Modalities with MetaQueries.