How MNN‑Sana‑Edit‑V2 Brings Comic‑Style Image Editing to Your Phone in 15 Seconds
MNN‑Sana‑Edit‑V2 is a collaborative effort between Taobao’s Meta team and Hangzhou University. It combines a frozen Qwen3‑0.6B LLM, a learnable query, a connector module, a linear DiT, and a deep compression autoencoder, all quantized to 4/8‑bit, to run fully on mobile devices, delivering 512×512 comic‑style conversions in about 15 seconds, roughly 2.5× faster than cloud alternatives. The project ships open‑source code, a detailed account of its training stages, and performance benchmarks.
Overview
MNN‑Sana‑Edit‑V2 is an on‑device image‑to‑image editing model that converts arbitrary images to a comic‑style rendering. It builds on the SANA high‑resolution diffusion architecture and incorporates a frozen Qwen3‑0.6B large language model (LLM) for prompt understanding. The model is fully open‑sourced (GitHub, HuggingFace, ModelScope) and runs locally on mobile devices.
Network Architecture
The backbone follows the SANA design. A frozen LLM is bridged to the diffusion network via a learnable query vector and a connector module that aligns the LLM’s semantic output to the DiT latent space.
Frozen LLM: Qwen3‑0.6B is kept frozen to provide robust textual encoding without additional parameter updates.
Learnable Query: A 256‑dimensional trainable vector is concatenated with the text embedding and fed to the LLM; the last N hidden states become the conditioning vector for diffusion.
Connector Module: A transformer‑based network extracts high‑level semantics from the LLM output, followed by a linear projector that maps these features to the DiT dimension.
Reference Image Encoder: A VAE encoder converts the input image into a latent representation that guides the diffusion process.
Noise Injection: Standard Gaussian noise is added to the diffusion pipeline.
DiT Module: Jointly denoises the noisy latent and the reference latent to produce the edited output.
Learnable Query Details
The query vector is initialized from a normal distribution and trained jointly with the connector. During inference it is concatenated with the text embedding, passed through the frozen LLM, and the resulting hidden states (typically N=256) serve as the conditioning for the diffusion model.
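As a concrete illustration, here is a minimal PyTorch sketch of this conditioning path. It is not the released code: the `QueryConditioner` class, the HuggingFace‑style `inputs_embeds`/`last_hidden_state` interface, and the 0.02 initialization scale are assumptions; only the frozen LLM, the 256 queries (matching Qwen3‑0.6B’s 1024‑wide hidden states), and the “last N hidden states” behavior come from the description above.

```python
import torch
import torch.nn as nn

class QueryConditioner(nn.Module):
    """Illustrative sketch, not the released code: a trainable query bank is
    appended to the text embedding and run through a frozen LLM."""

    def __init__(self, llm, num_queries=256, hidden_dim=1024):
        super().__init__()
        self.llm = llm                       # frozen Qwen3-0.6B backbone
        for p in self.llm.parameters():
            p.requires_grad = False
        # Query vectors initialized from a normal distribution (per the article);
        # the 0.02 scale is an assumption.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)

    def forward(self, text_embeds):
        # text_embeds: (batch, seq_len, hidden_dim)
        batch = text_embeds.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Concatenate the trainable queries after the text tokens.
        inputs = torch.cat([text_embeds, queries], dim=1)
        hidden = self.llm(inputs_embeds=inputs).last_hidden_state
        # The last N hidden states (N = number of queries) condition the DiT.
        return hidden[:, -queries.size(1):, :]
```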
Connector Module Details
The connector consists of a transformer that extracts semantic features from the LLM output and a linear projector that aligns these features to the DiT input space, enabling cross‑modal conditioning.
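A minimal sketch of such a connector, assuming a standard transformer encoder; the layer count, head count, and DiT width are placeholders, not the released configuration.

```python
import torch.nn as nn

class Connector(nn.Module):
    """Illustrative sketch: a small transformer refines the LLM's hidden
    states, then a linear projector maps them into the DiT width."""

    def __init__(self, llm_dim=1024, dit_dim=2240, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.projector = nn.Linear(llm_dim, dit_dim)

    def forward(self, llm_hidden):
        # llm_hidden: (batch, num_queries, llm_dim) from the frozen LLM
        return self.projector(self.transformer(llm_hidden))
```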
Deep Compression Autoencoder
Instead of the typical 8× compression, the model uses a 32× compression autoencoder (DC‑AE‑F32C32), drastically reducing the number of latent tokens, which speeds up training and inference and is essential for edge deployment.
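To make the saving concrete, the snippet below compares latent grid sizes at 512×512, assuming one token per latent position:

```python
# Latent token count for a 512x512 input, one token per latent position:
#   8x autoencoder:  (512 / 8)^2  = 4096 tokens
#   32x autoencoder: (512 / 32)^2 =  256 tokens  -> 16x fewer
for factor in (8, 32):
    side = 512 // factor
    print(f"F{factor}: {side}x{side} latent grid = {side * side} tokens")
```

A 32× autoencoder leaves 256 tokens instead of 4096, a 16× reduction that every subsequent DiT layer benefits from.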
Linear DiT and Mix‑FFN
Following SANA, attention layers are replaced with linear attention, reducing computational complexity from O(N²) to O(N). The Mix‑FFN augments the feed‑forward network with a depthwise 3×3 convolution and a Gated Linear Unit, and removes positional encoding (NoPE) to improve locality handling.
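The following sketch shows the idea behind both components, assuming a ReLU feature map for the linear attention (as in the SANA paper) and placeholder widths for the Mix‑FFN; it illustrates the technique rather than reproducing the released kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """ReLU-kernel linear attention: computing K^T V first makes the cost
    linear in sequence length N instead of quadratic."""
    q, k = F.relu(q), F.relu(k)                  # non-negative feature maps
    kv = torch.einsum("bnd,bne->bde", k, v)      # (d, e) summary, O(N * d * e)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

class MixFFN(nn.Module):
    """Gated feed-forward with a depthwise 3x3 convolution for locality,
    as described in the text; exact widths are assumptions."""

    def __init__(self, dim=2240, hidden=5600):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden * 2)   # GLU: value + gate
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.out_proj = nn.Linear(hidden, dim)

    def forward(self, x, h, w):
        # x: (batch, h*w, dim) tokens laid out on an h x w latent grid
        v, g = self.gate_proj(x).chunk(2, dim=-1)
        v = v.transpose(1, 2).reshape(x.size(0), -1, h, w)
        v = self.dwconv(v).flatten(2).transpose(1, 2)  # local 3x3 mixing
        return self.out_proj(v * F.silu(g))            # gated linear unit
```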
Training Strategy
A three‑stage pipeline is used (a code sketch of the freezing schedule follows the list):
Stage 1 – Pre‑training: Freeze all modules except the learnable query and connector; train on 2 M text‑image pairs for ~100 K steps.
Stage 2 – Image Generation Fine‑tuning: Unfreeze the DiT module; train on 60 K internal text‑image pairs for ~10 K steps.
Stage 3 – Image Editing Fine‑tuning: Add the reference image input; train on 10 K image‑editing pairs for ~100 K steps.
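A minimal sketch of that freezing schedule, with illustrative attribute names (`model.query`, `model.connector`, `model.dit`) rather than the repository’s actual ones:

```python
def set_trainable(model, stage):
    """Sketch of the three-stage schedule described above."""
    for p in model.parameters():
        p.requires_grad = False                 # start fully frozen
    model.query.requires_grad = True            # learnable query (nn.Parameter)
    for p in model.connector.parameters():      # connector trains in all stages
        p.requires_grad = True
    if stage >= 2:                              # stages 2-3 unfreeze the DiT
        for p in model.dit.parameters():
            p.requires_grad = True
    # The LLM remains frozen throughout; stage 3 adds the reference-image
    # latent as an extra input rather than new trainable weights.
```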
To stabilize training with a decoder‑only LLM, an RMSNorm layer and a small learnable scaling factor are inserted on the text embeddings, as recommended in the SANA paper.
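A sketch of that stabilization step; the 0.01 initial scale is an assumption, and `nn.RMSNorm` requires PyTorch ≥ 2.4:

```python
import torch
import torch.nn as nn

class StabilizedTextEmbedding(nn.Module):
    """RMS-normalize the text embeddings, then multiply by a small
    learnable scalar, per the stabilization trick described above."""

    def __init__(self, dim=1024):
        super().__init__()
        self.norm = nn.RMSNorm(dim)                    # PyTorch >= 2.4
        self.scale = nn.Parameter(torch.tensor(0.01))  # init value assumed

    def forward(self, text_embeds):
        return self.norm(text_embeds) * self.scale
```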
On‑Device Deployment Optimizations
After PyTorch training, the model is exported to ONNX and then to MNN, which supports the required operators. Quantization settings are:
LLM weights: 4‑bit asymmetric quantization.
All other sub‑models (VAE encoder/decoder, DiT, etc.): 8‑bit asymmetric quantization.
This yields a memory footprint of ≈5.5 GB and preserves visual quality.
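For readers unfamiliar with asymmetric quantization, the sketch below shows the basic scale/zero‑point computation; the actual conversion is performed by MNN’s tooling (e.g. MNNConvert), not by this snippet.

```python
import torch

def quantize_asymmetric(w, bits):
    """Illustrative per-tensor asymmetric quantization: map the weight
    range [min, max] onto [0, 2^bits - 1] via a scale and zero point."""
    qmin, qmax = 0, 2 ** bits - 1
    scale = (w.max() - w.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-w.min() / scale).clamp(qmin, qmax)
    q = torch.clamp(torch.round(w / scale) + zero_point, qmin, qmax)
    return q.to(torch.uint8), scale, zero_point  # dequant: (q - zp) * scale

# Per the article: LLM weights use bits=4, all other sub-models use bits=8.
```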
Speed Benchmarks
Latency for 512×512 editing on various devices (seconds):
iPhone 17 Pro (A19 Pro, iOS) – 14.7 s
iPhone 16 Pro (A18 Pro, iOS) – 18 s
iPhone 15 Pro (A17 Pro, iOS) – 20 s
OnePlus 13 (Snapdragon 8 Elite, Android) – 45 s
Xiaomi 12 Pro (Snapdragon 8 Gen 1, Android) – 62 s
Compared with OpenAI’s Ghibli‑style generation (38‑45 s), MNN‑Sana‑Edit‑V2 on iPhone 17 Pro achieves a 2.5× speedup.
Runtime Requirements & Hyper‑Parameters
Memory usage: ~5.5 GB.
Devices: iPhones with an A16 or newer chip (iOS); Android phones with a Snapdragon 8‑series or newer chip.
Input size: fixed 512×512 (square images recommended).
Prompt: fixed inside the model; altering it may degrade quality.
Generation steps: 10 steps (more steps increase latency without noticeable quality gain).
Source Code and Model Downloads
Repository: https://github.com/alibaba/MNN
Documentation: https://github.com/alibaba/MNN/blob/master/apps/sana/README.md
Model weights (HuggingFace): https://huggingface.co/taobao-mnn/MNN-Sana-Edit-V2
Model weights (ModelScope): https://modelscope.cn/models/MNN/MNN-Sana-Edit-V2
References
SANA: Efficient High‑Resolution Image Synthesis with Linear Diffusion Transformers.
Transfer between Modalities with MetaQueries.