SnapGen Generates 1024px Images in 1.4 s with Lightweight On‑Device Architecture
SnapGen is a 379 M‑parameter text‑to‑image diffusion model that produces 1024 px images on mobile devices in about 1.4 seconds. It combines a compact U‑Net design, multi‑stage knowledge distillation, step distillation, and an improved training recipe to outperform much larger models on standard benchmarks.
SnapGen: Small‑footprint high‑resolution T2I diffusion
SnapGen is a 379 M‑parameter latent diffusion model that generates 1024 px images on a mobile device in 1.2–2.3 s (4–8 denoising steps). It achieves an FID of 2.06 on ImageNet‑1K at 256 px and surpasses multi‑billion‑parameter baselines (SDXL, IF‑XL) on GenEval and DPG‑Bench.
Efficient U‑Net backbone
Starting from the SDXL U‑Net, the authors systematically reduce depth and channel width and replace several components:
High‑resolution self‑attention layers are removed (only kept at the lowest resolution). This cuts FLOPs by 17 % and latency by 24 % while improving FID from 3.76 to 3.12, likely because the model converges faster without noisy high‑res attention.
All standard convolutions are swapped for separable convolutions (depthwise + pointwise), which cuts parameters by 24 % and latency by 62 % at the cost of a slight FID rise (3.12→3.38). Expanding the intermediate channel dimension with a ratio of 2 recovers the lost quality; the resulting expanded separable convolution still yields a 15 % parameter reduction, a 27 % FLOP reduction and a 2.4× speed‑up.
The feed‑forward network (FFN) expansion ratio is trimmed from 4 to 3, decreasing parameters and FLOPs by ~12 % with negligible FID loss.
Multi‑Head Self‑Attention (MHSA) is replaced by Multi‑Query Attention (MQA), which shares keys/values across heads. This reduces parameters by 16 % and latency by 9 %, although FLOPs fall by only 6 %; most of the latency gain comes from reduced memory access, which raises compute intensity.
Cross‑attention conditioning is injected from the first stage by converting the first‑stage residual block into a transformer block (CA + FFN) even though no self‑attention is present, which improves FID.
QK‑RMSNorm (query‑key RMS normalization) and 2‑D Rotary Position Embedding (RoPE) are added; they introduce negligible overhead and give modest FID gains.
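To make the parameter savings above concrete, here is a minimal sketch that counts weights for a single layer. It assumes one plausible layout for the expanded separable convolution (pointwise expand, depthwise 3×3, pointwise project) and a standard MQA projection scheme; the function names and the layer sizes are illustrative and are not taken from the paper, so the ratios will not match the whole-model percentages quoted above.

```python
def conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def expanded_separable_params(c_in, c_out, k=3, ratio=2):
    """Weights in an assumed expand -> depthwise -> project layout (bias ignored)."""
    hidden = c_in * ratio
    pointwise_expand = c_in * hidden      # 1x1 conv that widens the channels
    depthwise = hidden * k * k            # one k x k filter per channel
    pointwise_project = hidden * c_out    # 1x1 conv back down to c_out
    return pointwise_expand + depthwise + pointwise_project


def mhsa_params(d_model):
    """Q, K, V and output projections, each d_model x d_model (bias ignored)."""
    return 4 * d_model * d_model


def mqa_params(d_model, n_heads):
    """Per-head queries, but a single key/value head shared by all heads."""
    d_head = d_model // n_heads
    q_and_out = 2 * d_model * d_model
    shared_kv = 2 * d_model * d_head
    return q_and_out + shared_kv


if __name__ == "__main__":
    c, d, h = 320, 1280, 20  # illustrative sizes, not taken from the paper
    print("conv:", conv_params(c, c), "vs separable:", expanded_separable_params(c, c))
    print("MHSA:", mhsa_params(d), "vs MQA:", mqa_params(d, h))
```

Even with the channel expansion, the separable layer in this toy setting holds fewer than half the weights of the standard 3×3 convolution, and MQA removes most of the key/value projection weights.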
Compact decoder
The baseline SDXL/SD3 decoder is replaced by a tiny decoder that removes all attention layers, keeps only minimal GroupNorm, thins the channel dimension, and uses separable convolutions. Training uses a combination of MSE, LPIPS, and adversarial losses on 256 px patches (batch = 256, 1 M iterations). The tiny decoder yields a 35.9×–54.4× speed‑up over the baseline while preserving competitive PSNR.
Training recipe and multi‑stage knowledge distillation
Training adopts Rectified Flows (flow matching) as the diffusion objective, which defines a straight‑line trajectory from a data sample x_0 to a standard‑normal noise sample z_T. Timesteps are drawn with logit‑normal sampling, which concentrates them in the middle of the diffusion schedule and improves stability. Distillation proceeds in two stages:
Output distillation: the student is trained to match the velocity field predicted by the large teacher model (SD3.5‑Large).
Feature‑level distillation: a lightweight 2‑conv projection aligns the student’s final‑layer features with the teacher’s.
To balance the multiple loss terms, a timestep‑aware scaling factor α(t) multiplies the distillation loss. α(t) is larger for difficult timesteps (t ≈ 0 or 1) and smaller for easier middle timesteps, thereby amplifying teacher supervision when needed.
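The training objective described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: time is normalized to [0, 1] with t = 0 at the data and t = 1 at the noise, vectors are plain lists, and the U‑shaped `alpha` function with its `low` and `high` values is hypothetical, chosen only to mirror the description that the weight is larger near t ≈ 0 or 1.

```python
import math
import random


def sample_timestep(rng, mean=0.0, std=1.0):
    """Logit-normal sample in (0, 1): the sigmoid of a Gaussian draw."""
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))


def interpolate(x0, noise, t):
    """Point on the straight line from data (t = 0) to noise (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


def velocity_target(x0, noise):
    """Constant velocity of that straight line: dz_t/dt = noise - x0."""
    return [b - a for a, b in zip(x0, noise)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def alpha(t, low=0.5, high=2.0):
    """Hypothetical U-shaped weight: `high` at t = 0 or 1, `low` at t = 0.5."""
    return low + (high - low) * (2.0 * t - 1.0) ** 2


def training_loss(student_v, teacher_v, x0, noise, t):
    """Flow-matching task loss plus alpha(t)-weighted output distillation."""
    task = mse(student_v, velocity_target(x0, noise))
    distill = mse(student_v, teacher_v)
    return task + alpha(t) * distill
```

In a real training loop, `student_v` and `teacher_v` would be the two networks' predictions at the noisy input `interpolate(x0, noise, t)`; feature‑level distillation would add a further term that is omitted here.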
Step distillation
Building on Latent Adversarial Diffusion Distillation (LADD) [22], a discriminator is initialized from the few‑step teacher (SD3.5‑Large‑Turbo). The generator is trained with a combined adversarial loss and output‑distillation loss to match the teacher’s distribution in as few as 4–8 denoising steps.
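To show what "4–8 denoising steps" means at inference time, the following is a minimal Euler sampler for a rectified‑flow model. It covers only the sampling loop, not the adversarial training used in LADD. The names `euler_sample` and `make_toy_velocity` are hypothetical, time is assumed to run from t = 1 (noise) to t = 0 (data), and the toy velocity function stands in for the trained U‑Net.

```python
def euler_sample(velocity_fn, noise, num_steps):
    """Integrate dz/dt = v(z, t) from t = 1 (noise) down to t = 0 (data)."""
    z = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt          # current time, always >= dt > 0
        v = velocity_fn(z, t)     # one network call per step
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z


def make_toy_velocity(x0):
    """Stand-in for the network: the exact velocity toward a known x0."""
    def velocity_fn(z, t):
        # On the line z_t = (1 - t) * x0 + t * noise, noise - x0 = (z_t - x0) / t.
        return [(zi - xi) / t for zi, xi in zip(z, x0)]
    return velocity_fn


if __name__ == "__main__":
    target = [0.25, -1.0, 3.0]
    start = [1.5, 0.5, -2.0]
    for steps in (4, 8):
        print(steps, euler_sample(make_toy_velocity(target), start, steps))
```

Because the toy trajectory is perfectly straight, the sampler recovers the target with any step count. A trained model only approximates straight paths, which is why distillation is needed to keep quality at 4–8 steps.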
Experimental protocol
Pre‑train the U‑Net on ImageNet‑1K at 256 px (120 epochs).
Fine‑tune progressively to 512 px and then 1024 px.
Use three text encoders (CLIP‑L, CLIP‑G, Gemma2‑2b) merged into a unified embedding.
Apply the two‑stage distillation with the SD3.5‑Large teacher.
Perform step distillation with the few‑step teacher (SD3.5‑Large‑Turbo) to obtain a few‑step sampler.
Results
Quantitative performance (GenEval, DPG‑Bench, COCO CLIP score, Image‑Reward aesthetic score) shows that the 0.38 B‑parameter SnapGen outperforms SDXL (2.6 B), Playground (2.6 B) and IF‑XL (5.5 B). Knowledge distillation raises prompt‑following metrics, and step distillation achieves comparable scores with 4–8 steps versus the 28‑step baseline.
Qualitative observations indicate better text‑image alignment and fewer facial smoothing artifacts compared with existing models.
On‑device latency
On an iPhone 16 Pro Max, the full pipeline (tiny decoder + U‑Net) generates a 1024 px image in 1.2–2.3 s using 4–8 denoising steps. The decoder costs ≈119 ms and each U‑Net step ≈274 ms; text‑encoder latency is negligible.
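The quoted range follows directly from the per‑component timings. A small check, assuming the decoder runs once per image and text‑encoder time is ignored (the constant and function names are illustrative):

```python
DECODER_MS = 119     # tiny decoder, run once per image
UNET_STEP_MS = 274   # one U-Net denoising step


def pipeline_latency_ms(num_steps):
    """Total latency in milliseconds, ignoring the text encoders."""
    return DECODER_MS + num_steps * UNET_STEP_MS


for steps in (4, 8):
    print(steps, "steps:", pipeline_latency_ms(steps) / 1000.0, "s")
```

Four steps give 1215 ms and eight steps give 2311 ms, which round to the 1.2 s and 2.3 s endpoints.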
Paper: SnapGen: Taming High‑Resolution Text‑to‑Image Models for Mobile Devices with Efficient Architectures and Training
URL: http://arxiv.org/pdf/2412.09619
Project page: http://snap-research.github.io/snapgen/
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.