How SANA 1.5 Lets Small Models Reach New Text‑to‑Image SOTA
SANA 1.5 introduces an efficient model‑growth pipeline, depth‑pruning, and inference‑time scaling that reuse a 1.6 B‑parameter foundation to train a 4.8 B model with 8× lower memory, 60 % less training time, and GenEval scores that rival or surpass much larger diffusion models.
Key Contributions
Efficient Model Growth : Scales parameters from 1.6 B (20 blocks) to 4.8 B (60 blocks) while reusing knowledge from the smaller model; training time drops by ~60 % compared with training from scratch (see Fig 2).
Model‑Depth Pruning : Analyzes block importance via input‑output similarity, removes low‑impact blocks, and fine‑tunes for 100 steps to recover quality, compressing a 4.8 B model to 40/30/20‑block variants (Fig 6‑7).
Inference‑Time Scaling : Replaces parameter scaling with extra compute at inference, raising GenEval from 0.72 to 0.80 and matching larger models without increasing parameters (Fig 8‑11).
CAME‑8bit Optimizer : Combines CAME with 8‑bit block‑wise quantisation, cutting optimizer memory to ~1/8 of AdamW‑32bit while keeping 32‑bit precision for second‑order statistics, enabling billion‑parameter training on a single RTX 4090.
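The CAME‑8bit recipe itself is not reproduced here, but the memory saving rests on block‑wise 8‑bit quantisation of optimizer statistics: each block of a state tensor is stored as int8 plus one float scale. A minimal NumPy sketch of block‑wise quantise/dequantise (block size, absmax scaling, and rounding are assumptions for illustration, not the paper's exact scheme):

```python
import numpy as np

def quantize_blockwise(x, block_size=256):
    """Quantize a float32 vector to int8 with one absmax scale per block."""
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0                      # avoid division by zero
    q = np.round(blocks / scales * 127).astype(np.int8)
    return q, scales, len(x)

def dequantize_blockwise(q, scales, n):
    """Recover an approximate float32 vector from int8 blocks + scales."""
    return (q.astype(np.float32) / 127 * scales).reshape(-1)[:n]

# Optimizer state stored in 1/4 the memory (int8 vs float32), with only a
# small per-block scale overhead; the reconstruction error stays tiny.
state = np.random.randn(1000).astype(np.float32)
q, s, n = quantize_blockwise(state)
recovered = dequantize_blockwise(q, s, n)
err = np.abs(recovered - state).max()
```

Per‑block scaling is what keeps this robust: one outlier only degrades precision inside its own block rather than across the whole tensor.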
Training and Fine‑Tuning Pipeline
The authors evaluate three initialization strategies for expanding the model: Partial Preservation Init (keep early layers, randomly init new ones), Cyclic Replication Init (repeat pretrained layers periodically), and Block Replication Init (copy a pretrained block according to an expansion ratio). Partial Preservation Init is chosen for its simplicity and stability, with new layers zero‑initialised (Identity Mapping) to act as a perfect identity at the start of training.
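The identity‑mapping idea can be made concrete: if each newly added residual block's output projection is zero‑initialised, the grown network computes exactly the same function as the pretrained one at step 0. A toy NumPy illustration (block structure and sizes are invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

def make_block(zero_init=False):
    """A toy residual block: x + W2 @ relu(W1 @ x)."""
    w1 = rng.standard_normal((dim, dim)) * 0.1
    w2 = np.zeros((dim, dim)) if zero_init else rng.standard_normal((dim, dim)) * 0.1
    return (w1, w2)

def apply_block(block, x):
    w1, w2 = block
    return x + w2 @ np.maximum(w1 @ x, 0.0)

def forward(blocks, x):
    for b in blocks:
        x = apply_block(b, x)
    return x

# Grow a pretrained 3-block model to 5 blocks: the new blocks have
# zero-initialized output projections, so each acts as an exact identity
# and the grown model initially reproduces the small model's outputs.
pretrained = [make_block() for _ in range(3)]
grown = pretrained + [make_block(zero_init=True) for _ in range(2)]

x = rng.standard_normal(dim)
assert np.allclose(forward(pretrained, x), forward(grown, x))
```

Because the grown model starts as a perfect copy of the small one, training resumes from the small model's quality rather than from scratch, which is where the ~60 % training‑time saving comes from.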
RMSNorm is applied to the Query and Key projections in both the linear self‑attention and cross‑attention modules, stabilising early training and the integration of newly added layers.
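Why Q/K normalisation stabilises training is easy to see numerically: RMS‑normalised queries and keys give attention logits bounded by the feature dimension, no matter how large the raw activations grow. A minimal sketch (the learnable gain used in practice is omitted):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm: rescale so the root-mean-square of the features is 1.
    (A learnable per-feature gain, used in practice, is omitted here.)"""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

# Pathologically large activations, as a fresh layer might emit early on:
q = np.random.randn(4, 16) * 50.0
k = np.random.randn(4, 16) * 50.0

logits_raw = q @ k.T                        # can explode (order 10^4 here)
logits_qk = rms_norm(q) @ rms_norm(k).T     # |logit| <= feature dim (16)
```

Each normalised row has L2 norm sqrt(16) = 4, so every dot product is bounded by 16, keeping softmax attention well behaved while the new layers train.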
Depth‑Pruning Method
Block importance is measured by averaging input‑output similarity across diffusion timesteps and a calibration dataset. Head and tail blocks show high importance, while middle blocks are largely redundant. Pruning follows this ranking, and a brief 100‑step fine‑tune restores performance, allowing the pruned 20‑block (1.6 B) variant to approach the quality of the full 4.8 B model.
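The intuition behind the metric is that a block whose output barely differs from its input is near an identity and can be removed cheaply. A NumPy sketch of the scoring (cosine similarity over calibration activations; the exact similarity measure and averaging over timesteps are assumptions):

```python
import numpy as np

def block_importance(inputs, outputs):
    """Importance = 1 - mean cosine similarity between a block's input and
    output over calibration samples; near-identity blocks (high similarity)
    score low and are the first candidates for pruning."""
    cos = np.sum(inputs * outputs, axis=-1) / (
        np.linalg.norm(inputs, axis=-1) * np.linalg.norm(outputs, axis=-1))
    return 1.0 - cos.mean()

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 64))                   # calibration activations
head = x + rng.standard_normal((32, 64))            # head block: transforms its input
middle = x + 0.01 * rng.standard_normal((32, 64))   # middle block: near identity

scores = {"head": block_importance(x, head),
          "middle": block_importance(x, middle)}
keep_order = sorted(scores, key=scores.get, reverse=True)  # prune from the tail
```

Ranking all 60 blocks this way and deleting the lowest‑scoring ones yields the 40/30/20‑block variants before the short recovery fine‑tune.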
Inference‑Time Scaling Strategy
Instead of increasing denoising steps, the authors increase the number of sampled candidates and use a Visual Language Model (VLM) to rank them. Scaling sampling candidates proves more effective than scaling steps, as extra steps cannot correct early errors (Fig 7a) and quickly plateau in quality (Fig 7b).
Commercial multi‑modal APIs (GPT‑4o, Gemini‑1.5‑pro) exhibited inconsistent scoring and strong bias toward the first candidate. The authors therefore fine‑tuned NVILA‑2B on a custom dataset to provide reliable image‑text alignment scores.
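The overall loop is best‑of‑N selection: sample N candidates, score each with the judge, keep the top one. A schematic Python sketch, where both the sampler and the VLM scorer are illustrative stand‑ins (the real pipeline uses the SANA diffusion sampler and the fine‑tuned NVILA‑2B judge):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(prompt, n):
    """Stand-in for the diffusion sampler: n candidates per prompt
    (random vectors here; a real sampler returns images)."""
    return rng.standard_normal((n, 16))

def vlm_score(prompt, image):
    """Stand-in for the fine-tuned NVILA-2B judge; the real judge returns
    an image-text alignment score (this rule is purely illustrative)."""
    return float(image.sum())

def best_of_n(prompt, n):
    """Sample n candidates, score each with the VLM, keep the best."""
    candidates = generate(prompt, n)
    scores = np.array([vlm_score(prompt, c) for c in candidates])
    best = int(scores.argmax())
    return candidates[best], float(scores[best]), scores

image, score, scores = best_of_n("a red cube on a blue sphere", 16)
```

Since the expected maximum of N samples grows with N, quality improves monotonically with the candidate budget, which is exactly the knob the paper scales instead of parameters or denoising steps.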
Experimental Setup
The final SANA‑4.8B model keeps the same channel width (2240) and FFN dimension (5600) as the 1.6 B baseline. Training follows the same data and hyper‑parameters, with a two‑stage process: large‑scale pre‑training followed by supervised fine‑tuning on high‑quality data (3 M samples selected from 50 M pre‑training images with CLIP score > 25).
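The SFT data selection described above is a simple threshold filter; a minimal sketch with hypothetical record fields (the field names and sample records are invented):

```python
# Hypothetical SFT selection mirroring the described filter: keep
# pre-training samples whose CLIP image-text score exceeds 25.
pretraining_pool = [
    {"image": "img_001.png", "caption": "a cat on a mat", "clip_score": 31.2},
    {"image": "img_002.png", "caption": "abstract noise",  "clip_score": 18.4},
    {"image": "img_003.png", "caption": "city at night",   "clip_score": 26.7},
]

sft_set = [s for s in pretraining_pool if s["clip_score"] > 25]
```

Applied at scale, this kind of filter is what reduces the 50 M pre‑training pool to the 3 M high‑alignment samples used for the fine‑tuning stage.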
Evaluation uses FID and CLIP Score computed on the MJHQ‑30K dataset, plus GenEval and DPG‑Bench as dedicated text‑to‑image alignment benchmarks.
Results
Scaling from 1.6 B to 4.8 B yields a GenEval gain of +0.06 (0.66 → 0.72), FID reduction of 0.34 (5.76 → 5.42), and DPG increase of +0.2 (84.8 → 85.0). SANA‑4.8B matches or exceeds larger models such as Playground v3 (24 B) and FLUX (12 B) while using far less compute (latency 5.5× lower than FLUX‑dev).
Pruned models fine‑tuned for 100 steps recover most of the quality loss, outperforming from‑scratch training at the same compute budget. Inference‑time scaling further improves GenEval by up to 8 points (0.72 → 0.80) and yields 4 % higher overall accuracy than Playground v3 despite the smaller model size.
Limitations
The main drawback is increased inference compute: sampling N candidate images costs N times the single‑image generation GFLOPs, plus the additional cost of VLM judging. The authors leave more efficient sampling and ranking mechanisms to future work.
References
SANA: Efficient high‑resolution image synthesis with linear diffusion transformers
LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.