How SANA 1.5 Lets Small Models Reach New Text‑to‑Image SOTA
SANA 1.5 introduces an efficient model‑growth pipeline, depth‑pruning, and inference‑time scaling that reuse a 1.6 B‑parameter foundation to train a 4.8 B model with 8× lower memory, 60 % less training time, and GenEval scores that rival or surpass much larger diffusion models.
Key Contributions
Efficient Model Growth : Scales parameters from 1.6 B (20 blocks) to 4.8 B (60 blocks) while reusing knowledge from the smaller model; training time drops by ~60 % compared with training from scratch (see Fig 2).
Model‑Depth Pruning : Analyzes block importance via input‑output similarity, removes low‑impact blocks, and fine‑tunes for 100 steps to recover quality, compressing a 4.8 B model to 40/30/20‑block variants (Fig 6‑7).
Inference‑Time Scaling : Replaces parameter scaling with extra compute at inference, raising GenEval from 0.72 to 0.80 and matching larger models without increasing parameters (Fig 8‑11).
CAME‑8bit Optimizer : Combines CAME with 8‑bit block‑wise quantisation, cutting optimizer memory to ~1/8 of AdamW‑32bit while keeping 32‑bit precision for second‑order statistics, enabling billion‑parameter training on a single RTX 4090.
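The CAME‑8bit recipe itself is not reproduced here, but the memory saving rests on block‑wise 8‑bit quantisation of optimizer statistics: each block of a state tensor is stored as int8 plus one float scale. A minimal NumPy sketch of block‑wise quantise/dequantise (block size, absmax scaling, and rounding are assumptions for illustration, not the paper's exact scheme):

```python
import numpy as np

def quantize_blockwise(x, block_size=256):
    """Quantize a float32 vector to int8 with one absmax scale per block."""
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0                      # avoid division by zero
    q = np.round(blocks / scales * 127).astype(np.int8)
    return q, scales, len(x)

def dequantize_blockwise(q, scales, n):
    """Recover an approximate float32 vector from int8 blocks + scales."""
    return (q.astype(np.float32) / 127 * scales).reshape(-1)[:n]

# Optimizer state stored in 1/4 the memory (int8 vs float32), with only a
# small per-block scale overhead; the reconstruction error stays tiny.
state = np.random.randn(1000).astype(np.float32)
q, s, n = quantize_blockwise(state)
recovered = dequantize_blockwise(q, s, n)
err = np.abs(recovered - state).max()
```

Per‑block scaling is what keeps this robust: one outlier only degrades precision inside its own block rather than across the whole tensor.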
Training and Fine‑Tuning Pipeline
The authors evaluate three initialization strategies for expanding the model: Partial Preservation Init (keep early layers, randomly init new ones), Cyclic Replication Init (repeat pretrained layers periodically), and Block Replication Init (copy a pretrained block according to an expansion ratio). Partial Preservation Init is chosen for its simplicity and stability, with new layers zero‑initialised (Identity Mapping) to act as a perfect identity at the start of training.
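The identity‑mapping idea can be made concrete: if each newly added residual block's output projection is zero‑initialised, the grown network computes exactly the same function as the pretrained one at step 0. A toy NumPy illustration (block structure and sizes are invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

def make_block(zero_init=False):
    """A toy residual block: x + W2 @ relu(W1 @ x)."""
    w1 = rng.standard_normal((dim, dim)) * 0.1
    w2 = np.zeros((dim, dim)) if zero_init else rng.standard_normal((dim, dim)) * 0.1
    return (w1, w2)

def apply_block(block, x):
    w1, w2 = block
    return x + w2 @ np.maximum(w1 @ x, 0.0)

def forward(blocks, x):
    for b in blocks:
        x = apply_block(b, x)
    return x

# Grow a pretrained 3-block model to 5 blocks: the new blocks have
# zero-initialized output projections, so each acts as an exact identity
# and the grown model initially reproduces the small model's outputs.
pretrained = [make_block() for _ in range(3)]
grown = pretrained + [make_block(zero_init=True) for _ in range(2)]

x = rng.standard_normal(dim)
assert np.allclose(forward(pretrained, x), forward(grown, x))
```

Because the grown model starts as a perfect copy of the small one, training resumes from the small model's quality rather than from scratch, which is where the ~60 % training‑time saving comes from.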
RMSNorm is applied to the Query and Key projections in both the linear self‑attention and cross‑attention modules, stabilising early training and the integration of newly added layers.
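Why Q/K normalisation stabilises training is easy to see numerically: RMS‑normalised queries and keys give attention logits bounded by the feature dimension, no matter how large the raw activations grow. A minimal sketch (the learnable gain used in practice is omitted):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm: rescale so the root-mean-square of the features is 1.
    (A learnable per-feature gain, used in practice, is omitted here.)"""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

# Pathologically large activations, as a fresh layer might emit early on:
q = np.random.randn(4, 16) * 50.0
k = np.random.randn(4, 16) * 50.0

logits_raw = q @ k.T                        # can explode (order 10^4 here)
logits_qk = rms_norm(q) @ rms_norm(k).T     # |logit| <= feature dim (16)
```

Each normalised row has L2 norm sqrt(16) = 4, so every dot product is bounded by 16, keeping softmax attention well behaved while the new layers train.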
Depth‑Pruning Method
Block importance is measured by averaging input‑output similarity across diffusion timesteps and a calibration dataset. Head and tail blocks show high importance, while middle blocks are largely redundant. Pruning follows this ranking, and a brief 100‑step fine‑tune restores performance, allowing the pruned 20‑block (1.6 B) variant to approach the quality of the full 4.8 B model.
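The intuition behind the metric is that a block whose output barely differs from its input is near an identity and can be removed cheaply. A NumPy sketch of the scoring (cosine similarity over calibration activations; the exact similarity measure and averaging over timesteps are assumptions):

```python
import numpy as np

def block_importance(inputs, outputs):
    """Importance = 1 - mean cosine similarity between a block's input and
    output over calibration samples; near-identity blocks (high similarity)
    score low and are the first candidates for pruning."""
    cos = np.sum(inputs * outputs, axis=-1) / (
        np.linalg.norm(inputs, axis=-1) * np.linalg.norm(outputs, axis=-1))
    return 1.0 - cos.mean()

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 64))                   # calibration activations
head = x + rng.standard_normal((32, 64))            # head block: transforms its input
middle = x + 0.01 * rng.standard_normal((32, 64))   # middle block: near identity

scores = {"head": block_importance(x, head),
          "middle": block_importance(x, middle)}
keep_order = sorted(scores, key=scores.get, reverse=True)  # prune from the tail
```

Ranking all 60 blocks this way and deleting the lowest‑scoring ones yields the 40/30/20‑block variants before the short recovery fine‑tune.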
Inference‑Time Scaling Strategy
Instead of increasing denoising steps, the authors increase the number of sampled candidates and use a Visual Language Model (VLM) to rank them. Scaling sampling candidates proves more effective than scaling steps, as extra steps cannot correct early errors (Fig 7a) and quickly plateau in quality (Fig 7b).
Commercial multi‑modal APIs (GPT‑4o, Gemini‑1.5‑pro) exhibited inconsistent scoring and strong bias toward the first candidate. The authors therefore fine‑tuned NVILA‑2B on a custom dataset to provide reliable image‑text alignment scores.
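The overall loop is best‑of‑N selection: sample N candidates, score each with the judge, keep the top one. A schematic Python sketch, where both the sampler and the VLM scorer are illustrative stand‑ins (the real pipeline uses the SANA diffusion sampler and the fine‑tuned NVILA‑2B judge):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(prompt, n):
    """Stand-in for the diffusion sampler: n candidates per prompt
    (random vectors here; a real sampler returns images)."""
    return rng.standard_normal((n, 16))

def vlm_score(prompt, image):
    """Stand-in for the fine-tuned NVILA-2B judge; the real judge returns
    an image-text alignment score (this rule is purely illustrative)."""
    return float(image.sum())

def best_of_n(prompt, n):
    """Sample n candidates, score each with the VLM, keep the best."""
    candidates = generate(prompt, n)
    scores = np.array([vlm_score(prompt, c) for c in candidates])
    best = int(scores.argmax())
    return candidates[best], float(scores[best]), scores

image, score, scores = best_of_n("a red cube on a blue sphere", 16)
```

Since the expected maximum of N samples grows with N, quality improves monotonically with the candidate budget, which is exactly the knob the paper scales instead of parameters or denoising steps.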
Experimental Setup
The final SANA‑4.8B model keeps the same channel width (2240) and FFN dimension (5600) as the 1.6 B baseline. Training follows the same data and hyper‑parameters, with a two‑stage process: large‑scale pre‑training followed by supervised fine‑tuning on high‑quality data (3 M samples selected from 50 M pre‑training images with CLIP score > 25).
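The SFT data selection described above is a simple threshold filter; a minimal sketch with hypothetical record fields (the field names and sample records are invented):

```python
# Hypothetical SFT selection mirroring the described filter: keep
# pre-training samples whose CLIP image-text score exceeds 25.
pretraining_pool = [
    {"image": "img_001.png", "caption": "a cat on a mat", "clip_score": 31.2},
    {"image": "img_002.png", "caption": "abstract noise",  "clip_score": 18.4},
    {"image": "img_003.png", "caption": "city at night",   "clip_score": 26.7},
]

sft_set = [s for s in pretraining_pool if s["clip_score"] > 25]
```

Applied at scale, this kind of filter is what reduces the 50 M pre‑training pool to the 3 M high‑alignment samples used for the fine‑tuning stage.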
Evaluation uses FID and CLIP Score computed on the MJHQ‑30K dataset, plus GenEval and DPG‑Bench as dedicated text‑to‑image alignment benchmarks.
Results
Scaling from 1.6 B to 4.8 B yields a GenEval gain of +0.06 (0.66 → 0.72), FID reduction of 0.34 (5.76 → 5.42), and DPG increase of +0.2 (84.8 → 85.0). SANA‑4.8B matches or exceeds larger models such as Playground v3 (24 B) and FLUX (12 B) while using far less compute (latency 5.5× lower than FLUX‑dev).
Pruned models fine‑tuned for 100 steps recover most of the quality loss, outperforming from‑scratch training at the same compute budget. Inference‑time scaling further improves GenEval by up to 8 points (0.72 → 0.80) and yields 4 % higher overall accuracy than Playground v3 despite the smaller model size.
Limitations
The main drawback is increased inference compute: sampling N candidate images costs N times the single‑image generation GFLOPs, plus the additional cost of VLM judging. The authors leave more efficient sampling and ranking mechanisms to future work.
References
SANA: Efficient high‑resolution image synthesis with linear diffusion transformers
LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.