How SANA 1.5’s Efficient Linear Diffusion Transformer Sets a New SOTA in Text‑to‑Image Generation

The paper introduces SANA 1.5, an efficient linear diffusion transformer that scales training and inference compute via model growth, depth‑wise pruning, and inference‑time scaling, achieving a GenEval score of 0.80 and matching larger models while using far fewer resources.

AIWalker

Overview

Text‑to‑image diffusion models have grown dramatically in size, improving quality but incurring huge training and inference costs. Recent industry models have expanded from PixArt’s 0.6 B parameters to Playground v3’s 24 B, making high‑quality generation unaffordable for most practitioners.

Introduction

This article presents SANA‑1.5, an efficient linear diffusion transformer for text‑to‑image generation. Building on SANA‑1.0, it adds three key innovations: (1) an efficient training‑time scaling paradigm that grows parameters from 1.6 B to 4.8 B while cutting compute, coupled with an 8‑bit memory‑efficient optimizer; (2) model‑depth pruning based on block‑importance analysis to compress the model to arbitrary sizes with minimal quality loss; (3) an inference‑time scaling strategy that uses repeated sampling to let smaller models match the quality of larger ones. On the GenEval benchmark, SANA‑1.5 reaches a text‑image alignment score of 0.72, which rises to 0.80 with inference scaling, establishing a new state‑of‑the‑art.

Method and Model

1. Overview

SANA‑1.5 combines three complementary strategies rather than training a gigantic model from scratch. First, the base model is grown to more transformer layers while preserving learned knowledge. Second, depth‑wise pruning keeps only the important transformer blocks, enabling flexible model sizes at low fine‑tuning cost. Third, at inference time, repeated sampling with VLM‑guided selection trades extra compute for model capacity. The memory‑efficient CAME‑8bit optimizer makes fine‑tuning billion‑parameter models on a single consumer‑grade GPU feasible.

2. Efficient Model Growth

Instead of training a large model from scratch, SANA‑1.5 grows a pretrained DiT from n layers to m layers, preserving its knowledge. Three initialization strategies are explored:

Partial‑retain initialization: keep the first k pretrained layers and randomly initialize the remaining m‑k layers, applying special handling to key components.

Cyclic copy initialization: periodically repeat pretrained layers for the new positions.

Block‑copy initialization: expand each pretrained layer into a consecutive block of layers according to a scaling factor s (e.g., 20 → 60 layers gives s = 3).
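The three strategies differ only in which pretrained layer, if any, seeds each position of the grown model. A minimal sketch of that mapping follows; the function names are hypothetical, `None` stands for a randomly initialised layer, and the "special handling of key components" mentioned above is not modelled.

```python
from typing import List, Optional


def partial_retain(n: int, m: int, k: int) -> List[Optional[int]]:
    """Keep the first k of n pretrained layers; the other m - k start random (None)."""
    if not 0 <= k <= n <= m:
        raise ValueError("expected 0 <= k <= n <= m")
    return [i if i < k else None for i in range(m)]


def cyclic_copy(n: int, m: int) -> List[Optional[int]]:
    """Position i of the grown model copies pretrained layer i mod n."""
    if n < 1 or m < n:
        raise ValueError("expected 1 <= n <= m")
    return [i % n for i in range(m)]


def block_copy(n: int, s: int) -> List[Optional[int]]:
    """Each pretrained layer becomes s consecutive copies (n layers -> n * s layers)."""
    if n < 1 or s < 1:
        raise ValueError("expected n >= 1 and s >= 1")
    return [i // s for i in range(n * s)]


if __name__ == "__main__":
    print(partial_retain(4, 6, 3))  # first three layers reused, rest random
    print(cyclic_copy(2, 6))        # layers repeat periodically
    print(block_copy(2, 3))         # each layer expanded into a block of three
```

Each list has one entry per layer of the grown model, so a weight-loading routine could iterate over it and either copy the indexed pretrained layer or call a random initialiser.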

Training is stabilised by inserting layer normalisation into the query and key paths of both linear self‑attention and cross‑attention, which prevents gradient explosions when new layers are added and enables rapid adaptation.
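To illustrate why normalising queries and keys helps, here is a small pure-Python sketch of linear self-attention with the normalisation applied. It rests on several assumptions not stated in this article: a ReLU feature map, layer norm without learnable scale or shift, a single head, and no cross-attention. All names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def layer_norm(x: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise one token vector to zero mean and unit variance (no affine terms)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def relu(x: List[float]) -> List[float]:
    return [max(v, 0.0) for v in x]


def linear_attention_qk_norm(q: Matrix, k: Matrix, v: Matrix, eps: float = 1e-6) -> Matrix:
    """Linear attention: out_i = phi(q_i) (sum_j phi(k_j)^T v_j) / (phi(q_i) . sum_j phi(k_j))."""
    qf = [relu(layer_norm(row)) for row in q]  # normalised query features
    kf = [relu(layer_norm(row)) for row in k]  # normalised key features
    d, dv, n = len(kf[0]), len(v[0]), len(kf)
    # Shared d x dv summary of keys and values, computed once for all queries.
    kv = [[sum(kf[j][a] * v[j][b] for j in range(n)) for b in range(dv)] for a in range(d)]
    k_sum = [sum(kf[j][a] for j in range(n)) for a in range(d)]
    out = []
    for row in qf:
        denom = sum(row[a] * k_sum[a] for a in range(d)) + eps
        out.append([sum(row[a] * kv[a][b] for a in range(d)) / denom for b in range(dv)])
    return out
```

Because queries and keys are normalised before the feature map, multiplying their raw activations by a large constant leaves the output essentially unchanged, which is the property that keeps freshly added layers from blowing up the attention statistics.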

3. Memory‑Efficient CAME‑8bit Optimizer

Based on CAME and AdamW‑8bit, CAME‑8bit reduces optimizer memory by factorising second‑moment matrices and applying block‑wise 8‑bit quantisation while keeping 32‑bit precision for critical statistics. In the comparison reported for SANA‑1.6 B, memory drops from ~57 GB (AdamW) to ~43 GB, a saving of about 25 %, enabling billion‑parameter training on a single A100 without sacrificing convergence.
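The block-wise quantisation idea can be sketched in a few lines. This is a simplified illustration only: it assumes symmetric absmax scaling to the integer range [-127, 127] with one float scale per block, whereas real 8-bit optimizers may use different code books and block sizes. The function names and the tiny block size are hypothetical.

```python
from typing import List, Tuple


def quantize_blockwise(values: List[float], block_size: int = 4) -> Tuple[List[int], List[float]]:
    """Quantise each block separately so that one outlier only affects its own block."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(x) for x in block) or 1.0  # avoid dividing by zero for all-zero blocks
        scales.append(scale)
        codes.extend(round(x / scale * 127) for x in block)
    return codes, scales


def dequantize_blockwise(codes: List[int], scales: List[float], block_size: int = 4) -> List[float]:
    """Map 8-bit codes back to floats using the scale of the block they belong to."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


if __name__ == "__main__":
    state = [0.001, -0.002, 0.0015, 0.0005, 10.0, -5.0, 2.5, 0.0]
    codes, scales = quantize_blockwise(state)
    print(codes, scales)
    print(dequantize_blockwise(codes, scales))
```

In the example, the first block contains only small values and keeps fine resolution even though the second block contains a value of 10.0; with a single global scale the small entries would all round to zero.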

4. Model Depth Pruning

Inspired by Minitron, block importance is measured via input‑output similarity across diffusion timesteps on a calibration set of 100 prompts. Head and tail blocks score high, middle blocks lower. Blocks are pruned starting from the least important, then the pruned model is fine‑tuned for 100 steps on one GPU, recovering high‑frequency details. The 4.8 B model pruned down to 1.6 B (20 blocks) regains quality comparable to the full 4.8 B model.
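A rough sketch of similarity-based block scoring is shown below. It assumes importance is defined as one minus the mean cosine similarity between a block's input and output activations, so a block that barely changes its input scores near zero. The exact formula used by the authors may differ, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def block_importance(inputs: List[Sequence[float]], outputs: List[Sequence[float]]) -> float:
    """Average over calibration samples (prompts x timesteps) of 1 - cos(input, output)."""
    sims = [cosine_similarity(x, y) for x, y in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)


def blocks_to_keep(scores: List[float], num_keep: int) -> List[int]:
    """Keep the num_keep highest-scoring blocks, returned in original layer order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.2, 0.8]  # toy scores: head and tail blocks matter most
    print(blocks_to_keep(scores, 2))
```

Returning the kept indices in their original order matters because the surviving blocks must still be applied in the same sequence as in the unpruned model.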

5. Inference‑Time Scaling

Increasing denoising steps yields diminishing returns; SANA reaches satisfactory visual quality by 20 steps, and more steps do not correct early‑stage errors. Instead, scaling the number of sampled candidates proves more effective. Multiple samples from a small model (1.6 B) are evaluated by a fine‑tuned VLM (NVILA‑2B) in a tournament format, selecting the best matches to the prompt. This “patient teacher” approach boosts accuracy without extra model capacity.
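The selection step can be pictured as a knockout tournament over the sampled candidates. The sketch below is a simplification: it returns a single winner rather than a set of best matches, and `prefer_first` is a hypothetical stand-in for a pairwise call to the VLM judge.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def tournament_best(candidates: List[T], prefer_first: Callable[[T, T], bool]) -> T:
    """Pairwise knockout: winners advance each round until one candidate remains."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            winners.append(a if prefer_first(a, b) else b)
        if len(pool) % 2 == 1:
            winners.append(pool[-1])  # unpaired candidate advances without a comparison
        pool = winners
    return pool[0]


if __name__ == "__main__":
    # Toy stand-in: candidates are alignment scores and the judge prefers the larger one.
    print(tournament_best([3, 9, 1, 7, 5], lambda a, b: a >= b))
```

A knockout over N candidates needs N - 1 judge calls, so the verification cost grows linearly with the number of samples rather than quadratically as in an all-pairs comparison.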

Experiments and Results

1. Experimental Setup

Model architecture: SANA‑4.8 B expands to 60 layers, keeping the same channel dimension (2240) and FFN dimension (5600) as SANA‑1.6 B. Training data and hyper‑parameters follow SANA‑1.6 B.

Training details: Distributed data‑parallel PyTorch on 8 DGX nodes (64 × NVIDIA A100). Two‑stage schedule: pre‑training at learning rate α, then supervised fine‑tuning at reduced β. Global batch size dynamically varies between 1024 and 4096.

Evaluation protocol: Metrics include FID and CLIP score on MJHQ‑30K, plus GenEval and DPG‑Bench, which use 533 and 1,065 test prompts respectively. GenEval is highlighted because it better reflects text‑image alignment.

2. Main Results

Model growth: Scaling from 1.6 B to 4.8 B improves GenEval by 0.06 (0.66 → 0.72), reduces FID by 0.34 (5.76 → 5.42), and raises DPG score by 0.2 (84.8 → 85.0). Compared with Playground v3 (24 B) and FLUX (12 B), SANA‑4.8 B attains comparable or better quality while using a fraction of the parameters, with 5.5× lower latency than FLUX‑dev (23.0 s) and 6.5× higher throughput (≈0.26 samples/s vs 0.04 samples/s on a single A100, FP16).

Model pruning: Grow‑then‑prune SANA‑1.5 models (pruned from the 4.8 B model) outperform from‑scratch training at the same compute budget (GenEval 0.672 vs 0.664). Visual comparison (Fig. 4) shows that aggressive pruning slightly harms fine details while semantic content remains, and a brief 100‑step fine‑tune restores quality.

Inference scaling: Combining inference scaling with SANA‑4.8 B raises overall GenEval accuracy by 8 percentage points over single‑sample generation (0.72 → 0.80), especially on the “color”, “position”, and “object” sub‑tasks. Compared with Playground v3, the scaled SANA‑4.8 B improves accuracy by 4 percentage points while using far fewer FLOPs.

Analysis

Optimizer comparison: CAME‑8bit reduces memory by 25 % (43 GB vs 57 GB) relative to AdamW while matching 32‑bit convergence on SANA‑1.6 B.
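The quoted percentage follows directly from the two memory figures; the short check below only redoes that arithmetic, with variable names chosen for illustration.

```python
adamw_gb = 57.0     # reported memory with AdamW
came8bit_gb = 43.0  # reported memory with CAME-8bit

saving = (adamw_gb - came8bit_gb) / adamw_gb
print(f"absolute saving: {adamw_gb - came8bit_gb:.0f} GB")
print(f"relative saving: {saving:.1%}")  # about 24.6 %, rounded to 25 % in the text
```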

Initialization strategies: Partial‑retain initialization yields stable training; cyclic and block‑copy strategies suffer NaN losses, confirming the importance analysis shown in Fig. 5.

Block importance guiding growth: Head and tail blocks contain most information; adding new blocks after these pretrained blocks initially fails to learn, prompting removal of the last two task‑relevant blocks before expansion.

Pruning guided by importance: Middle‑to‑late blocks have low importance scores; pruning them reduces parameters while preserving semantic layout. Fine‑tuning for 100 steps restores high‑frequency details.

Inference scaling law: Accuracy on GenEval continuously improves with more sampled candidates; a 1.6 B model with scaling can surpass a 4.8 B model without scaling. The trade‑off is increased FLOPs for sampling and VLM evaluation.
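One idealised way to see why more candidates help: if each sample independently satisfies the prompt with probability p and the verifier always picks a correct sample when one exists, the best-of-N success rate is 1 - (1 - p)^N. Both conditions are assumptions made here for illustration, not claims from the paper, so the values are an upper bound rather than a prediction of GenEval scores.

```python
def best_of_n_success(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct (perfect verifier)."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("expected 0 <= p <= 1 and n >= 1")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, best_of_n_success(0.5, n))
```

The curve rises quickly and then flattens, and an imperfect verifier lowers it further, which matches the trade-off noted above: each extra candidate costs generation and evaluation FLOPs for a shrinking gain.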

High‑quality data fine‑tuning: Fine‑tuning on a curated 3 M‑sample subset (out of 50 M pre‑training data) raises the 4.8 B model’s GenEval score by 3 %.

Conclusion

The paper proposes a comprehensive, efficient model‑scaling pipeline that tackles both training and inference compute challenges. By introducing the memory‑efficient CAME‑8bit optimizer, a stable model‑growth strategy, repeated‑sampling inference scaling, and depth‑wise pruning, SANA‑1.5 delivers high‑quality image generation under limited budgets, democratizing large‑scale AI research.
