How DiSA Accelerates Autoregressive Image Generation with Diffusion Step Annealing

The article introduces DiSA, a training‑free diffusion step annealing technique that dramatically speeds up autoregressive image generation by reducing diffusion steps in later generation phases while preserving high visual quality, and validates the method across several state‑of‑the‑art AR‑Diffusion models.

AI Frontier Lectures

Overview

This article reviews the DiSA (Diffusion Step Annealing) paradigm, which integrates the gradual annealing process of diffusion models into autoregressive (AR) image generation. DiSA improves sampling efficiency without sacrificing image quality by using many diffusion steps early on and far fewer steps later, addressing the instability of AR models on complex data.

Research Background

Recent AR models such as MAR, FlowAR, xAR, and Harmon incorporate diffusion sampling to boost image quality, but the diffusion process requires 50–100 denoising steps per token, causing substantial inference latency. For example, diffusion steps account for roughly 50% of MAR’s latency and up to 90% for xAR. Reducing diffusion steps directly speeds up inference but severely degrades quality.
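Why the diffusion share of latency matters can be seen with an Amdahl's-law-style estimate (illustrative, not from the paper): if diffusion accounts for a fraction p of total latency and only that part is accelerated by a factor k, the overall speed-up is bounded by 1 / ((1 − p) + p/k).

```python
def overall_speedup(diffusion_fraction, diffusion_speedup):
    """Amdahl-style bound on end-to-end speed-up when only the
    diffusion share of latency is accelerated.

    Both parameter names are illustrative; the numbers below plug in
    the article's rough latency shares (50% for MAR, 90% for xAR).
    """
    return 1.0 / ((1.0 - diffusion_fraction)
                  + diffusion_fraction / diffusion_speedup)

# A 10x faster diffusion head helps xAR (p ~= 0.9) far more than MAR (p ~= 0.5):
mar_bound = overall_speedup(0.5, 10.0)   # roughly 1.8x end-to-end
xar_bound = overall_speedup(0.9, 10.0)   # roughly 5.3x end-to-end
```

This is why models that spend most of their time in the diffusion head stand to gain the most from step reduction.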

Figure 1: Structures of four AR+Diffusion models: (a) MAR, (b) FlowAR, (c) xAR, (d) Harmon

Re‑thinking AR + Diffusion Models

Images are tokenized (e.g., via a VAE) into a sequence of discrete tokens. Autoregressive generation predicts the next token conditioned on previously generated tokens. Existing AR‑Diffusion models generate a batch of tokens per AR step and then apply a diffusion head to sample the next token, which incurs many diffusion steps.
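The structure described above can be sketched as a nested loop: an outer AR loop over token groups and an inner reverse-diffusion loop per group. The sketch below uses hypothetical `ar_model` and `diffusion_head` callables, not the actual MAR/xAR interfaces; it only shows where the per-token diffusion cost comes from.

```python
import numpy as np

def ar_diffusion_generate(ar_model, diffusion_head, num_ar_steps=64,
                          tokens_per_step=4, token_dim=16, diffusion_steps=50):
    """Sketch of an AR + diffusion generation loop (illustrative)."""
    tokens = []  # previously generated tokens: the AR context
    for _ in range(num_ar_steps):
        # The AR backbone conditions on everything generated so far.
        cond = ar_model(tokens)
        # Each new token group is sampled by running a full reverse
        # diffusion chain -- this inner loop is the expensive part.
        x = np.random.randn(tokens_per_step, token_dim)
        for t in reversed(range(diffusion_steps)):
            x = diffusion_head(x, t, cond)  # one denoising step
        tokens.extend(x)
    return tokens
```

With 50 inner steps per AR step, the diffusion head is invoked `num_ar_steps * diffusion_steps` times, which is the cost DiSA targets.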

Diagram of image tokenization and AR generation

Key Findings

1) Later AR steps produce tokens with lower variance, making them easier to predict.

Experiments on MAR show that as more tokens are generated, the distribution of the next token becomes increasingly constrained, leading to lower variance and more accurate predictions.

2) The variance of generated tokens decreases as generation progresses.

Sampling 10K images from MAR and measuring variance across 100 possible next tokens per step reveals a clear downward trend in variance during later steps.
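The measurement above reduces to a simple statistic: at each AR step, draw many candidate next tokens and average their per-dimension variance. A minimal sketch (function name and shapes are assumptions, not the paper's code):

```python
import numpy as np

def next_token_variance(candidates):
    """Mean per-dimension variance across candidate next tokens.

    `candidates` has shape (num_samples, token_dim), e.g. 100 possible
    next tokens sampled at one AR step. Tracking this value over AR
    steps would reproduce the downward trend the authors report.
    """
    return float(candidates.var(axis=0).mean())
```

A decreasing value at later steps indicates the next-token distribution is becoming more constrained.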

3) Diffusion trajectories become more linear in later stages.

Using the Straightness metric (cosine similarity between the diffusion score function and the straight line from noise to clean token), the authors observe that later diffusion paths align more closely with a straight line, suggesting that larger step sizes are feasible.
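As a rough sketch of the metric, straightness can be computed as the cosine similarity between the model's predicted direction at a trajectory point and the straight line from noise to the clean token. The exact formulation in the paper may differ; names and shapes here are illustrative.

```python
import numpy as np

def straightness(velocity, noise, clean):
    """Cosine similarity between a predicted velocity along the diffusion
    trajectory and the straight line from noise to the clean token.
    A value near 1 means a larger step along the line is safe."""
    line = (np.asarray(clean) - np.asarray(noise)).ravel()
    v = np.asarray(velocity).ravel()
    denom = np.linalg.norm(v) * np.linalg.norm(line) + 1e-12
    return float(np.dot(v, line) / denom)
```

When this value approaches 1 in later stages, fewer, larger denoising steps recover nearly the same endpoint, which is the observation DiSA exploits.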

Diffusion Step Annealing (DiSA)

Based on the observations, DiSA adopts a training‑free schedule: early generation phases use many diffusion steps (e.g., 50), while later phases use far fewer steps (e.g., 5). The method evaluates three schedulers—two‑stage, linear, and cosine—and finds the linear scheduler offers the best trade‑off.
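The linear scheduler can be sketched in a few lines: interpolate the diffusion-step budget from the maximum at the first AR step down to the minimum at the last. The function name and parameters below are illustrative, not the paper's API; 50 and 5 match the example counts above.

```python
def disa_linear_steps(ar_step, num_ar_steps, max_steps=50, min_steps=5):
    """Linearly anneal the diffusion-step budget over AR steps:
    the first AR step uses max_steps, the last uses min_steps."""
    frac = ar_step / max(num_ar_steps - 1, 1)
    return round(max_steps - frac * (max_steps - min_steps))
```

A two-stage scheduler would instead switch abruptly from `max_steps` to `min_steps` at a fixed AR step, and a cosine scheduler would follow a cosine curve between the two; the article reports the linear variant gives the best speed-quality trade-off.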

Illustration of DiSA time scheduler

Experimental Setup

Four pretrained models (MAR, FlowAR, xAR, Harmon) are evaluated on ImageNet 256×256 generation and on the GenEval T2I benchmark. Inference time is measured for a batch of 256 images on four NVIDIA A100 PCIe GPUs.

Results

Applying DiSA to MAR, FlowAR, and xAR yields 4–11× speed-ups with negligible quality loss. For MAR-B, DiSA achieves 5.7× acceleration with unchanged FID; for MAR-L, 5.1× acceleration with a 0.02 FID increase. Further reducing the number of autoregressive steps (e.g., to 32) in combination with DiSA reaches up to 11.3× speed-up.

Figure 5: ImageNet 256×256 results showing speed-up and quality preservation

DiSA also accelerates the Harmon T2I model by 5× (8 s per image) while maintaining comparable performance.

Figure 6: Harmon GenEval results with DiSA

Comparison with Other Acceleration Methods

DiSA outperforms CSpD and FAR and is competitive with LazyMAR, which is orthogonal and can be combined with DiSA. DiSA also complements existing diffusion‑specific accelerators such as DDIM, DPM‑Solver, and DPM‑Solver++.

Figure 7: Combining DiSA with other diffusion accelerators

Speed‑Quality Trade‑off

Across various AR steps and diffusion step configurations, DiSA consistently improves inference speed while preserving generation quality, as illustrated in the speed‑quality curves.

Figure 8: Speed-quality trade-off curves

Sample images generated with DiSA demonstrate comparable visual fidelity to baseline methods.

Figure 9: Sample generated images
Tags: Image Generation, diffusion, AI research, sampling efficiency, autoregressive, DiSA