How DiSA Accelerates Autoregressive Image Generation with Diffusion Step Annealing
The article introduces DiSA, a training‑free diffusion step annealing technique that dramatically speeds up autoregressive image generation by reducing diffusion steps in later generation phases while preserving high visual quality, and validates the method across several state‑of‑the‑art AR‑Diffusion models.
Overview
This article reviews DiSA (Diffusion Step Annealing), a training‑free method that anneals the diffusion sampling schedule over the course of autoregressive (AR) image generation. DiSA improves sampling efficiency without sacrificing image quality by using many diffusion steps for early tokens, where the next‑token distribution is broad, and far fewer steps for later tokens, where it is tightly constrained.
Research Background
Recent AR models such as MAR, FlowAR, xAR, and Harmon incorporate diffusion sampling to boost image quality, but the diffusion process requires 50–100 denoising steps per token, causing substantial inference latency. For example, diffusion steps account for roughly 50% of MAR's latency and up to 90% of xAR's. Naively reducing the number of diffusion steps speeds up inference but severely degrades quality.
Re‑thinking AR + Diffusion Models
Images are tokenized (e.g., via a VAE encoder) into a sequence of continuous tokens. Autoregressive generation then predicts the next token(s) conditioned on previously generated tokens. Existing AR‑Diffusion models predict a set of tokens per AR step and run a diffusion head to sample each token's value, so every AR step pays the full denoising schedule; a minimal sketch of this loop follows.
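To make the cost structure concrete, here is a minimal, self‑contained sketch of that loop. The module names (ar_backbone, diffusion_head), shapes, and the simplified update rule are illustrative placeholders, not any of these models' actual APIs.

```python
import torch

# Placeholder stand-ins for the AR backbone and the diffusion head;
# real models (MAR, xAR, ...) are far larger, but the loop shape is the same.
ar_backbone = torch.nn.Linear(16, 16)     # predicts the condition for the next token
diffusion_head = torch.nn.Linear(32, 16)  # denoises a token given that condition

def sample_token(condition: torch.Tensor, num_steps: int) -> torch.Tensor:
    """Sample one continuous token by iterative denoising (heavily simplified)."""
    x = torch.randn_like(condition)              # start from pure noise
    for _ in range(num_steps):
        eps = diffusion_head(torch.cat([x, condition], dim=-1))
        x = x - eps / num_steps                  # toy update; real heads use DDPM/flow steps
    return x

def generate(num_ar_steps: int = 64, steps_per_token: int = 100) -> torch.Tensor:
    tokens, cond = [], torch.zeros(1, 16)
    for _ in range(num_ar_steps):
        cond = ar_backbone(cond)                 # condition on tokens generated so far
        tokens.append(sample_token(cond, steps_per_token))
        cond = tokens[-1]
    # Total denoiser calls: num_ar_steps * steps_per_token -- the cost DiSA attacks.
    return torch.cat(tokens, dim=0)
```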
Key Findings
1) Later AR steps produce tokens with lower variance, making them easier to predict.
Experiments on MAR show that as more tokens are generated, the distribution of the next token becomes increasingly constrained, leading to lower variance and more accurate predictions.
2) The variance of generated tokens decreases as generation progresses.
Sampling 10K images from MAR and, at each AR step, measuring the variance across 100 candidate next tokens reveals a clear downward trend in variance during later steps (a minimal sketch of this measurement appears after this list).
3) Diffusion trajectories become more linear in later stages.
Using a Straightness metric (the cosine similarity between the diffusion score direction and the straight line from noise to the clean token), the authors observe that later diffusion trajectories align more closely with a straight line, suggesting that larger step sizes are feasible (see the second sketch after this list).
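Here is a minimal sketch of the variance measurement, assuming a sample_fn that draws one candidate token given a fixed condition; the paper's exact measurement protocol may differ.

```python
import torch

def next_token_variance(sample_fn, condition: torch.Tensor,
                        num_candidates: int = 100) -> float:
    """Average per-dimension variance across candidate next tokens drawn
    under the same condition. Lower values mean the next token is more
    tightly constrained by the tokens generated so far."""
    candidates = torch.stack([sample_fn(condition) for _ in range(num_candidates)])
    return candidates.var(dim=0).mean().item()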
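And a sketch of one plausible reading of the straightness measurement: the mean cosine similarity between each denoising step's displacement and the overall chord from initial noise to the final clean token. The paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def straightness(trajectory: torch.Tensor) -> torch.Tensor:
    """trajectory: (T+1, D) intermediate states, from pure noise (row 0)
    to the clean token (row -1). Returns a value in [-1, 1]; values near 1
    mean the path is nearly straight, so a few large steps suffice."""
    chord = trajectory[-1] - trajectory[0]        # noise -> clean direction
    steps = trajectory[1:] - trajectory[:-1]      # per-step displacements
    return F.cosine_similarity(steps, chord.expand_as(steps), dim=-1).mean()
```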
Diffusion Step Annealing (DiSA)
Based on these observations, DiSA adopts a training‑free schedule: early AR steps use many diffusion steps (e.g., 50), while later steps use far fewer (e.g., 5). The authors evaluate three schedulers (two‑stage, linear, and cosine) and find that the linear scheduler offers the best speed‑quality trade‑off; a minimal version of these schedules is sketched below.
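The sketch below maps AR‑step progress to a diffusion step count for all three schedule shapes. The boundary constants (50 max, 5 min, a 0.5 switch point for the two‑stage variant) are illustrative assumptions, not the paper's exact settings.

```python
import math

def disa_num_steps(ar_step: int, total_ar_steps: int,
                   max_steps: int = 50, min_steps: int = 5,
                   mode: str = "linear") -> int:
    """Diffusion steps to run at a given AR step under step annealing."""
    progress = ar_step / max(total_ar_steps - 1, 1)        # 0.0 (first) -> 1.0 (last)
    if mode == "two_stage":
        return max_steps if progress < 0.5 else min_steps  # assumed switch point
    if mode == "cosine":
        frac = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0, cosine-shaped
    else:  # "linear" -- the best trade-off per the paper
        frac = 1.0 - progress
    return round(min_steps + (max_steps - min_steps) * frac)
```

In the generation loop sketched earlier, this replaces the fixed steps_per_token, e.g. sample_token(cond, disa_num_steps(i, num_ar_steps)), so the per‑image cost drops from num_ar_steps × 100 denoiser calls to roughly the average of the annealed schedule.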
Experimental Setup
Four pretrained models (MAR, FlowAR, xAR, Harmon) are evaluated on ImageNet 256×256 generation and on the GenEval T2I benchmark. Inference time is measured for a batch of 256 images on four NVIDIA A100 PCIe GPUs.
Results
Applying DiSA to MAR, FlowAR, and xAR yields 4–11× speed‑ups with negligible quality loss. For MAR‑B, DiSA achieves 5.7× acceleration with unchanged FID; for MAR‑L, 5.1× acceleration with a 0.02 FID increase. Reducing autoregressive steps further (e.g., 32 steps) combined with DiSA can reach up to 11.3× speed‑up.
DiSA also accelerates the Harmon T2I model by about 5×, bringing generation down to roughly 8 s per image while maintaining comparable performance.
Comparison with Other Acceleration Methods
DiSA outperforms CSpD and FAR and is competitive with LazyMAR, an orthogonal technique that can be combined with DiSA for further gains. DiSA also complements existing diffusion‑specific samplers such as DDIM, DPM‑Solver, and DPM‑Solver++.
Speed‑Quality Trade‑off
Across various AR steps and diffusion step configurations, DiSA consistently improves inference speed while preserving generation quality, as illustrated in the speed‑quality curves.
Sample images generated with DiSA demonstrate comparable visual fidelity to baseline methods.