MIT’s DRiffusion Achieves 1.4–3.7× Faster Diffusion Sampling via Draft‑and‑Refine Parallelism

MIT researchers introduce DRiffusion, a draft‑and‑refine parallel framework that uncovers intrinsic parallelism in diffusion models, delivering 1.4–3.7× speedups on up to four GPUs while preserving near‑lossless image quality across Stable Diffusion 2.1, SDXL, and SD3, evaluated on MS‑COCO.

HyperAI Super Neural

Overview

Diffusion models require many iterative denoising steps, making sampling slow and inference costly. Existing acceleration techniques such as rectified flows and knowledge distillation either sacrifice image quality or lack generality, while fast mathematical solvers often integrate poorly with mainstream deep‑learning frameworks.

MIT researchers show that diffusion processes contain untapped parallelism and introduce the DRiffusion draft‑and‑refine paradigm, which combines system‑level and mathematical insights to accelerate sampling without degrading output quality.

Skip‑Step Transition Operator

The core idea is to treat a skip‑step as an independent local operator. Closed‑form skip‑step formulas are derived for DDPM, DDIM, and ODE‑based solvers, allowing any two diffusion states x_t and x_{t‑k} to be linked directly without a globally scheduled timestep sequence.

For DDPM, a closed‑form expression maps state x_t to x_{t‑k}. DDIM extends this via marginal consistency, and ODE modeling interprets larger integration steps as skip‑steps. This operator enables parallel generation of multiple future states from a given anchor timestep.
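As an illustration, the deterministic DDIM skip‑step has a compact closed form: recover the predicted clean image from the current state, then re‑noise it to the target timestep's marginal. A minimal NumPy sketch follows (the function name and the linear schedule are our own, not from the paper):

```python
import numpy as np

def ddim_skip_step(x_t, eps_pred, t, t_prev, alpha_bar):
    """Deterministic DDIM update jumping from timestep t to t_prev (t_prev < t).

    alpha_bar: 1-D array of cumulative noise-schedule products ᾱ_0..ᾱ_T.
    The update only needs ᾱ at the two endpoints, so any gap k = t - t_prev
    is valid — this is the skip-step property the draft phase exploits.
    """
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    # Predicted clean image x̂_0, recovered from the current state and noise.
    x0_hat = (x_t - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)
    # Re-noise x̂_0 to the target timestep's marginal.
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_pred
```

Because nothing in the formula couples adjacent timesteps, multiple target states can be drafted from the same anchor in parallel.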

Draft‑and‑Refine Workflow

Given an anchor state x_t, the draft phase uses the skip‑step operator to generate estimates for the next k timesteps in parallel. Drafts are less precise because of larger step sizes but remain on the same denoising trajectory.

In the refine phase, each draft is fed into the noise predictor, producing noise estimates that are then used in the standard denoising update. The refined states become anchors for the next iteration, preserving quality while exploiting parallel computation.
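The two phases can be sketched as one iteration of a loop. This is a toy sequential stand‑in for the parallel system (function and variable names are ours): in the real framework the `k` refinement predictions run on separate devices.

```python
import numpy as np

def skip_step(x, eps, s, s_next, alpha_bar):
    """Deterministic DDIM-style jump from timestep s to s_next (s_next < s)."""
    a, a_next = alpha_bar[s], alpha_bar[s_next]
    x0_hat = (x - np.sqrt(1.0 - a) * eps) / np.sqrt(a)
    return np.sqrt(a_next) * x0_hat + np.sqrt(1.0 - a_next) * eps

def draft_and_refine_iter(x_t, t, k, eps_model, alpha_bar):
    """One draft-and-refine iteration (a sketch, not the paper's code).

    eps_model(x, s) -> predicted noise for state x at timestep s.
    """
    # Draft phase: one anchor prediction, then k cheap skip-step jumps that
    # all reuse it — these estimates can be produced in parallel.
    eps_anchor = eps_model(x_t, t)
    drafts = {t - j: skip_step(x_t, eps_anchor, t, t - j, alpha_bar)
              for j in range(1, k + 1)}
    # Refine phase: each draft gets its own noise prediction (the
    # parallelizable part), followed by a standard one-step update.
    refined = {s - 1: skip_step(x_s, eps_model(x_s, s), s, s - 1, alpha_bar)
               for s, x_s in drafts.items()}
    # The deepest refined state anchors the next iteration.
    return t - k - 1, refined[t - k - 1]
```

With a perfect noise predictor the iteration reproduces the exact denoising trajectory; in practice the drafts deviate slightly and the refinement step pulls them back.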

Aggressive vs. Conservative Modes

The aggressive version fully parallelizes all k noise predictions within one iteration, reducing the ideal runtime to 1/k of the original (a k‑fold speedup).

The conservative version first computes a high‑precision noise estimate at the current timestep, then runs the aggressive workflow, advancing one additional timestep per iteration; its ideal runtime is 2/(k+1) of the original. Both modes share the same principle: drafts provide parallelism, refinement safeguards quality.
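The ideal per‑sample runtime fractions implied by this analysis are easy to tabulate (a sketch of the arithmetic, assuming one wall‑clock model call per parallel batch):

```python
def ideal_runtime_fraction(k: int, mode: str) -> float:
    """Ideal runtime relative to fully sequential sampling.

    aggressive: k parallel noise predictions advance k timesteps per
                wall-clock model call  -> 1/k.
    conservative: one serial high-precision call plus one parallel batch
                advance k+1 timesteps per two wall-clock calls -> 2/(k+1).
    """
    if mode == "aggressive":
        return 1.0 / k
    return 2.0 / (k + 1)

for k in (2, 3, 4):
    print(k,
          ideal_runtime_fraction(k, "aggressive"),
          ideal_runtime_fraction(k, "conservative"))
```

For four devices, for example, the aggressive mode's ideal runtime is 25 % of sequential sampling, while the conservative mode's is 40 %.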

Experimental Setup

Benchmarks use the MS‑COCO 2017 validation set (5,000 images, 5 captions each, first caption only). Quality metrics include FID, CLIP score, PickScore, and Human Preference Score v2.1 (HPSv2.1). Efficiency is measured on up to four NVIDIA V100 GPUs, reporting relative speedup and extra memory overhead.

Baselines: (1) direct skip‑step (reducing sampling steps) and (2) AsyncDiff (asynchronous sub‑network sampling). AsyncDiff’s official implementation was reproduced under identical settings.

Results

Qualitative inspection shows DRiffusion retains semantic consistency and fine‑grained details (e.g., wood texture, cat‑eye highlights) even at high acceleration. Occasionally, larger step sizes improve contrast and sharpness; aggressive mode may introduce slight color oversaturation or minor artifacts.

Quantitatively, across all configurations DRiffusion’s FID remains close to baseline, with CLIP score drops ≤ 0.16. PickScore and HPSv2.1 average degradations are 0.17 and 0.43 respectively; the only outlier is SD3 in 4‑GPU aggressive mode where HPSv2.1 drops 1.50 due to its native 28‑step schedule.

Speedup ranges from 1.4× to 3.7× per sample, matching the theoretical runtime scalings of O(1/N) for aggressive mode and O(2/(N+1)) for conservative mode, where N is the number of devices. Memory overhead stays modest (186–226 MB), compared to AsyncDiff’s up to 574 MB and the baseline’s ~13 GB.

In all acceleration groups DRiffusion outperforms AsyncDiff and simple skip‑step baselines, reducing performance gaps by an average of 48.6 % (up to 58.5 % on four devices) when evaluated with PickScore.

Broader Context

Parallelization of diffusion models is an active research area. Related works include Fast‑dLLM (27.6× end‑to‑end speedup for large‑language diffusion models) and StreamDiffusionV2 (video generation at 58 FPS). DRiffusion demonstrates that exploiting intrinsic parallelism can achieve substantial acceleration without retraining models.

Paper: https://arxiv.org/abs/2603.25872

[Figure: Time dependency of different methods]
[Figure: Aggressive vs. Conservative flowchart]
[Figure: Quantitative results on MS‑COCO]
[Figure: Latency scaling of aggressive and conservative versions]
[Figure: SDXL comparison results]
Tags: Stable Diffusion, Diffusion Models, AI acceleration, draft-and-refine, DRiffusion, MS-COCO, parallel sampling
Written by

HyperAI Super Neural

Deconstructing the sophistication and universality of technology, covering cutting-edge AI for Science case studies.
