How dInfer Accelerates Diffusion LLM Inference: Over 10× Faster Than Fast‑dLLM
Ant Group's open‑source dInfer framework dramatically speeds up diffusion language model inference, achieving more than a ten‑fold boost over Fast‑dLLM, surpassing autoregressive baselines, and delivering 1011 tokens per second on HumanEval. It does so by tackling computational cost, KV‑cache invalidation, and parallel decoding challenges through modular, system‑level innovations.
Introducing dInfer: A High‑Performance Diffusion LLM Inference Framework
Ant Group has open‑sourced dInfer, the first industry‑level high‑performance inference framework for diffusion large language models (dLLMs). In benchmarks, dInfer runs more than ten times faster than Fast‑dLLM and achieves a record 1011 tokens/second on HumanEval, surpassing highly optimized autoregressive models.
Why Diffusion LLMs Need Faster Inference
Traditional autoregressive (AR) models generate tokens sequentially, limiting parallelism. Diffusion LLMs generate text by iteratively denoising from random noise, offering three theoretical advantages: high parallelism, global context awareness, and structural flexibility. However, they face three core inference challenges:
High computational cost: Multi‑step denoising requires repeated full‑sequence computation (a toy loop after this list makes the baseline concrete).
KV‑cache invalidation: Bidirectional attention changes KV values each iteration, breaking the efficient KV‑cache used by AR models.
Parallel decoding trade‑off: Decoding many tokens simultaneously can cause semantic mismatches, degrading quality.
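To make these challenges concrete, here is a minimal sketch of masked‑diffusion decoding, the baseline loop the list above refers to. All names (`dummy_model`, `MASK_ID`, the confidence‑based unmasking rule) are illustrative assumptions, not dInfer's API: every iteration reruns the model over the full sequence, and because attention is bidirectional, the K/V of every position can change between iterations.

```python
import torch

MASK_ID = 0        # hypothetical mask-token id
VOCAB = 32000
SEQ_LEN = 64

def dummy_model(ids: torch.Tensor) -> torch.Tensor:
    """Stand-in for a dLLM forward pass: logits for EVERY position."""
    return torch.randn(ids.shape[0], VOCAB)

ids = torch.full((SEQ_LEN,), MASK_ID)              # start from all-masked "noise"
for _ in range(SEQ_LEN):                           # worst case: one token per step
    logits = dummy_model(ids)                      # full-sequence recomputation
    conf, pred = logits.softmax(-1).max(-1)
    conf = conf.masked_fill(ids != MASK_ID, -1.0)  # ignore already-decoded slots
    pos = conf.argmax()                            # unmask the most confident slot
    ids[pos] = pred[pos]                           # K/V of all tokens may now change
```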
dInfer’s Modular Architecture
dInfer adopts a plug‑and‑play design with four core modules: Model, KV‑Cache Manager, Iteration Manager, and Decoder, enabling developers to mix and match optimizations like building with LEGO bricks.
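As a rough illustration of that modularity, the sketch below wires four swappable components into one pipeline. The interfaces are assumptions made for illustration; dInfer's actual APIs may look different.

```python
from typing import Protocol
import torch

class KVCacheManager(Protocol):
    def refresh(self, ids: torch.Tensor, block: slice) -> None: ...

class IterationManager(Protocol):
    def next_block(self) -> slice: ...
    def done(self) -> bool: ...

class Decoder(Protocol):
    def decode(self, logits: torch.Tensor, ids: torch.Tensor) -> torch.Tensor: ...

class Pipeline:
    """Composes independently swappable modules, LEGO-style."""
    def __init__(self, model, cache: KVCacheManager,
                 iters: IterationManager, decoder: Decoder):
        self.model, self.cache = model, cache
        self.iters, self.decoder = iters, decoder

    def generate(self, ids: torch.Tensor) -> torch.Tensor:
        while not self.iters.done():
            block = self.iters.next_block()    # which region to work on next
            self.cache.refresh(ids, block)     # e.g. a vicinity-refresh policy
            logits = self.model(ids)
            ids = self.decoder.decode(logits, ids)
        return ids
```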
Key Optimizations in dInfer
1. Reducing Computation Cost: Vicinity KV‑Cache Refresh
dInfer refreshes KV entries only for the current block and its immediate neighbors, based on the principle of semantic locality, avoiding full recomputation while preserving generation quality.
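A minimal sketch of the vicinity idea, assuming a per‑block cache of (K, V) tensors and a hypothetical `recompute_kv` callback; only blocks within a small radius of the active block are recomputed:

```python
def vicinity_refresh(kv_cache, recompute_kv, cur_block, radius=1):
    """kv_cache: list of per-block (K, V) pairs; recompute_kv(i) returns
    fresh (K, V) for block i from the current token ids."""
    lo = max(0, cur_block - radius)
    hi = min(len(kv_cache) - 1, cur_block + radius)
    for i in range(lo, hi + 1):           # only the semantic "vicinity"
        kv_cache[i] = recompute_kv(i)     # distant blocks keep stale entries
    return kv_cache
```

Distant blocks keep slightly stale K/V, which the semantic-locality assumption suggests costs little quality compared with recomputing everything.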
2. System‑Level Optimizations
dInfer leverages multi‑GPU parallelism (tensor + expert parallelism), torch.compile kernel fusion with CUDA‑graph capture, loop unrolling to eliminate GPU idle bubbles, and early stopping once an EOS token is generated, collectively boosting throughput by over 200%.
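Two of those optimizations are easy to sketch in PyTorch. The snippet below is an assumed usage pattern, not dInfer's code: `torch.compile(mode="reduce-overhead")` fuses kernels and replays the step as a CUDA graph, and the loop exits as soon as an EOS token appears. `EOS_ID`, `MASK_ID`, and `unmask_one` are illustrative placeholders.

```python
import torch

EOS_ID = 2     # hypothetical special-token ids
MASK_ID = 0

def unmask_one(ids, logits):
    """Toy decoder: fill the most confident still-masked position."""
    conf, pred = logits.softmax(-1).max(-1)
    conf = conf.masked_fill(ids != MASK_ID, -1.0)
    pos = conf.argmax()
    ids = ids.clone()
    ids[pos] = pred[pos]
    return ids

@torch.compile(mode="reduce-overhead")   # kernel fusion + CUDA-graph replay
def decode_step(model, ids):
    return model(ids)

def generate(model, ids, max_iters):
    for _ in range(max_iters):           # fixed-trip loops can also be unrolled
        ids = unmask_one(ids, decode_step(model, ids))
        if (ids == EOS_ID).any():        # early stop once EOS is generated
            return ids
    return ids
```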
3. Parallel Decoding Strategies
Hierarchical Decoding recursively splits the decoding region and decodes the central token of each sub‑region first, keeping simultaneously decoded tokens far apart to reduce interference. Credit Decoding accumulates confidence over iterations, allowing tokens that are predicted consistently, even with lower instantaneous confidence, to be emitted early.
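Toy versions of both strategies, under assumed formulations (the exact scoring rules here are ours, not dInfer's): `hierarchical_order` visits the center of each region before recursing into the halves, so tokens decoded in the same step stay far apart, and `update_credit` keeps a decayed running score that rewards predictions which stay stable across iterations.

```python
import torch

def hierarchical_order(lo: int, hi: int) -> list[int]:
    """Center-first recursive order, e.g. [0..6] -> [3, 1, 0, 2, 5, 4, 6]."""
    if lo > hi:
        return []
    mid = (lo + hi) // 2
    return [mid] + hierarchical_order(lo, mid - 1) + hierarchical_order(mid + 1, hi)

def update_credit(credit, prev_pred, pred, conf, decay=0.9):
    """Accumulate credit where the argmax prediction repeats; a token whose
    credit crosses a threshold can be committed even if its instantaneous
    confidence alone would not qualify."""
    stable = (pred == prev_pred).float()
    return decay * credit + stable * conf
```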
4. Iteration Smoothing
dInfer reuses logits from undecoded positions to create weighted embeddings, enriching context and increasing the average number of tokens decoded per iteration by 30–40%.
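A sketch of that smoothing step under assumed shapes (`ids` is [seq], `logits` is [seq, vocab], `embed` an `nn.Embedding`): still‑masked positions get an expectation of token embeddings over the previous iteration's top‑k predictions rather than the plain mask embedding.

```python
import torch

def smoothed_inputs(ids, logits, embed, mask_id=0, top_k=8):
    """Mix token embeddings by the previous iteration's probabilities at
    positions still masked; decoded positions keep their own embedding."""
    emb = embed(ids)                                    # [seq, d]
    probs = logits.softmax(-1)                          # [seq, vocab]
    top_p, top_i = probs.topk(top_k, dim=-1)            # [seq, k]
    top_p = top_p / top_p.sum(-1, keepdim=True)         # renormalize the top-k
    mix = (top_p.unsqueeze(-1) * embed(top_i)).sum(-2)  # expected embedding [seq, d]
    still_masked = (ids == mask_id).unsqueeze(-1)       # [seq, 1]
    return torch.where(still_masked, mix, emb)
```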
Benchmark Results
On a node with eight NVIDIA H800 GPUs, dInfer achieves:
10.7× speedup over Fast‑dLLM (681 vs 63.6 TPS) with comparable model quality.
2.5× faster than the state‑of‑the‑art AR model Qwen2.5‑3B on vLLM (681 vs 277 TPS).
1011 tokens/second on HumanEval, making it the first open‑source diffusion LLM to surpass AR models in single‑batch, latency‑sensitive scenarios.
When combined with Trajectory Distillation, dInfer reaches 847 TPS, more than three times the AR baseline.
Open‑Source Impact
dInfer v0.1, along with its code, technical report, and experimental configs, is fully open‑source, aiming to become a standard platform for dLLM research and an acceleration engine for developers, bridging cutting‑edge AI research with real‑world deployment.