How dInfer Accelerates Diffusion LLM Inference: Over 10× Faster Than Fast‑dLLM
Ant Group's open‑source dInfer framework dramatically speeds up diffusion language model inference, achieving more than a ten‑fold boost over Fast‑dLLM, surpassing autoregressive baselines, and delivering 1011 tokens per second on HumanEval. It does so by tackling computational cost, KV‑cache invalidation, and parallel decoding challenges through modular, system‑level innovations.
Introducing dInfer: A High‑Performance Diffusion LLM Inference Framework
Ant Group has open‑sourced dInfer, the first industry‑level high‑performance inference framework for diffusion large language models (dLLMs). In benchmarks, dInfer runs more than ten times faster than Fast‑dLLM and achieves a record 1011 tokens/second on HumanEval, surpassing highly optimized autoregressive models.
Why Diffusion LLMs Need Faster Inference
Traditional autoregressive (AR) models generate tokens sequentially, limiting parallelism. Diffusion LLMs generate text by iteratively denoising from random noise, offering three theoretical advantages: high parallelism, global context awareness, and structural flexibility. However, they face three core inference challenges:
High computational cost: Multi‑step denoising requires repeated full‑sequence computation (a toy loop after this list makes the baseline concrete).
KV‑cache invalidation: Bidirectional attention changes KV values each iteration, breaking the efficient KV‑cache used by AR models.
Parallel decoding trade‑off: Decoding many tokens simultaneously can cause semantic mismatches, degrading quality.
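To make these challenges concrete, here is a minimal sketch of masked‑diffusion decoding, the baseline loop the list above refers to. All names (`dummy_model`, `MASK_ID`, the confidence‑based unmasking rule) are illustrative assumptions, not dInfer's API: every iteration reruns the model over the full sequence, and because attention is bidirectional, the K/V of every position can change between iterations.

```python
import torch

MASK_ID = 0        # hypothetical mask-token id
VOCAB = 32000
SEQ_LEN = 64

def dummy_model(ids: torch.Tensor) -> torch.Tensor:
    """Stand-in for a dLLM forward pass: logits for EVERY position."""
    return torch.randn(ids.shape[0], VOCAB)

ids = torch.full((SEQ_LEN,), MASK_ID)              # start from all-masked "noise"
for _ in range(SEQ_LEN):                           # worst case: one token per step
    logits = dummy_model(ids)                      # full-sequence recomputation
    conf, pred = logits.softmax(-1).max(-1)
    conf = conf.masked_fill(ids != MASK_ID, -1.0)  # ignore already-decoded slots
    pos = conf.argmax()                            # unmask the most confident slot
    ids[pos] = pred[pos]                           # K/V of all tokens may now change
```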
dInfer’s Modular Architecture
dInfer adopts a plug‑and‑play design with four core modules: Model, KV‑Cache Manager, Iteration Manager, and Decoder, enabling developers to mix and match optimizations like building with LEGO bricks.
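As a rough illustration of that modularity, the sketch below wires four swappable components into one pipeline. The interfaces are assumptions made for illustration; dInfer's actual APIs may look different.

```python
from typing import Protocol
import torch

class KVCacheManager(Protocol):
    def refresh(self, ids: torch.Tensor, block: slice) -> None: ...

class IterationManager(Protocol):
    def next_block(self) -> slice: ...
    def done(self) -> bool: ...

class Decoder(Protocol):
    def decode(self, logits: torch.Tensor, ids: torch.Tensor) -> torch.Tensor: ...

class Pipeline:
    """Composes independently swappable modules, LEGO-style."""
    def __init__(self, model, cache: KVCacheManager,
                 iters: IterationManager, decoder: Decoder):
        self.model, self.cache = model, cache
        self.iters, self.decoder = iters, decoder

    def generate(self, ids: torch.Tensor) -> torch.Tensor:
        while not self.iters.done():
            block = self.iters.next_block()    # which region to work on next
            self.cache.refresh(ids, block)     # e.g. a vicinity-refresh policy
            logits = self.model(ids)
            ids = self.decoder.decode(logits, ids)
        return ids
```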
Key Optimizations in dInfer
1. Reducing Computation Cost: Vicinity KV‑Cache Refresh
dInfer refreshes KV entries only for the current block and its immediate neighbors, based on the principle of semantic locality, avoiding full recomputation while preserving generation quality.
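A minimal sketch of the vicinity idea, assuming a per‑block cache of (K, V) tensors and a hypothetical `recompute_kv` callback; only blocks within a small radius of the active block are recomputed:

```python
def vicinity_refresh(kv_cache, recompute_kv, cur_block, radius=1):
    """kv_cache: list of per-block (K, V) pairs; recompute_kv(i) returns
    fresh (K, V) for block i from the current token ids."""
    lo = max(0, cur_block - radius)
    hi = min(len(kv_cache) - 1, cur_block + radius)
    for i in range(lo, hi + 1):           # only the semantic "vicinity"
        kv_cache[i] = recompute_kv(i)     # distant blocks keep stale entries
    return kv_cache
```

Distant blocks keep slightly stale K/V, which the semantic-locality assumption suggests costs little quality compared with recomputing everything.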
2. System‑Level Optimizations
dInfer leverages multi‑GPU parallelism (tensor + expert parallelism), torch.compile kernel fusion with CUDA‑graph capture, loop unrolling to eliminate GPU idle bubbles, and early stopping once an EOS token is generated, collectively boosting throughput by over 200%.
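Two of those optimizations are easy to sketch in PyTorch. The snippet below is an assumed usage pattern, not dInfer's code: `torch.compile(mode="reduce-overhead")` fuses kernels and replays the step as a CUDA graph, and the loop exits as soon as an EOS token appears. `EOS_ID`, `MASK_ID`, and `unmask_one` are illustrative placeholders.

```python
import torch

EOS_ID = 2     # hypothetical special-token ids
MASK_ID = 0

def unmask_one(ids, logits):
    """Toy decoder: fill the most confident still-masked position."""
    conf, pred = logits.softmax(-1).max(-1)
    conf = conf.masked_fill(ids != MASK_ID, -1.0)
    pos = conf.argmax()
    ids = ids.clone()
    ids[pos] = pred[pos]
    return ids

@torch.compile(mode="reduce-overhead")   # kernel fusion + CUDA-graph replay
def decode_step(model, ids):
    return model(ids)

def generate(model, ids, max_iters):
    for _ in range(max_iters):           # fixed-trip loops can also be unrolled
        ids = unmask_one(ids, decode_step(model, ids))
        if (ids == EOS_ID).any():        # early stop once EOS is generated
            return ids
    return ids
```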
3. Parallel Decoding Strategies
Hierarchical Decoding recursively splits the decoding region and decodes the central token of each sub‑region first, keeping simultaneously decoded tokens far apart to reduce interference. Credit Decoding accumulates confidence over iterations, allowing tokens that are predicted consistently, even with lower instantaneous confidence, to be emitted early.
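Toy versions of both strategies, under assumed formulations (the exact scoring rules here are ours, not dInfer's): `hierarchical_order` visits the center of each region before recursing into the halves, so tokens decoded in the same step stay far apart, and `update_credit` keeps a decayed running score that rewards predictions which stay stable across iterations.

```python
import torch

def hierarchical_order(lo: int, hi: int) -> list[int]:
    """Center-first recursive order, e.g. [0..6] -> [3, 1, 0, 2, 5, 4, 6]."""
    if lo > hi:
        return []
    mid = (lo + hi) // 2
    return [mid] + hierarchical_order(lo, mid - 1) + hierarchical_order(mid + 1, hi)

def update_credit(credit, prev_pred, pred, conf, decay=0.9):
    """Accumulate credit where the argmax prediction repeats; a token whose
    credit crosses a threshold can be committed even if its instantaneous
    confidence alone would not qualify."""
    stable = (pred == prev_pred).float()
    return decay * credit + stable * conf
```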
4. Iteration Smoothing
dInfer reuses logits from undecoded positions to create weighted embeddings, enriching context and increasing the average number of tokens decoded per iteration by 30–40%.
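A sketch of that smoothing step under assumed shapes (`ids` is [seq], `logits` is [seq, vocab], `embed` an `nn.Embedding`): still‑masked positions get an expectation of token embeddings over the previous iteration's top‑k predictions rather than the plain mask embedding.

```python
import torch

def smoothed_inputs(ids, logits, embed, mask_id=0, top_k=8):
    """Mix token embeddings by the previous iteration's probabilities at
    positions still masked; decoded positions keep their own embedding."""
    emb = embed(ids)                                    # [seq, d]
    probs = logits.softmax(-1)                          # [seq, vocab]
    top_p, top_i = probs.topk(top_k, dim=-1)            # [seq, k]
    top_p = top_p / top_p.sum(-1, keepdim=True)         # renormalize the top-k
    mix = (top_p.unsqueeze(-1) * embed(top_i)).sum(-2)  # expected embedding [seq, d]
    still_masked = (ids == mask_id).unsqueeze(-1)       # [seq, 1]
    return torch.where(still_masked, mix, emb)
```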
Benchmark Results
On a node with eight NVIDIA H800 GPUs, dInfer achieves:
10.7× speedup over Fast‑dLLM (681 vs 63.6 TPS) with comparable model quality.
2.5× faster than the state‑of‑the‑art AR model Qwen2.5‑3B on vLLM (681 vs 277 TPS).
1011 tokens/second on HumanEval, making it the first open‑source diffusion LLM to surpass AR models in single‑batch, latency‑sensitive scenarios.
When combined with Trajectory Distillation, dInfer reaches 847 TPS, more than three times the AR baseline.
Open‑Source Impact
dInfer v0.1, along with its code, technical report, and experimental configs, is fully open‑source, aiming to become a standard platform for dLLM research and an acceleration engine for developers, bridging cutting‑edge AI research with real‑world deployment.