How DFlash Achieves 8× Lossless Acceleration for Large‑Model Inference (Qwen3.5‑27B Example)
This article explains how DFlash's block-diffusion draft model and KV Injection speed up speculative decoding by 5-8× without sacrificing output quality, and how DDTree pushes the gain past 8×, backed by benchmark results and integration guides for major inference frameworks.
Background: Speculative Decoding
Large language models generate text token by token, which becomes the primary bottleneck regardless of GPU power. Speculative decoding mitigates this by letting a smaller draft model quickly guess a sequence of tokens, which the large model then verifies in a single forward pass; correct guesses speed up inference, while incorrect ones are simply corrected.
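The draft-then-verify loop can be sketched with toy stand-in functions (the "models" here are deliberately trivial, not real networks):

```python
def draft_model(prefix, k):
    """Toy draft model: cheaply guess the next k tokens.
    A real draft is a small network; here we just count upward."""
    return [prefix[-1] + 1 + i for i in range(k)]

def target_model(prefix, k):
    """Toy target model: the 'true' next k + 1 tokens. A real target
    scores all k draft tokens in ONE forward pass during verification."""
    return [prefix[-1] + 1 + i for i in range(k + 1)]

def speculative_step(prefix, k=4):
    """One round of speculative decoding: draft k tokens, verify them
    against the target, keep the longest matching prefix, and take one
    extra token from the target (a correction on mismatch, a bonus token
    if everything matched). The output is identical to decoding with the
    target alone, which is what makes the method lossless."""
    draft = draft_model(prefix, k)
    target = target_model(prefix, k)   # single verification pass
    accepted = []
    for d, t in zip(draft, target):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)         # target's correction token
            break
    else:
        accepted.append(target[k])     # all accepted -> bonus token
    return prefix + accepted

print(speculative_step([0]))  # → [0, 1, 2, 3, 4, 5]
```

Because these toy models always agree, every round accepts all k drafts plus a bonus token; real speedups depend on the draft's acceptance rate.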
DFlash – Replacing Autoregressive Drafts with Block Diffusion
DFlash (Block Diffusion for Flash Speculative Decoding) from Z Lab introduces a lightweight block diffusion model that generates an entire token block (block size = 16) in one forward pass, eliminating the “slow guessing” problem of traditional draft models.
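Drafting a whole block at once can be sketched as confidence-based parallel unmasking (a simplification; DFlash's actual sampler and schedule may differ):

```python
MASK = None  # placeholder for a not-yet-decoded position

def block_diffusion_draft(predict_fn, block_size=16, steps=4):
    """Start from a fully masked block and, over a few parallel passes,
    commit the positions the model is most confident about while
    re-predicting the rest: here 4 passes replace 16 sequential steps.
    predict_fn(block) returns a (token, confidence) pair per position."""
    block = [MASK] * block_size
    per_step = block_size // steps
    for _ in range(steps):
        preds = predict_fn(block)                 # one parallel pass
        masked = [i for i, t in enumerate(block) if t is MASK]
        masked.sort(key=lambda i: -preds[i][1])   # most confident first
        for i in masked[:per_step]:
            block[i] = preds[i][0]                # commit this position
    return block

# Toy predictor: proposes token i at position i, more confident at
# earlier positions (a stand-in for the real denoising network).
def toy_predictor(block):
    return [(i, 1.0 / (i + 1)) for i in range(len(block))]

print(block_diffusion_draft(toy_predictor))  # fills all 16 positions in 4 passes
```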
The key technique is KV Injection: hidden features from multiple layers of the target model are fused into the draft model's KV cache, giving the small draft enough context to make high-quality predictions.
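One way to picture KV Injection is as a fuse-then-project step; the weighted-sum fusion and the random projections below are illustrative assumptions, not DFlash's actual parameterization:

```python
import numpy as np

def kv_injection(target_hidden, layer_weights, d_draft=4, seed=0):
    """Sketch of KV Injection: fuse hidden states captured from several
    layers of the target model into one feature per position, then
    project into the draft model's (smaller) key/value space.

    target_hidden: (num_layers, seq_len, d_model) activations recorded
    during the target's forward pass.
    layer_weights: (num_layers,) per-layer mixing weights (hypothetical)."""
    # Weighted sum over the layer axis -> (seq_len, d_model)
    fused = np.tensordot(layer_weights, target_hidden, axes=1)
    d_model = fused.shape[-1]
    rng = np.random.default_rng(seed)
    # Stand-in learned projections into the draft's KV dimension.
    W_k = rng.standard_normal((d_model, d_draft)) / np.sqrt(d_model)
    W_v = rng.standard_normal((d_model, d_draft)) / np.sqrt(d_model)
    return fused @ W_k, fused @ W_v   # injected K, V for the draft's cache

# Example: 3 captured layers, 5 positions, hidden size 8.
k, v = kv_injection(np.ones((3, 5, 8)), np.array([0.5, 0.3, 0.2]))
```

The point of the sketch: the draft never recomputes the target's context from scratch; it reads it out of an injected cache.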
Benchmark results (T = 0.0) show speedups of:
HumanEval: 6.09× (Qwen3‑30B‑MoE)
MATH‑500: 6.17× (Qwen3‑8B)
GSM8K: 5.20× (Qwen3‑8B)
AIME24: 5.91× (Qwen3‑8B)
MBPP: 4.75× (Qwen3‑8B)
Compared with the popular EAGLE‑3 approach (≈2‑3×), DFlash is about 2.5× faster, reaching 5‑6× acceleration even in sampling mode (Temperature = 1) where many methods degrade.
DDTree – Extending DFlash with a Draft Tree
DDTree (Diffusion Draft Tree), built on DFlash by Liran Ringel, constructs a probability‑tree of multiple promising draft paths using a best‑first heap algorithm, then validates the entire tree in a single forward pass of the target model.
Four‑step DDTree workflow:
Block diffusion generates probability distributions for L positions.
Best‑first heap builds an optimal draft tree under a node budget B.
Tree attention compiles the tree into the target model’s input.
Verification traverses the tree: matching nodes continue, mismatches trigger a bonus token for the next round.
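Steps 1-2 above can be sketched as a best-first expansion under a node budget (a toy version with fixed per-depth distributions; the real draft distributions depend on the path taken):

```python
import heapq
from itertools import count

def build_draft_tree(dists, budget):
    """Best-first draft-tree construction under a node budget B.
    dists[d] maps token -> probability at depth d. A max-heap keyed by
    path probability always admits the most promising candidate next,
    so after `budget` pops the tree contains the highest-probability
    draft paths under these distributions."""
    tie = count()  # tiebreaker so the heap never compares dicts
    root = {"token": None, "children": []}
    heap = [(-p, next(tie), 0, tok, root) for tok, p in dists[0].items()]
    heapq.heapify(heap)
    for _ in range(budget):
        if not heap:
            break
        neg_p, _, depth, tok, parent = heapq.heappop(heap)
        node = {"token": tok, "children": []}
        parent["children"].append(node)            # admit best candidate
        if depth + 1 < len(dists):                 # enqueue its children
            for t2, p2 in dists[depth + 1].items():
                heapq.heappush(heap, (neg_p * p2, next(tie), depth + 1, t2, node))
    return root

# Two draft positions, budget of 3 nodes: admits "a", "a->x", and "b".
dists = [{"a": 0.7, "b": 0.3}, {"x": 0.6, "y": 0.4}]
tree = build_draft_tree(dists, budget=3)
```

Note how the budget is spent where probability mass is: the likely branch "a" gets a child while the unlikely branch "b" stays a leaf.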
The method has a mathematical guarantee that the constructed tree maximizes the expected accepted length under the draft model’s distribution.
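Stated informally in symbols (our notation, not necessarily the paper's): among all trees with at most B nodes, the constructed tree maximizes the expected accepted length A(T) under the draft distribution q:

```latex
T^{*} \;=\; \arg\max_{T \,:\, |T| \le B} \; \mathbb{E}_{q}\!\left[ A(T) \right]
```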
On HumanEval (T = 0.0), DDTree lifts DFlash's 6.09× speedup to 8.22×, an extra 2.13× of absolute speedup, while remaining completely lossless: the output distribution matches that of unaccelerated decoding.
Supported Models and Integration
DFlash draft models are available for several mainstream LLMs, including Kimi‑K2.5, Qwen3.5‑4B/9B/27B, Qwen3.5‑35B‑A3B, Qwen3‑Coder‑30B‑A3B, and LLaMA‑3.1‑8B‑Instruct. Drafts for larger models such as Qwen3.5‑122B, 397B, and GLM‑5.1 are in progress.
Integration commands:

SGLang:

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-35B-A3B \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-35B-A3B-DFlash \
  --tp-size 1 --attention-backend trtllm_mha

vLLM:

vllm serve Qwen/Qwen3.5-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}'

Apple Silicon (MLX):

pip install -e ".[mlx]"

The DDTree benchmark can be run with:
git clone https://github.com/liranringel/ddtree
cd ddtree
pip install -r requirements.txt
bash run_benchmark.sh
python3 plot_results.py
Conclusion
The DFlash + DDTree combination represents the next stage of speculative decoding, delivering over 8× lossless acceleration for large‑model inference and already being usable in SGLang, vLLM, and Apple Silicon (MLX) frameworks, effectively offering a “free lunch” for deployment teams.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
