How DFlash Achieves 8× Lossless Acceleration for Large‑Model Inference (Qwen3.5‑27B Example)
This article explains how DFlash's block-diffusion draft model and KV Injection speed up speculative decoding by 5-8× without sacrificing output quality, and how DDTree pushes the gain past 8×, backed by benchmark results and integration guides for major inference frameworks.
Background: Speculative Decoding
Large language models generate text token by token, which becomes the primary bottleneck regardless of GPU power. Speculative decoding mitigates this by letting a smaller draft model quickly guess a sequence of tokens, which the large model then verifies in a single forward pass; correct guesses speed up inference, while incorrect ones are simply corrected.
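The draft-then-verify loop can be sketched with toy stand-in functions (the "models" here are deliberately trivial, not real networks):

```python
def draft_model(prefix, k):
    """Toy draft model: cheaply guess the next k tokens.
    A real draft is a small network; here we just count upward."""
    return [prefix[-1] + 1 + i for i in range(k)]

def target_model(prefix, k):
    """Toy target model: the 'true' next k + 1 tokens. A real target
    scores all k draft tokens in ONE forward pass during verification."""
    return [prefix[-1] + 1 + i for i in range(k + 1)]

def speculative_step(prefix, k=4):
    """One round of speculative decoding: draft k tokens, verify them
    against the target, keep the longest matching prefix, and take one
    extra token from the target (a correction on mismatch, a bonus token
    if everything matched). The output is identical to decoding with the
    target alone, which is what makes the method lossless."""
    draft = draft_model(prefix, k)
    target = target_model(prefix, k)   # single verification pass
    accepted = []
    for d, t in zip(draft, target):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)         # target's correction token
            break
    else:
        accepted.append(target[k])     # all accepted -> bonus token
    return prefix + accepted

print(speculative_step([0]))  # → [0, 1, 2, 3, 4, 5]
```

Because these toy models always agree, every round accepts all k drafts plus a bonus token; real speedups depend on the draft's acceptance rate.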
DFlash – Replacing Autoregressive Drafts with Block Diffusion
DFlash (Block Diffusion for Flash Speculative Decoding) from Z Lab introduces a lightweight block diffusion model that generates an entire token block (block size = 16) in one forward pass, eliminating the “slow guessing” problem of traditional draft models.
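Drafting a whole block at once can be sketched as confidence-based parallel unmasking (a simplification; DFlash's actual sampler and schedule may differ):

```python
MASK = None  # placeholder for a not-yet-decoded position

def block_diffusion_draft(predict_fn, block_size=16, steps=4):
    """Start from a fully masked block and, over a few parallel passes,
    commit the positions the model is most confident about while
    re-predicting the rest: here 4 passes replace 16 sequential steps.
    predict_fn(block) returns a (token, confidence) pair per position."""
    block = [MASK] * block_size
    per_step = block_size // steps
    for _ in range(steps):
        preds = predict_fn(block)                 # one parallel pass
        masked = [i for i, t in enumerate(block) if t is MASK]
        masked.sort(key=lambda i: -preds[i][1])   # most confident first
        for i in masked[:per_step]:
            block[i] = preds[i][0]                # commit this position
    return block

# Toy predictor: proposes token i at position i, more confident at
# earlier positions (a stand-in for the real denoising network).
def toy_predictor(block):
    return [(i, 1.0 / (i + 1)) for i in range(len(block))]

print(block_diffusion_draft(toy_predictor))  # fills all 16 positions in 4 passes
```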
The key technique is KV Injection: hidden features from multiple layers of the target model are fused into the draft model's KV cache, giving the small draft enough context to make high-quality predictions.
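One way to picture KV Injection is as a fuse-then-project step; the weighted-sum fusion and the random projections below are illustrative assumptions, not DFlash's actual parameterization:

```python
import numpy as np

def kv_injection(target_hidden, layer_weights, d_draft=4, seed=0):
    """Sketch of KV Injection: fuse hidden states captured from several
    layers of the target model into one feature per position, then
    project into the draft model's (smaller) key/value space.

    target_hidden: (num_layers, seq_len, d_model) activations recorded
    during the target's forward pass.
    layer_weights: (num_layers,) per-layer mixing weights (hypothetical)."""
    # Weighted sum over the layer axis -> (seq_len, d_model)
    fused = np.tensordot(layer_weights, target_hidden, axes=1)
    d_model = fused.shape[-1]
    rng = np.random.default_rng(seed)
    # Stand-in learned projections into the draft's KV dimension.
    W_k = rng.standard_normal((d_model, d_draft)) / np.sqrt(d_model)
    W_v = rng.standard_normal((d_model, d_draft)) / np.sqrt(d_model)
    return fused @ W_k, fused @ W_v   # injected K, V for the draft's cache

# Example: 3 captured layers, 5 positions, hidden size 8.
k, v = kv_injection(np.ones((3, 5, 8)), np.array([0.5, 0.3, 0.2]))
```

The point of the sketch: the draft never recomputes the target's context from scratch; it reads it out of an injected cache.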
Benchmark results (T = 0.0) show speedups of:
HumanEval: 6.09× (Qwen3‑30B‑MoE)
MATH‑500: 6.17× (Qwen3‑8B)
GSM8K: 5.20× (Qwen3‑8B)
AIME24: 5.91× (Qwen3‑8B)
MBPP: 4.75× (Qwen3‑8B)
Compared with the popular EAGLE‑3 approach (≈2‑3×), DFlash is about 2.5× faster, reaching 5‑6× acceleration even in sampling mode (Temperature = 1) where many methods degrade.
DDTree – Extending DFlash with a Draft Tree
DDTree (Diffusion Draft Tree), built on DFlash by Liran Ringel, constructs a probability‑tree of multiple promising draft paths using a best‑first heap algorithm, then validates the entire tree in a single forward pass of the target model.
Four‑step DDTree workflow:
Block diffusion generates probability distributions for L positions.
Best‑first heap builds an optimal draft tree under a node budget B.
Tree attention compiles the tree into the target model’s input.
Verification traverses the tree: matching nodes continue, mismatches trigger a bonus token for the next round.
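Steps 1-2 above can be sketched as a best-first expansion under a node budget (a toy version with fixed per-depth distributions; the real draft distributions depend on the path taken):

```python
import heapq
from itertools import count

def build_draft_tree(dists, budget):
    """Best-first draft-tree construction under a node budget B.
    dists[d] maps token -> probability at depth d. A max-heap keyed by
    path probability always admits the most promising candidate next,
    so after `budget` pops the tree contains the highest-probability
    draft paths under these distributions."""
    tie = count()  # tiebreaker so the heap never compares dicts
    root = {"token": None, "children": []}
    heap = [(-p, next(tie), 0, tok, root) for tok, p in dists[0].items()]
    heapq.heapify(heap)
    for _ in range(budget):
        if not heap:
            break
        neg_p, _, depth, tok, parent = heapq.heappop(heap)
        node = {"token": tok, "children": []}
        parent["children"].append(node)            # admit best candidate
        if depth + 1 < len(dists):                 # enqueue its children
            for t2, p2 in dists[depth + 1].items():
                heapq.heappush(heap, (neg_p * p2, next(tie), depth + 1, t2, node))
    return root

# Two draft positions, budget of 3 nodes: admits "a", "a->x", and "b".
dists = [{"a": 0.7, "b": 0.3}, {"x": 0.6, "y": 0.4}]
tree = build_draft_tree(dists, budget=3)
```

Note how the budget is spent where probability mass is: the likely branch "a" gets a child while the unlikely branch "b" stays a leaf.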
The method has a mathematical guarantee that the constructed tree maximizes the expected accepted length under the draft model’s distribution.
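Stated informally in symbols (our notation, not necessarily the paper's): among all trees with at most B nodes, the constructed tree maximizes the expected accepted length A(T) under the draft distribution q:

```latex
T^{*} \;=\; \arg\max_{T \,:\, |T| \le B} \; \mathbb{E}_{q}\!\left[ A(T) \right]
```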
On HumanEval (T = 0.0), DDTree lifts DFlash's 6.09× speedup to 8.22×, an extra 2.13× of absolute speedup, while remaining completely lossless: the output distribution matches that of unaccelerated decoding.
Supported Models and Integration
DFlash draft models are available for several mainstream LLMs, including Kimi‑K2.5, Qwen3.5‑4B/9B/27B, Qwen3.5‑35B‑A3B, Qwen3‑Coder‑30B‑A3B, and LLaMA‑3.1‑8B‑Instruct. Drafts for larger models such as Qwen3.5‑122B, 397B, and GLM‑5.1 are in progress.
Integration commands:

SGLang:

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-35B-A3B \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-35B-A3B-DFlash \
  --tp-size 1 --attention-backend trtllm_mha

vLLM:

vllm serve Qwen/Qwen3.5-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}'

Apple Silicon (MLX):

pip install -e ".[mlx]"

The DDTree benchmark can be run with:
git clone https://github.com/liranringel/ddtree
cd ddtree
pip install -r requirements.txt
bash run_benchmark.sh
python3 plot_results.py
Conclusion
The DFlash + DDTree combination represents the next stage of speculative decoding, delivering over 8× lossless acceleration for large‑model inference and already being usable in SGLang, vLLM, and Apple Silicon (MLX) frameworks, effectively offering a “free lunch” for deployment teams.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
