Qwen3.5-27B-DFlash Delivers Up to 5× Faster Inference Without Quality Loss

The DFlash approach replaces speculative decoding’s autoregressive drafter with a block diffusion model and injects target‑model hidden features into every KV‑cache layer, achieving up to 5× speed‑up for Qwen3.5‑27B on single‑GPU and 1.5–1.9× on high‑concurrency workloads while preserving output quality.


Introduction

Qwen3.5‑27B suffers from low GPU utilization due to the serial nature of autoregressive decoding, where each token is generated one after another. Traditional speculative decoding (e.g., EAGLE‑3) mitigates this by using a small autoregressive drafter, but its speed‑up is limited to 2–3× because the drafter must run a forward pass for every guessed token.
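To see why that serial loop is the bottleneck, here is a minimal draft‑and‑verify sketch in Python (the draft_forward and target_forward_verify stand‑ins are illustrative toys, not the real model APIs): k guesses cost k sequential drafter forwards before the target can verify them in a single pass.

import random

# Toy stand-ins: each "forward" would be one transformer call in a real system.
def draft_forward(ctx):
    # Small autoregressive drafter: cheap, but strictly one token per call.
    return random.randrange(100)

def target_forward_verify(ctx, guesses):
    # One big target forward scores ALL guesses at once; a prefix is accepted.
    accepted = []
    for g in guesses:
        if random.random() < 0.7:  # stand-in for the token acceptance test
            accepted.append(g)
        else:
            break
    return accepted

def speculative_step(ctx, k=4):
    guesses = []
    for _ in range(k):  # k guesses = k SEQUENTIAL drafter forwards,
        guesses.append(draft_forward(ctx + guesses))  # the source of the 2-3x cap
    return ctx + target_forward_verify(ctx, guesses)

print(speculative_step([1, 2, 3]))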

Why DFlash Is Faster

1. Diffusion‑based guessing – DFlash employs a block‑diffusion drafter that generates an entire block of tokens (e.g., 8 or 16) in a single forward pass, allowing a deeper (5‑layer) drafter with higher expressive power and lower latency than EAGLE‑3’s 1‑layer drafter.

2. Target‑model feature injection – Features from the hidden layers of the target model are copied into every KV‑cache layer of the drafter, keeping the speculative signal consistent across layers. This contrasts with EAGLE‑3, which injects target features only at the input layer, causing degradation as depth increases; a sketch of both ideas follows this list.
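A minimal PyTorch sketch of the two mechanisms above (the layer structure, names, and the concatenation‑based injection are illustrative assumptions, not the released DFlash internals): the drafter fills an entire masked block in one forward pass, and projected target features are fed to every drafter layer rather than only the input.

import torch
import torch.nn as nn

class BlockDiffusionDrafter(nn.Module):
    """Illustrative 5-layer block-diffusion drafter; real DFlash internals may differ."""
    def __init__(self, d_model=512, n_layers=5, vocab=32000, block=16):
        super().__init__()
        self.block = block
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        # One projection per layer: target features are injected at EVERY depth,
        # not just at the input (EAGLE-3 injects only at the input layer).
        self.inject = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.mask_emb = nn.Parameter(torch.zeros(1, 1, d_model))
        self.lm_head = nn.Linear(d_model, vocab)  # DFlash reuses the target's head, frozen

    def forward(self, target_features):
        # target_features: (batch, ctx_len, d_model) hidden states from the target model.
        b = target_features.size(0)
        # Start from a fully masked block; ONE forward pass drafts all `block` tokens.
        x = self.mask_emb.expand(b, self.block, -1)
        for layer, proj in zip(self.layers, self.inject):
            ctx = proj(target_features)  # per-layer feature injection
            x = layer(torch.cat([ctx, x], dim=1))[:, -self.block:, :]
        return self.lm_head(x).argmax(-1)  # (batch, block) draft token ids

feats = torch.randn(1, 32, 512)          # toy target features
print(BlockDiffusionDrafter()(feats).shape)  # torch.Size([1, 16])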

Popular DFlash Variants

The DFlash family covers many models, including Qwen3, Qwen3.5, Kimi‑K2.5, and gpt‑oss. The Qwen3.5‑27B‑DFlash variant is the most downloaded on Hugging Face (5.2k+ downloads, 47 likes), and its drafter has only 2 B parameters, making it lightweight.

Benchmark Results

On an NVIDIA B200 GPU, Qwen3.5‑27B with block size 16 and thinking mode enabled achieved the following throughput in tokens per second (tok/s):

Math500 (single concurrency): 84 → 397 tok/s (4.7×)

HumanEval (single concurrency): 83 → 427 tok/s (5.2×)

MT‑Bench (single concurrency): 84 → 255 tok/s (3.0×)

Under 32‑way concurrency, speed‑up remains between 1.5× and 1.9×, demonstrating robustness in production‑grade settings.

Acceptance length (the number of draft tokens accepted per verification step) also improves; on HumanEval it rises from 7.38 to 9.18 tokens per draft block.
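As a back‑of‑envelope sanity check on these figures (the per‑cycle overhead ratio below is an assumed value, not a measurement from the source): in speculative decoding, each draft‑verify cycle yields roughly the acceptance length plus one bonus token from the verifier.

def estimated_speedup(acceptance_len, overhead_ratio):
    # Rough speculative-decoding model: each cycle yields the accepted draft
    # tokens plus one verifier token, at the cost of one target forward plus
    # drafter/verification overhead.
    tokens_per_cycle = acceptance_len + 1
    cycle_cost = 1.0 + overhead_ratio
    return tokens_per_cycle / cycle_cost

# With HumanEval's 9.18 accepted tokens and an ASSUMED ~0.9 relative overhead
# per cycle, the estimate lands near the reported ~5x:
print(round(estimated_speedup(9.18, 0.9), 1))  # 5.4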

Installation & Usage

DFlash is integrated with three major inference frameworks:

vLLM (recommended for production)

SGLang

Transformers (quick local testing)

# Install nightly vLLM (DFlash support currently requires a nightly build)
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

# Launch the service with the DFlash drafter
vllm serve Qwen/Qwen3.5-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768
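Once the server is up, it exposes vLLM's standard OpenAI‑compatible API (default port 8000). Speculation happens entirely server‑side, so client code is unchanged; note also that num_speculative_tokens of 15 presumably pairs with the block size of 16 used in the benchmarks (15 drafted tokens plus the verifier's one).

from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the DFlash drafter is transparent
# to clients, so this request is identical with or without it.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B",
    messages=[{"role": "user", "content": "Write a primality check in Python."}],
)
print(resp.choices[0].message.content)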

Similar commands are provided for SGLang and Transformers, including a tip to enable sliding‑window attention for long contexts.

Technical Details

Training cost: the drafter stays cheap to train because it reuses the target model's embedding layer and LM head (both frozen) and only trains the intermediate transformer layers. For the 27 B target, the drafter has 2 B parameters and was trained on ~800 k samples generated from NVIDIA Nemotron and CodeAlpaca.
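A minimal sketch of that parameter split (the TinyLM stand‑in and its attribute names are hypothetical, not the real module layout): share and freeze the target's embedding and LM head, and optimize only the drafter's own layers.

import torch
import torch.nn as nn

class TinyLM(nn.Module):
    # Hypothetical stand-in; real module names in DFlash may differ.
    def __init__(self, d=64, vocab=1000, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.blocks = nn.ModuleList(nn.Linear(d, d) for _ in range(layers))
        self.lm_head = nn.Linear(d, vocab)

target, drafter = TinyLM(layers=6), TinyLM(layers=2)

# Share the target's embedding and LM head with the drafter, frozen;
# only the drafter's intermediate layers receive gradients.
drafter.embed, drafter.lm_head = target.embed, target.lm_head
for module in (drafter.embed, drafter.lm_head):
    for p in module.parameters():
        p.requires_grad_(False)

trainable = [p for p in drafter.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable), "trainable drafter parameters")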

Single‑step denoising: DFlash performs only one denoising step during inference, unlike conventional diffusion models that require multiple iterations. The injected target‑model hidden features make a single step sufficient, keeping the drafter lightweight compared to methods like DiffuSpec.
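To make the contrast concrete (the denoise callable below is a stand‑in, not a real API): a conventional discrete‑diffusion sampler refines the block over several forwards, while DFlash calls the drafter exactly once per block.

def multi_step_decode(denoise, block, steps=8):
    # Conventional discrete diffusion: iterative refinement, `steps` forwards.
    for t in reversed(range(steps)):
        block = denoise(block, t)
    return block

def dflash_decode(denoise, masked_block):
    # DFlash: target-feature conditioning makes ONE denoising pass sufficient,
    # so drafting costs a single small forward per block.
    return denoise(masked_block, 0)

# Toy denoiser that just fills masked positions (None) with a token id:
toy = lambda blk, t: [0 if tok is None else tok for tok in blk]
print(dflash_decode(toy, [None] * 4))  # [0, 0, 0, 0]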

Reasoning models : When the target model runs in “thinking” mode, DFlash still provides roughly 4.5× acceleration, which is valuable for long‑chain reasoning tasks.

Summary

The core contribution of DFlash is that diffusion models need not compete with autoregressive models on generation quality; they only need to be excellent guessers. By pairing a fast diffusion‑based drafter with a high‑quality autoregressive verifier, DFlash achieves up to 5× speed‑up with zero quality loss.

Up to 5× speed‑up on single‑GPU workloads, well beyond the 2–3× typical of EAGLE‑3.

Maintains identical output quality to the original model.

Very lightweight drafter (2 B parameters for the 27 B target).

Supported by vLLM, SGLang, and Transformers.

Broad model coverage (Qwen3, Qwen3.5, Kimi‑K2.5, gpt‑oss, etc.).

Limitations

Requires nightly builds of vLLM/SGLang; stability may vary.

Drafter training code is not yet open‑source (release promised).

Speed‑up diminishes under very high concurrency, a known issue of speculative decoding.

DFlash variants for Qwen3.5‑122B and 397B are still pending.

For inference services based on Qwen3.5‑27B, adding the 2 B DFlash drafter is a near‑free way to multiply throughput.

Figure: DFlash architecture diagram