DFlash Boosts Large Model Inference Up to 6× – Now Supporting DeepSeek-V4
DFlash replaces the speculative draft model with a block‑diffusion drafter, generating 16 tokens per forward pass and achieving up to 6× speedup over baseline (2.5× over EAGLE‑3) without quality loss, while supporting a wide range of open‑source LLMs and multiple back‑ends.
Introduction
DFlash is a speculative decoding system that replaces the traditional autoregressive draft model with a block‑diffusion drafter, which emits a whole block of 16 tokens in a single forward pass. This yields up to 6× acceleration with no loss of output quality, and it plugs into existing inference servers as a drop‑in speculative decoding method.
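Why drafting does not change the output can be seen in the verification step that speculative decoding relies on. The sketch below shows the generic greedy variant, not DFlash's actual code: the function name `accept_draft` and the token ids are made up for illustration, and sampling with a non-zero temperature needs a probabilistic acceptance rule that is not shown here.

```python
from typing import List


def accept_draft(draft_tokens: List[int], target_tokens: List[int]) -> List[int]:
    """Greedy verification of one drafted block (hypothetical helper).

    draft_tokens:  the k tokens proposed by the drafter.
    target_tokens: the k + 1 tokens the target model would itself pick
                   (argmax) at each position, given the prefix plus the
                   drafted tokens before that position.
    Returns the tokens that are appended to the output.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")

    accepted: List[int] = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            # First mismatch: keep the target's own token and stop.
            accepted.append(verified)
            return accepted
        accepted.append(drafted)

    # Every drafted token matched, so the target's extra prediction comes for free.
    accepted.append(target_tokens[-1])
    return accepted


# Token ids below are made up for illustration.
print(accept_draft([5, 9, 2, 7], [5, 9, 4, 1, 3]))  # [5, 9, 4]
print(accept_draft([5, 9, 2, 7], [5, 9, 2, 7, 8]))  # [5, 9, 2, 7, 8]
```

Every emitted token is one the target model would have chosen anyway, so a better drafter only changes how many tokens are accepted per target pass, not which tokens appear.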
Core Insight
The hidden states of a large target LLM already contain information about future tokens (N+1, N+2, …). DFlash feeds these hidden features into the draft model, letting the draft "stand on the shoulders of the giant" instead of guessing from scratch.
Why Diffusion Works Best
Autoregressive draft models incur a cost that grows linearly with the number of drafted tokens, which forces prior work such as EAGLE‑3 to shrink the drafter to a single transformer layer and limits draft quality. In contrast, a diffusion drafter's cost is almost independent of token count because it generates an entire block in one forward pass.
A multi‑layer DFlash generating 16 tokens is faster than a 1‑layer EAGLE‑3 generating 8 tokens.
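A toy cost model makes the comparison concrete. It counts sequential layer executions only, on the assumption that drafting latency is dominated by the number of sequential passes rather than by per‑pass arithmetic. The layer counts are illustrative assumptions, not DFlash's published configuration, and `draft_cost` is a hypothetical helper.

```python
def draft_cost(num_layers: int, num_tokens: int, parallel: bool) -> int:
    """Count sequential layer executions needed to draft num_tokens tokens.

    Autoregressive drafter: one full forward pass per token.
    Block drafter: one forward pass for the whole block.
    """
    passes = 1 if parallel else num_tokens
    return num_layers * passes


# Assumed configurations, chosen only to illustrate the scaling.
autoregressive = draft_cost(num_layers=1, num_tokens=8, parallel=False)
block = draft_cost(num_layers=5, num_tokens=16, parallel=True)

print(autoregressive)  # 8
print(block)           # 5
```

Under this simplified model, a five‑layer block drafter producing 16 tokens needs fewer sequential steps than a one‑layer autoregressive drafter producing 8, which is the direction of the claim above.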
Implementation Details
Feature Fusion: Uniformly sample hidden features from multiple layers of the target model and apply a lightweight projection to fuse them.
KV Injection: Inject the fused features into the K/V cache of every layer of the draft model. This differs from EAGLE‑3, which only injects at the first layer.
Parallel Drafting: Using the enriched context, predict the next block of tokens in one step.
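The feature‑fusion step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the DFlash implementation: it reads "uniformly sample" as picking evenly spaced layers, models the lightweight projection as a single linear map over the concatenated features, and uses a hypothetical 36‑layer target with tiny two‑dimensional features. KV injection and parallel drafting are not covered.

```python
from typing import List, Sequence


def uniform_layer_indices(num_layers: int, num_taps: int) -> List[int]:
    """Pick num_taps layer indices spread evenly over 0 .. num_layers - 1."""
    if not 1 <= num_taps <= num_layers:
        raise ValueError("num_taps must be between 1 and num_layers")
    if num_taps == 1:
        return [num_layers - 1]
    step = (num_layers - 1) / (num_taps - 1)
    return [round(i * step) for i in range(num_taps)]


def fuse(features: Sequence[Sequence[float]],
         weight: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-layer feature vectors and apply a linear projection.

    weight has one row per output dimension and one column per
    concatenated input dimension.
    """
    concat = [x for layer in features for x in layer]
    if any(len(row) != len(concat) for row in weight):
        raise ValueError("each weight row must match the concatenated feature size")
    return [sum(w * x for w, x in zip(row, concat)) for row in weight]


# Hypothetical 36-layer target, tapped at 6 evenly spaced layers.
print(uniform_layer_indices(36, 6))  # [0, 7, 14, 21, 28, 35]

# Two tapped layers with 2-dimensional features, projected back to 2 dimensions.
layer_features = [[1.0, 2.0], [3.0, 4.0]]
projection = [[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]]
print(fuse(layer_features, projection))  # [1.5, 3.5]
```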
Benchmark Results (Qwen3‑8B)
Speedup over the autoregressive baseline (higher is better):
Task           Baseline  EAGLE‑3  DFlash
----------------------------------------
GSM8K          1×        2.13×    5.20×
MATH‑500       1×        2.18×    6.17×
AIME24         1×        2.25×    5.91×
AIME25         1×        2.18×    5.85×
HumanEval      1×        2.48×    5.20×
MBPP           1×        2.27×    4.75×
LiveCodeBench  1×        2.24×    5.43×
SWE‑Bench      1×        1.90×    2.92×
MT‑Bench       1×        1.94×    2.79×
Alpaca         1×        1.88×    2.27×
On the math benchmarks and most code benchmarks (GSM8K through LiveCodeBench), DFlash is more than twice as fast as EAGLE‑3; the gap narrows on SWE‑Bench, MT‑Bench, and Alpaca. Under temperature sampling (temp=1) and in "thinking" mode, DFlash maintains a stable ~4.5× speedup.
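The relative gain over EAGLE‑3 can be checked directly from the table above. The snippet below only divides the two reported speedups per task; the numbers are copied from the table, and the names `results` and `ratios` are introduced here for illustration.

```python
# (EAGLE-3 speedup, DFlash speedup) over the baseline, copied from the table.
results = {
    "GSM8K": (2.13, 5.20),
    "MATH-500": (2.18, 6.17),
    "AIME24": (2.25, 5.91),
    "AIME25": (2.18, 5.85),
    "HumanEval": (2.48, 5.20),
    "MBPP": (2.27, 4.75),
    "LiveCodeBench": (2.24, 5.43),
    "SWE-Bench": (1.90, 2.92),
    "MT-Bench": (1.94, 2.79),
    "Alpaca": (1.88, 2.27),
}

# DFlash speedup divided by EAGLE-3 speedup gives DFlash's gain over EAGLE-3.
ratios = {task: dflash / eagle3 for task, (eagle3, dflash) in results.items()}

for task, ratio in ratios.items():
    print(f"{task:<14}{ratio:.2f}x")
```

The ratio is largest on MATH‑500 (about 2.83×) and smallest on Alpaca (about 1.21×).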
Supported Models
Gemma series: gemma‑4‑26B‑A4B‑it, gemma‑4‑31B‑it
Qwen series: Qwen3.6‑27B, Qwen3.6‑35B‑A3B, Qwen3.5‑4B/9B/27B/35B‑A3B/122B‑A10B
Coder series: Qwen3‑Coder‑Next, Qwen3‑Coder‑30B‑A3B
Major vendor models: MiniMax‑M2.5 (preview), Kimi‑K2.5
OSS: gpt‑oss‑20b, gpt‑oss‑120b
Coming soon: DeepSeek‑V4‑Flash, V4‑Pro, MiniMax‑M2.7, GLM‑5.1
Installation
DFlash can be used with four back‑ends. Install the desired extra with pip:
# Run these from the root of a local DFlash checkout.
# Transformers
uv pip install -e ".[transformers]"
# SGLang
uv pip install -e ".[sglang]"
# vLLM (v0.20.1+ supports the kernel)
uv pip install -e ".[vllm]"
# MLX (Apple Silicon)
pip install -e ".[mlx]"
For Gemma 4 on vLLM, a pre‑built Docker image is provided:
docker pull ghcr.io/z-lab/vllm-openai:gemma4-dflash-cu130
Usage Examples
vLLM (Qwen3.5‑27B)
vllm serve Qwen/Qwen3.5-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768
Gemma 4 via Docker
docker run --rm -it \
  --gpus all --ipc=host --shm-size=16g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/z-lab/vllm-openai:gemma4-dflash-cu130 \
  google/gemma-4-26B-A4B-it \
  --host 0.0.0.0 --port 8000 \
  --speculative-config '{"method": "dflash", "model": "z-lab/gemma-4-26B-A4B-it-DFlash", "num_speculative_tokens": 15, "attention_backend": "flash_attn"}' \
  --attention-backend triton_attn \
  --max-num-batched-tokens 32768 \
  --trust-remote-code
SGLang
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-35B-A3B \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-35B-A3B-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 1 \
  --attention-backend trtllm_mha \
  --speculative-draft-attention-backend fa4 \
  --mem-fraction-static 0.75 \
  --trust-remote-code
Transformers (Python)
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
# Load the DFlash drafter (custom model code, hence trust_remote_code=True).
draft = AutoModel.from_pretrained(
    "z-lab/Qwen3-8B-DFlash-b16",
    trust_remote_code=True, dtype="auto", device_map="cuda:0").eval()
# Load the target model that verifies the drafted tokens.
target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", dtype="auto", device_map="cuda:0").eval()
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "How many positive whole-number divisors does 196 have?"}]
input_ids = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True,
    enable_thinking=False).to(draft.device)
# Speculative generation is called on the drafter, with the target passed in.
output = draft.spec_generate(
    input_ids=input_ids, max_new_tokens=2048, temperature=0.0,
    target=target, stop_token_ids=[tokenizer.eos_token_id])
print(tokenizer.decode(output[0], skip_special_tokens=False))
MLX (Apple Silicon)
from dflash.model_mlx import load, load_draft, stream_generate
model, tokenizer = load("Qwen/Qwen3.5-4B")
draft = load_draft("z-lab/Qwen3.5-4B-DFlash")
messages = [{"role": "user", "content": "How many positive whole-number divisors does 196 have?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
tps = 0.0  # stays 0.0 if the generator yields nothing
for r in stream_generate(model, draft, tokenizer, prompt, block_size=16, max_tokens=2048, temperature=0.6):
    print(r.text, end="", flush=True)
    tps = r.generation_tps
print(f"\nThroughput: {tps:.2f} tok/s")
Conclusion
DFlash redefines the role of diffusion models: they only need to serve as a fast, accurate drafter, while the target model guarantees final quality.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
