DFlash Boosts Large Model Inference Up to 6× – Now Supporting DeepSeek-V4

DFlash replaces the speculative draft model with a block‑diffusion drafter, generating 16 tokens per forward pass and achieving up to 6× speedup over baseline (2.5× over EAGLE‑3) without quality loss, while supporting a wide range of open‑source LLMs and multiple back‑ends.

Old Zhang's AI Learning

Introduction

DFlash is a speculative decoding system that replaces the traditional autoregressive draft model with a block‑diffusion drafter, which emits a whole block of 16 tokens in a single forward pass. This yields up to 6× acceleration with no loss of output quality, and it plugs into existing inference back‑ends (Transformers, SGLang, vLLM, and MLX) rather than requiring a new serving stack.
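
The "no loss of output quality" property comes from the verification step common to speculative decoding: the target model checks each drafted token and keeps only what it would have produced itself. Below is a minimal sketch of greedy verification. It is generic speculative decoding, not DFlash's actual code, and `toy_target_next` and `toy_draft_block` are hypothetical stand-ins for the target model and the block drafter.

```python
from typing import Callable, List

Tokens = List[int]


def verify_block(context: Tokens, draft: Tokens,
                 target_next: Callable[[Tokens], int]) -> Tokens:
    """Keep the longest draft prefix the target agrees with, plus one target token."""
    accepted: Tokens = []
    for tok in draft:
        # A real system scores the whole block in one target forward pass;
        # the per-token loop here is only for readability.
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # all matched: bonus token
    return accepted


def speculative_generate(prompt: Tokens, draft_block, target_next,
                         max_new_tokens: int, block_size: int = 16) -> Tokens:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)
        out.extend(verify_block(out, draft, target_next))
    return out[: len(prompt) + max_new_tokens]


def greedy_generate(prompt: Tokens, target_next, max_new_tokens: int) -> Tokens:
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def toy_target_next(tokens: Tokens) -> int:
    """Hypothetical deterministic 'target model'."""
    return (tokens[-1] * 7 + len(tokens)) % 101


def toy_draft_block(tokens: Tokens, k: int) -> Tokens:
    """Hypothetical drafter: mostly right, deliberately wrong at some positions."""
    ctx, block = list(tokens), []
    for _ in range(k):
        tok = toy_target_next(ctx)
        if len(ctx) % 5 == 0:
            tok = (tok + 1) % 101  # inject a drafting error
        block.append(tok)
        ctx.append(tok)
    return block


prompt = [3, 1, 4]
spec = speculative_generate(prompt, toy_draft_block, toy_target_next, 40)
plain = greedy_generate(prompt, toy_target_next, 40)
print(spec == plain)
```

Because every emitted token is either confirmed or supplied by the target, the speculative output matches plain greedy decoding even though the toy drafter makes mistakes. Only the number of target passes changes.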

Core Insight

The hidden states of a large target LLM already contain information about future tokens (N+1, N+2, …). DFlash feeds these hidden features into the draft model, letting the draft "stand on the shoulders of the giant" instead of guessing from scratch.

Why Diffusion Works Best

Autoregressive draft models incur a cost that grows linearly with the number of drafted tokens, because each token needs its own forward pass. This forces prior work such as EAGLE‑3 to shrink the drafter to a single transformer layer, which limits draft quality. In contrast, a diffusion drafter's cost is almost independent of token count because it generates an entire block in one forward pass.

A multi‑layer DFlash generating 16 tokens is faster than a 1‑layer EAGLE‑3 generating 8 tokens.
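
A rough way to see why this can hold is to count the layer evaluations that must run one after another. The sketch below is a back-of-the-envelope model only: it ignores layer width, attention cost over the block, and kernel overhead, and the 5-layer depth used for the block drafter is a hypothetical value, not a figure from DFlash.

```python
def sequential_layer_passes(num_layers: int, tokens: int, parallel_block: bool) -> int:
    """Count layer evaluations on the critical path when drafting `tokens` tokens."""
    # An autoregressive drafter needs one forward pass per token;
    # a block drafter needs one forward pass for the whole block.
    steps = 1 if parallel_block else tokens
    return num_layers * steps


# 1-layer autoregressive drafter producing 8 tokens (EAGLE-3-like setup)
autoregressive_cost = sequential_layer_passes(num_layers=1, tokens=8, parallel_block=False)

# Hypothetical 5-layer block drafter producing 16 tokens in one pass
block_cost = sequential_layer_passes(num_layers=5, tokens=16, parallel_block=True)

print(autoregressive_cost, block_cost)
```

Under these assumptions the deeper block drafter has the shorter critical path (5 sequential layer passes versus 8) while drafting twice as many tokens.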

Implementation Details

Feature Fusion: Uniformly sample hidden features from multiple layers of the target model and apply a lightweight projection to fuse them.

KV Injection: Inject the fused features into the K/V cache of every layer of the draft model. This differs from EAGLE‑3, which only injects at the first layer.

Parallel Drafting: Using the enriched context, predict the next block of tokens in one step.
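
The first two steps can be sketched with plain Python lists. This is a toy illustration under stated assumptions, not DFlash's implementation: the layer counts, vector sizes, weight matrices, and the helper names `uniform_layer_indices`, `fuse_features`, and `inject_kv` are all hypothetical.

```python
def uniform_layer_indices(num_layers: int, num_samples: int) -> list:
    """Pick layer indices spread evenly from the first to the last layer."""
    if num_samples < 2:
        return [num_layers - 1]
    return [i * (num_layers - 1) // (num_samples - 1) for i in range(num_samples)]


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def fuse_features(hidden_states, layer_ids, proj):
    """Concatenate the sampled layers' hidden vectors, then project them."""
    concat = [x for i in layer_ids for x in hidden_states[i]]
    return matvec(proj, concat)


def inject_kv(draft_layers, fused):
    """Append K/V entries derived from the fused feature to EVERY draft layer."""
    for layer in draft_layers:
        layer["k_cache"].append(matvec(layer["w_k"], fused))
        layer["v_cache"].append(matvec(layer["w_v"], fused))


# Toy target model: 8 layers, hidden size 2, one token position.
hidden_states = [[float(l), float(l) + 0.5] for l in range(8)]
layer_ids = uniform_layer_indices(num_layers=8, num_samples=3)

# Toy projection from 3 * 2 concatenated features down to 2.
proj = [[1, 0, 1, 0, 1, 0],
        [0, 1, 0, 1, 0, 1]]
fused = fuse_features(hidden_states, layer_ids, proj)

# Toy draft model: 3 layers, identity K/V projections, empty caches.
identity = [[1, 0], [0, 1]]
draft_layers = [{"w_k": identity, "w_v": identity, "k_cache": [], "v_cache": []}
                for _ in range(3)]
inject_kv(draft_layers, fused)

print(layer_ids, fused, [len(l["k_cache"]) for l in draft_layers])
```

The point of the sketch is the loop in `inject_kv`: every draft layer receives the target-derived context, whereas a first-layer-only scheme would touch just `draft_layers[0]`.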

Benchmark Results (Qwen3‑8B)

Task          Baseline   EAGLE‑3   DFlash
------------------------------------------
GSM8K          1×        2.13×    5.20×
MATH‑500       1×        2.18×    6.17×
AIME24         1×        2.25×    5.91×
AIME25         1×        2.18×    5.85×
HumanEval      1×        2.48×    5.20×
MBPP           1×        2.27×    4.75×
LiveCodeBench  1×        2.24×    5.43×
SWE‑Bench      1×        1.90×    2.92×
MT‑Bench       1×        1.94×    2.79×
Alpaca         1×        1.88×    2.27×

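On the math benchmarks and on HumanEval, MBPP, and LiveCodeBench, DFlash is more than twice as fast as EAGLE‑3; the gap narrows on SWE‑Bench and on the chat benchmarks (MT‑Bench, Alpaca). Under temperature sampling (temp=1) and in "thinking" mode, DFlash maintains a stable ~4.5× speedup.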
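
The DFlash-to-EAGLE‑3 ratio for each row can be recomputed directly from the table. The snippet below only copies the speedup figures listed above and divides them; no new measurements are involved.

```python
# (EAGLE-3 speedup, DFlash speedup) per task, copied from the table above
speedups = {
    "GSM8K": (2.13, 5.20),
    "MATH-500": (2.18, 6.17),
    "AIME24": (2.25, 5.91),
    "AIME25": (2.18, 5.85),
    "HumanEval": (2.48, 5.20),
    "MBPP": (2.27, 4.75),
    "LiveCodeBench": (2.24, 5.43),
    "SWE-Bench": (1.90, 2.92),
    "MT-Bench": (1.94, 2.79),
    "Alpaca": (1.88, 2.27),
}

# Relative advantage of DFlash over EAGLE-3 on each task
ratios = {task: dflash / eagle for task, (eagle, dflash) in speedups.items()}

for task, ratio in ratios.items():
    print(f"{task:<14} {ratio:.2f}x")
```

The ratio ranges from about 1.2× on Alpaca to about 2.8× on MATH‑500, with SWE‑Bench at roughly 1.5×.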

Supported Models

Gemma series: gemma‑4‑26B‑A4B‑it, gemma‑4‑31B‑it

Qwen series: Qwen3.6‑27B, Qwen3.6‑35B‑A3B, Qwen3.5‑4B/9B/27B/35B‑A3B/122B‑A10B

Coder series: Qwen3‑Coder‑Next, Qwen3‑Coder‑30B‑A3B

Major vendor models: MiniMax‑M2.5 (preview), Kimi‑K2.5

OSS: gpt‑oss‑20b, gpt‑oss‑120b

Coming soon: DeepSeek‑V4‑Flash, V4‑Pro, MiniMax‑M2.7, GLM‑5.1

Installation

DFlash can be used with four back‑ends. From the root of a local DFlash checkout, install the extra for the back‑end you need:

# Transformers
uv pip install -e ".[transformers]"
# SGLang
uv pip install -e ".[sglang]"
# vLLM (v0.20.1+ supports the kernel)
uv pip install -e ".[vllm]"
# MLX (Apple Silicon)
pip install -e ".[mlx]"

For Gemma 4 on vLLM, a pre‑built Docker image is provided:

docker pull ghcr.io/z-lab/vllm-openai:gemma4-dflash-cu130

Usage Examples

vLLM (Qwen3.5‑27B)

vllm serve Qwen/Qwen3.5-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768
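
Once the server is up, clients talk to it through vLLM's OpenAI-compatible HTTP API, so speculative decoding stays invisible on the client side. The sketch below uses only the Python standard library and assumes the server runs on vLLM's default port 8000 on localhost; `build_chat_request` and `ask` are hypothetical helper names, and the `max_tokens` value is an arbitrary choice.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "Qwen/Qwen3.5-27B",
                       base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,  # must match the model name passed to `vllm serve`
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires the server started above to be running:
# print(ask("How many positive whole-number divisors does 196 have?"))
```

The request carries no speculative-decoding fields; the draft model is configured entirely on the server through `--speculative-config`.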

Gemma 4 via Docker

docker run --rm -it \
  --gpus all --ipc=host --shm-size=16g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/z-lab/vllm-openai:gemma4-dflash-cu130 \
  google/gemma-4-26B-A4B-it \
  --host 0.0.0.0 --port 8000 \
  --speculative-config '{"method": "dflash", "model": "z-lab/gemma-4-26B-A4B-it-DFlash", "num_speculative_tokens": 15, "attention_backend": "flash_attn"}' \
  --attention-backend triton_attn \
  --max-num-batched-tokens 32768 \
  --trust-remote-code

SGLang

export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-35B-A3B \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-35B-A3B-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 1 \
  --attention-backend trtllm_mha \
  --speculative-draft-attention-backend fa4 \
  --mem-fraction-static 0.75 \
  --trust-remote-code

Transformers (Python)

from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

draft = AutoModel.from_pretrained(
    "z-lab/Qwen3-8B-DFlash-b16",
    trust_remote_code=True, dtype="auto", device_map="cuda:0").eval()

target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", dtype="auto", device_map="cuda:0").eval()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [{"role": "user", "content": "How many positive whole-number divisors does 196 have?"}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True, enable_thinking=False).to(draft.device)

output = draft.spec_generate(
    input_ids=input_ids, max_new_tokens=2048, temperature=0.0,
    target=target, stop_token_ids=[tokenizer.eos_token_id])
print(tokenizer.decode(output[0], skip_special_tokens=False))

MLX (Apple Silicon)

from dflash.model_mlx import load, load_draft, stream_generate

model, tokenizer = load("Qwen/Qwen3.5-4B")
draft = load_draft("z-lab/Qwen3.5-4B-DFlash")

messages = [{"role": "user", "content": "How many positive whole-number divisors does 196 have?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)

for r in stream_generate(model, draft, tokenizer, prompt, block_size=16, max_tokens=2048, temperature=0.6):
    print(r.text, end="", flush=True)
    tps = r.generation_tps
print(f"\nThroughput: {tps:.2f} tok/s")

Conclusion

DFlash redefines the role of diffusion models – they only need to be an ultra‑fast, ultra‑accurate drafter, while the target model guarantees final quality.
The draft model directly reuses the target model's embedding and LM head; only a few middle layers are newly trained, which keeps the parameter count low.