Google’s 26B DiffusionGemma Model Delivers 1000+ Tokens/s – Runs on a 4090
DiffusionGemma is Google DeepMind's 26B MoE model that generates text in 256-token blocks via diffusion. It exceeds 1,000 tokens per second on H100/H200 GPUs, ships in FP8 and NVFP4 quantized versions with near-lossless accuracy, and can be deployed locally with vLLM Docker images. The cost is higher first-token latency and limited concurrency.
Overview
DiffusionGemma is a 26‑billion‑parameter model built by Google DeepMind on the Gemma‑4 architecture. It differs from traditional autoregressive (AR) models by generating whole 256‑token blocks at once through a diffusion process, which dramatically reduces KV‑cache reads and better utilizes idle compute.
Total parameters: 25.2 B
Active (MoE) parameters: 3.8 B (128 experts, top‑8 routing)
Context window: 256 K tokens
Canvas length: 256 tokens
Vocabulary size: 262 K
Supported modalities: text, image, video
License: Apache 2.0 (open source)
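As a rough guide to what these figures mean for hardware, the sketch below estimates the memory needed just to hold the weights at different precisions. It assumes all 25.2 B parameters stay resident (MoE routing lowers compute per token, not the stored weights) and uses nominal widths of 2, 1 and 0.5 bytes per parameter for BF16, FP8 and NVFP4. It ignores quantization scales, the KV cache and activations, and the function name is illustrative rather than taken from any library.

```python
# Nominal bytes per parameter. Real checkpoints add some overhead for
# quantization scales, which this estimate deliberately ignores.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}

TOTAL_PARAMS = 25.2e9   # total parameters, from the spec list above
ACTIVE_PARAMS = 3.8e9   # parameters used per token (top-8 of 128 experts)


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (10^9 bytes) at the given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        size = weight_memory_gb(TOTAL_PARAMS, precision)
        print(f"{precision:>6}: ~{size:.1f} GB of weights")
    print(f"active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Actual memory use will be higher once the KV cache and the diffusion canvas state are included, so these numbers are best read as a lower bound.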
Why It Is Fast
Traditional AR models emit one token per step and must read the full KV cache for every token, which leaves much of the GPU's compute idle in low-batch scenarios. DiffusionGemma instead refines an entire 256-token canvas on every diffusion step and repeats for 10-20 denoising steps until the canvas stabilises, so the otherwise idle compute is spent producing many tokens in parallel. The trade-off is higher first-token latency, because the model must finish a full denoising cycle before it can emit any output.
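The argument above comes down to counting forward passes per block. The sketch below makes that count explicit, assuming one forward pass per token for AR decoding and one per denoising step for diffusion. The function name is hypothetical, and the model is deliberately simplified: each diffusion pass processes all 256 positions and therefore costs more than a single AR decode step.

```python
from typing import Tuple


def passes_per_block(canvas_len: int, denoise_steps: int) -> Tuple[int, int, float]:
    """Return (AR passes, diffusion passes, ratio) for one block of tokens."""
    if canvas_len <= 0 or denoise_steps <= 0:
        raise ValueError("canvas_len and denoise_steps must be positive")
    ar_passes = canvas_len            # AR decoding: one pass per generated token
    diffusion_passes = denoise_steps  # diffusion: one pass per denoising step
    return ar_passes, diffusion_passes, ar_passes / diffusion_passes


if __name__ == "__main__":
    # The 10-20 step range and the 256-token canvas come from the text above.
    for steps in (10, 20):
        ar, diff, ratio = passes_per_block(256, steps)
        print(f"{steps} steps: {ar} AR passes vs {diff} diffusion passes ({ratio:.1f}x fewer)")
```

Because each diffusion pass is heavier, the pass-count ratio is an upper bound on the speedup rather than a prediction of it, which fits the smaller measured gains reported below.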
Performance Measurements
vLLM’s official batch‑size‑1 tests (FP8‑quantized) show:
H200 GPU: 1,288 tok/s (≈6× faster than AR baseline, 3× faster than MTP baseline)
H100 GPU: 1,008 tok/s (≈5× faster than AR baseline, 2.6× faster than MTP baseline)
More detailed single‑request benchmarks:
Output throughput: 199 tok/s (Gemma‑4 26B AR) vs 375 tok/s (DiffusionGemma) → 1.9× increase
Single‑request generation speed: 205 tok/s vs 1,282 tok/s → 6.2× increase
End‑to‑end latency: 2.87 s vs 0.88 s → 3.3× faster
First-token latency: 53 ms vs 489 ms → about 9.2× slower for DiffusionGemma, which must denoise the full canvas before emitting output
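The multipliers in this list can be recomputed from the raw numbers. The sketch below does that, treating throughput as higher-is-better and latency as lower-is-better; the dictionary keys and function name are illustrative. A result below 1 means DiffusionGemma is worse on that metric.

```python
# (Gemma-4 26B AR, DiffusionGemma) pairs copied from the list above.
REPORTED = {
    "output_throughput_tok_s": (199.0, 375.0),
    "single_request_speed_tok_s": (205.0, 1282.0),
    "end_to_end_latency_s": (2.87, 0.88),
    "first_token_latency_ms": (53.0, 489.0),
}
HIGHER_IS_BETTER = {"output_throughput_tok_s", "single_request_speed_tok_s"}


def improvement(metric: str) -> float:
    """Factor by which DiffusionGemma improves on the AR baseline (>1 is better)."""
    ar_value, diffusion_value = REPORTED[metric]
    if metric in HIGHER_IS_BETTER:
        return diffusion_value / ar_value
    return ar_value / diffusion_value  # for latencies, smaller is better


if __name__ == "__main__":
    for name in REPORTED:
        print(f"{name}: {improvement(name):.2f}x")
```

Small differences from the quoted multipliers are to be expected, since the source figures are themselves rounded.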
Quantized Versions
Red Hat AI released two quantized builds with negligible accuracy loss.
FP8 dynamic quantization – recovery rates: AIME 2025 96.8 %, GPQA Diamond 102.5 %, GSM8K 99.9 %, AIME 2025 Thinking 101.5 %.
NVFP4 quantization – recovery rates: AIME 2025 97.7 %, GPQA Diamond 100.5 %, GSM8K 100 %, AIME 2025 Thinking 98 %.
Both versions retain multimodal support and the “Thinking” mode for stronger reasoning.
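The source does not spell out how a recovery rate is computed. A minimal sketch, assuming the usual convention of dividing the quantized model's score by the unquantized baseline's score, shows why values above 100 % are possible. The scores in the example are hypothetical and are not DiffusionGemma results.

```python
def recovery_pct(quantized_score: float, baseline_score: float) -> float:
    """Quantized score expressed as a percentage of the baseline score."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return 100.0 * quantized_score / baseline_score


if __name__ == "__main__":
    # Hypothetical scores: the quantized build happens to score slightly
    # higher than the baseline, so recovery exceeds 100 %.
    print(f"{recovery_pct(82.0, 80.0):.1f} %")
```

Under this definition, a figure slightly above 100 % usually reflects run-to-run benchmark noise rather than a real gain from quantization.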
Local Deployment
DiffusionGemma runs via the vLLM Gemma Docker image.
# Pull the image
docker pull vllm/vllm-openai:gemma
# Single‑GPU deployment (H100/H200)
docker run -itd --name diffusiongemma \
  --ipc=host --network host --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:gemma \
  --model google/diffusiongemma-26B-A4B-it \
  --max-model-len 262144 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0 --port 8000
For the FP8 quantized model:
# Set the variable on the same command line so it reaches the vllm process
VLLM_USE_V2_MODEL_RUNNER=1 vllm serve RedHatAI/diffusiongemma-26B-A4B-it-FP8-dynamic \
  --trust-remote-code \
  --attention-backend TRITON_ATTN \
  --max-num-seqs 4 \
  --hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1}' \
  --default-chat-template-kwargs '{"enable_thinking": true}'
Key runtime flags:
--max-num-seqs 4: limits concurrency because the diffusion canvas state is large.
--gpu-memory-utilization 0.85: caps vLLM at 85 % of GPU memory, leaving headroom for the denoising step.
--attention-backend TRITON_ATTN: required for the quantized builds.
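Loading a model of this size can take several minutes, so it helps to wait for the server before sending requests. The sketch below polls the /v1/models route of vLLM's OpenAI-compatible server using only the standard library. The helper names, the 10-minute limit and the 5-second poll interval are assumptions for illustration, and the host and port match the commands above.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False  # server not up yet, or still loading weights


def wait_until_ready(
    base_url: str = "http://localhost:8000",
    max_wait_s: float = 600.0,
    poll_s: float = 5.0,
    probe: Callable[[str], bool] = http_probe,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll <base_url>/v1/models until it responds or max_wait_s elapses."""
    url = base_url.rstrip("/") + "/v1/models"
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

Calling wait_until_ready() with its defaults returns True once the local server answers, or False if it has not come up within the time limit. The probe and sleep parameters exist only so the loop can be exercised without a running server.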
Usage
After deployment, the API matches the OpenAI format.
from openai import OpenAI

# Point the client at the local vLLM server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="google/diffusiongemma-26B-A4B-it",
    messages=[{"role": "user", "content": "Write a quicksort in Python"}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
Enabling the Thinking mode for stronger reasoning:
response = client.chat.completions.create(
    model="google/diffusiongemma-26B-A4B-it",
    messages=[{"role": "user", "content": "Find the derivative of x^3 * ln(x)"}],
    max_tokens=32768,
    # Passed through to the chat template to switch on Thinking mode.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
Pros and Cons
Pros:
Inference speed >1000 tok/s, 5‑6× faster than AR models.
MoE architecture keeps active parameters low (3.8 B), enabling single‑GPU runs.
FP8 and NVFP4 quantizations retain near‑lossless accuracy.
Supports multimodal inputs (text, image, video).
Thinking mode improves complex reasoning.
Apache 2.0 license permits commercial use.
OpenAI‑compatible API simplifies integration.
Cons:
High first‑token latency (~489 ms) makes it unsuitable for latency‑sensitive streaming chats.
Concurrency limited to ≤4 sequences per GPU.
Currently only supported by vLLM; ecosystem still maturing.
Benchmark scores trail the same-size Gemma-4, in some cases substantially (e.g., AIME 69.1 % vs 88.3 %).
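The first-token latency listed among the cons can be checked on a local deployment by timing a streamed response. The sketch below times any iterable of text pieces. The helper name is hypothetical, and it counts streamed chunks rather than tokens, so it measures latency and not tokens per second.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def time_stream(
    pieces: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Return (seconds to first non-empty piece, total seconds, piece count)."""
    start = clock()
    first_piece_s: Optional[float] = None
    count = 0
    for piece in pieces:
        if not piece:
            continue  # skip empty deltas such as role-only chunks
        if first_piece_s is None:
            first_piece_s = clock() - start
        count += 1
    return first_piece_s, clock() - start, count


# Usage with the OpenAI client against the local server (needs it running):
#   stream = client.chat.completions.create(
#       model="google/diffusiongemma-26B-A4B-it",
#       messages=[{"role": "user", "content": "Explain quicksort briefly"}],
#       max_tokens=256,
#       stream=True,
#   )
#   pieces = (c.choices[0].delta.content or "" for c in stream if c.choices)
#   first_s, total_s, n = time_stream(pieces)
```

The generator is created before timing starts but only consumed inside time_stream, so network and denoising time both count toward the first-piece measurement.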
Conclusion
DiffusionGemma demonstrates that diffusion‑based generation can push 26 B‑scale open‑source models to >1000 tokens/s on a single H100/H200, offering a compelling speed advantage for low‑concurrency, latency‑tolerant scenarios such as personal assistants or local inference, while quantized builds keep accuracy virtually intact.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
