
Google’s 26B DiffusionGemma Model Delivers 1000+ Tokens/s – Runs on a 4090

DiffusionGemma is Google DeepMind’s 26B MoE model that generates 256‑token blocks via diffusion. It exceeds 1000 tokens per second on H100/H200 GPUs, ships in FP8 and NVFP4 quantized versions with near‑lossless accuracy, and can be deployed locally with vLLM Docker images. The costs are higher first‑token latency and limited concurrency.

Old Zhang's AI Learning

Jun 11, 2026


Overview

DiffusionGemma is a 26B‑class mixture‑of‑experts model built by Google DeepMind on the Gemma‑4 architecture. Unlike traditional autoregressive (AR) models, it generates whole 256‑token blocks at once through a diffusion process, which cuts the number of KV‑cache reads per generated token and puts otherwise idle compute to work.

Total parameters: 25.2 B

Active (MoE) parameters: 3.8 B (128 experts, top‑8 routing)

Context window: 256 K tokens

Canvas length: 256 tokens

Vocabulary size: 262 K

Supported modalities: text, image, video

License: Apache 2.0 (open source)

Why It Is Fast

Traditional AR models emit one token per step and need a full KV‑cache read for each token, which leaves much of the GPU's compute idle in low‑batch scenarios. DiffusionGemma instead refines a whole 256‑token canvas in parallel, running 10‑20 denoising steps until the canvas stabilises, so each pass advances many tokens at once and the idle compute is converted into faster generation. The trade‑off is higher first‑token latency: the model must finish a full denoising cycle on the canvas before it can emit any output.
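To make the step-count argument concrete, here is a minimal sketch that counts model forward passes for both decoding styles. It assumes one forward pass per denoising step and that 256‑token blocks are produced one after another; neither detail is spelled out in the source, and the function names are illustrative only.

```python
import math

def ar_forward_passes(num_tokens: int) -> int:
    """Autoregressive decoding: one forward pass per generated token."""
    return num_tokens

def block_diffusion_forward_passes(num_tokens: int, canvas: int = 256, steps: int = 20) -> int:
    """Block diffusion (assumed model): every canvas of `canvas` tokens
    is refined with `steps` forward passes, blocks run sequentially."""
    blocks = math.ceil(num_tokens / canvas)
    return blocks * steps

for n in (256, 512, 2048):
    ar = ar_forward_passes(n)
    best = block_diffusion_forward_passes(n, steps=10)   # lower end of 10-20 steps
    worst = block_diffusion_forward_passes(n, steps=20)  # upper end of 10-20 steps
    print(f"{n} tokens: AR {ar} passes, diffusion {best}-{worst} passes")
```

The pass count is not the speedup itself: a diffusion pass works on 256 positions at once and therefore costs more compute than a single AR step, which is consistent with the measured gains being smaller than the raw ratio of passes.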

Performance Measurements

vLLM’s official batch‑size‑1 tests (FP8‑quantized) show:

H200 GPU: 1,288 tok/s (≈6× faster than AR baseline, 3× faster than MTP baseline)

H100 GPU: 1,008 tok/s (≈5× faster than AR baseline, 2.6× faster than MTP baseline)

More detailed single‑request benchmarks (Gemma‑4 26B AR vs DiffusionGemma):

Output throughput: 199 tok/s vs 375 tok/s → 1.9× increase

Single‑request generation speed: 205 tok/s vs 1,282 tok/s → 6.2× increase

End‑to‑end latency: 2.87 s vs 0.88 s → 3.3× faster

First‑token latency: 53 ms vs 489 ms → about 9.2× higher for DiffusionGemma, because the full canvas must be denoised before any output
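The first-token and generation-speed figures pull in opposite directions, so it helps to estimate where DiffusionGemma starts to win. The sketch below uses a simple linear model, latency = first-token latency + tokens / generation speed, and treats the "single-request generation speed" as the rate after the first token. Both are assumptions, since the source does not define the metrics precisely.

```python
def estimated_latency(n_tokens: float, ttft_s: float, decode_tok_per_s: float) -> float:
    """Linear estimate: wait for the first token, then decode at a fixed rate."""
    return ttft_s + n_tokens / decode_tok_per_s

# Figures copied from the single-request benchmarks above.
AR = {"ttft_s": 0.053, "decode_tok_per_s": 205.0}
DIFF = {"ttft_s": 0.489, "decode_tok_per_s": 1282.0}

def break_even_tokens(a: dict, b: dict) -> float:
    """Output length at which the two linear estimates are equal."""
    extra_wait = b["ttft_s"] - a["ttft_s"]
    saved_per_token = 1 / a["decode_tok_per_s"] - 1 / b["decode_tok_per_s"]
    return extra_wait / saved_per_token

n_star = break_even_tokens(AR, DIFF)
print(f"break-even output length: about {n_star:.0f} tokens")
for n in (50, 500):
    print(n, round(estimated_latency(n, **AR), 2), round(estimated_latency(n, **DIFF), 2))
```

Under this model the crossover sits at roughly 106 output tokens: shorter replies finish sooner on the AR model, longer ones on DiffusionGemma. Real output arrives in 256‑token blocks rather than smoothly, so treat the number as an order-of-magnitude guide.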

Quantized Versions

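Red Hat AI released two quantized builds with negligible accuracy loss. The recovery rates below express each quantized score as a percentage of the unquantized model's score, so values above 100 % mean the quantized build scored marginally higher on that benchmark.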

FP8 dynamic quantization – recovery rates: AIME 2025 96.8 %, GPQA Diamond 102.5 %, GSM8K 99.9 %, AIME 2025 Thinking 101.5 %.

NVFP4 quantization – recovery rates: AIME 2025 97.7 %, GPQA Diamond 100.5 %, GSM8K 100 %, AIME 2025 Thinking 98 %.

Both versions retain multimodal support and the “Thinking” mode for stronger reasoning.
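As a rough sanity check on GPU memory, the arithmetic below multiplies the 25.2 B total parameters by a nominal bit width per format. It is a sketch under stated assumptions: the unquantized checkpoint is taken to be 16‑bit, quantization scale factors are ignored, and KV cache and activations are not counted. All experts must be resident in memory, so the total rather than the active parameter count sets the footprint.

```python
TOTAL_PARAMS = 25.2e9  # total parameter count from the spec list

# Nominal bits per weight. Assumption: per-block scale factors are ignored,
# so the real quantized checkpoints are somewhat larger than this.
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}

def weight_gb(params: float, bits: int) -> float:
    """Weight storage in decimal gigabytes for a given bit width."""
    return params * bits / 8 / 1e9

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{weight_gb(TOTAL_PARAMS, bits):.1f} GB of weights")
```

By this estimate the weights alone come to roughly 50.4 GB, 25.2 GB, and 12.6 GB. On a 24 GB card such as the 4090 named in the title, only the 4‑bit figure leaves headroom for anything else.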

Local Deployment

DiffusionGemma runs via the vLLM Gemma Docker image.

# Pull the image
docker pull vllm/vllm-openai:gemma

# Single‑GPU deployment (H100/H200)
docker run -itd --name diffusiongemma \
    --ipc=host --network host --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:gemma \
    --model google/diffusiongemma-26B-A4B-it \
    --max-model-len 262144 \
    --max-num-seqs 4 \
    --gpu-memory-utilization 0.85 \
    --host 0.0.0.0 --port 8000

For the FP8 quantized model:

# Set the variable on the same command so it reaches the vllm process
VLLM_USE_V2_MODEL_RUNNER=1 \
vllm serve RedHatAI/diffusiongemma-26B-A4B-it-FP8-dynamic \
    --trust-remote-code \
    --attention-backend TRITON_ATTN \
    --max-num-seqs 4 \
    --hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1}' \
    --default-chat-template-kwargs '{"enable_thinking": true}'

Key runtime flags:

--max-num-seqs 4: limits concurrency because the diffusion canvas state is large.

--gpu-memory-utilization 0.85: reserves memory for the denoising step.

--attention-backend TRITON_ATTN: required for the quantized builds.
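Loading a model of this size takes a while, so a client script may want to wait until the server answers before sending requests. The sketch below polls the vLLM server's /health endpoint using only the standard library. The host and port match the commands above; the retry count, delay, and function names are illustrative choices, not values from the source.

```python
import time
import urllib.request

def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError.
        return False

def wait_until_ready(url, probe=http_ok, attempts=60, delay_s=5.0, sleep=time.sleep):
    """Poll `url` until the probe succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay_s)
    return False

# Example call once the container or `vllm serve` process has been started:
# ready = wait_until_ready("http://localhost:8000/health")
```

The probe and sleep functions are parameters so the loop can be exercised without a running server.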

Usage

After deployment, the API matches the OpenAI format.

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="google/diffusiongemma-26B-A4B-it",
    messages=[{"role": "user", "content": "Write a quicksort in Python"}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
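Since first-token latency is the main cost of this model, it is worth measuring on your own hardware. The following sketch streams a chat completion through the OpenAI client and records the time to the first text piece and the total time. The helper names are hypothetical, and how this vLLM build chunks diffusion output when streaming is not stated in the source, so the code only measures whatever arrives.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple

def stream_text(client, model: str, prompt: str, max_tokens: int = 512) -> Iterator[str]:
    """Yield text pieces from an OpenAI-compatible streaming chat completion."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no text (for example role-only or final chunks).
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

def time_stream(pieces: Iterable[str], clock=time.perf_counter) -> Tuple[Optional[float], float, str]:
    """Return (seconds until first piece, total seconds, concatenated text)."""
    start = clock()
    first = None
    parts = []
    for piece in pieces:
        if first is None:
            first = clock() - start
        parts.append(piece)
    return first, clock() - start, "".join(parts)

# Usage with the `client` created above:
# ttft, total, text = time_stream(
#     stream_text(client, "google/diffusiongemma-26B-A4B-it", "Write a quicksort in Python"))
```

Because stream_text is a generator, the request is only sent when time_stream starts iterating, so the measured first-piece time includes the full round trip.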

Enabling the Thinking mode for stronger reasoning:

response = client.chat.completions.create(
    model="google/diffusiongemma-26B-A4B-it",
    messages=[{"role": "user", "content": "Find the derivative of x^3 * ln(x)"}],
    max_tokens=32768,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
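Because the server is started with --max-num-seqs 4, a client that fans out many prompts may want to cap its own in-flight requests instead of letting them queue on the server. Below is a minimal asyncio sketch of that idea. The `ask` coroutine is a hypothetical placeholder for whatever async call you use, and the limit of 4 simply mirrors the flag.

```python
import asyncio

MAX_IN_FLIGHT = 4  # mirrors --max-num-seqs 4 from the deployment commands

async def run_bounded(prompts, ask, limit=MAX_IN_FLIGHT):
    """Run `ask(prompt)` for every prompt with at most `limit` calls in flight.
    Results come back in the same order as `prompts`."""
    gate = asyncio.Semaphore(limit)

    async def guarded(prompt):
        async with gate:
            return await ask(prompt)

    return await asyncio.gather(*(guarded(p) for p in prompts))

# Possible `ask` using the async OpenAI client (not executed here):
# from openai import AsyncOpenAI
# aclient = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# async def ask(prompt):
#     r = await aclient.chat.completions.create(
#         model="google/diffusiongemma-26B-A4B-it",
#         messages=[{"role": "user", "content": prompt}],
#         max_tokens=512,
#     )
#     return r.choices[0].message.content
# answers = asyncio.run(run_bounded(["prompt one", "prompt two"], ask))
```

A semaphore keeps the code simple; a worker pool with a queue would do the same job if you also need retries or prioritisation.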

Pros and Cons

Pros:

Inference speed >1000 tok/s, 5‑6× faster than AR models.

MoE architecture keeps active parameters low (3.8 B of 25.2 B), so per‑token compute stays small and the model runs on a single GPU.

FP8 and NVFP4 quantizations retain near‑lossless accuracy.

Supports multimodal inputs (text, image, video).

Thinking mode improves complex reasoning.

Apache 2.0 license permits commercial use.

OpenAI‑compatible API simplifies integration.

Cons:

High first‑token latency (~489 ms) makes it unsuitable for latency‑sensitive streaming chats.

Concurrency limited to ≤4 sequences per GPU.

Currently only supported by vLLM; ecosystem still maturing.

Benchmark scores trail the same‑size Gemma‑4 (e.g., AIME 69.1 % vs 88.3 %, a gap of about 19 points).

Conclusion

DiffusionGemma demonstrates that diffusion‑based generation can push 26 B‑scale open‑source models to >1000 tokens/s on a single H100/H200, offering a compelling speed advantage for low‑concurrency, latency‑tolerant scenarios such as personal assistants or local inference, while quantized builds keep accuracy virtually intact.


