Jun 11, 2026 · Artificial Intelligence

Google’s 26B DiffusionGemma Model Delivers 1000+ Tokens/s – Runs on a 4090

DiffusionGemma is Google DeepMind’s 26B MoE model that generates 256‑token blocks via diffusion. It achieves over 1000 tokens per second on H100/H200 GPUs, offers FP8 and NVFP4 quantized versions with near‑lossless accuracy, and can be deployed locally with vLLM Docker images. The trade-offs are higher first‑token latency and limited concurrency.

26B model · DiffusionGemma · FP8 quantization

10 min read
