Google’s 26B DiffusionGemma Model Delivers 1000+ Tokens/s – Runs on a 4090
DiffusionGemma, Google DeepMind’s 26B MoE model that generates 256‑token blocks via diffusion, achieves over 1000 tokens per second on H100/H200 GPUs, offers FP8 and NVFP4 quantized versions with near‑lossless accuracy, and can be deployed locally with vLLM Docker images, though it incurs higher first‑token latency and limited concurrency.
