DiffusionGemma Boosts Text Generation Speed Up to 4× with Discrete Diffusion
Google's open-source DiffusionGemma model combines a 26-billion-parameter Mixture-of-Experts architecture with discrete diffusion decoding to generate whole blocks of text at once. Google reports up to four times faster generation (more than 1,100 tokens/s on an NVIDIA H100 and more than 700 tokens/s on an RTX 5090) while activating only about 3.8 billion parameters during inference.
On June 11, Google open-sourced DiffusionGemma, a text-generation model that applies discrete diffusion to the Gemma 4 series. A new Diffusion Head lets the model generate an entire block of text at once instead of emitting one token at a time, then refine that block through multiple rounds of parallel denoising. According to Google, this yields up to a fourfold speedup.
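The block-wise decoding idea can be illustrated with a toy loop. This is a minimal sketch, not DiffusionGemma's actual algorithm: `toy_denoiser` is a hypothetical stand-in for the model that simply knows the target text and attaches a random confidence to each proposal, and the schedule (commit the most confident share of masked positions each round) is an assumption chosen for clarity. A real diffusion model may also revise tokens it has already written, which this sketch does not do.

```python
import math
import random

MASK = "<mask>"

def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for the model: in one parallel pass, propose a
    token and a confidence score for every position that is still masked."""
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(block)
        if tok == MASK
    }

def diffusion_decode(target, rounds=4, seed=0):
    """Fill a fully masked block over a fixed number of denoising rounds."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    passes = 0
    for step in range(rounds):
        proposals = toy_denoiser(block, target, rng)
        if not proposals:
            break
        passes += 1
        # Commit an even share of the remaining positions, most confident first.
        keep = math.ceil(len(proposals) / (rounds - step))
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:keep]
        for i in best:
            block[i] = proposals[i][0]
    return block, passes

target = "diffusion models can fill a whole block of text in parallel".split()
decoded, passes = diffusion_decode(target, rounds=4)
print(" ".join(decoded))
print(f"{passes} model passes for {len(target)} tokens")
```

The point of the sketch is the pass count: a token-by-token decoder would need one model pass per token, whereas here the number of passes is set by the number of denoising rounds rather than by the block length.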
Architecturally, DiffusionGemma is a 26B-class Mixture-of-Experts (MoE) model with roughly 25.2 billion total parameters, of which only about 3.8 billion are activated during inference. This selective activation preserves strong reasoning capability while markedly reducing computational cost. The model follows an encoder-decoder structure with bidirectional attention, processes 256 tokens in parallel, and supports a context window of up to 256K tokens, multimodal image-text input, and a dedicated <|think|> reasoning mode.
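Why an MoE model activates only a fraction of its parameters can be sketched with a toy top-k router. Everything here is illustrative: the eight scalar "experts", the gate scores, and `k=2` are made-up values, not DiffusionGemma's real configuration, which the article does not describe. Only the final ratio uses figures quoted in the article (about 3.8 billion active parameters in a 26-billion-parameter model).

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts,
    with weights renormalised by a softmax over the selected logits."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, gate_logits, k=2):
    """Run only the routed experts and mix their outputs; the rest stay idle."""
    routed = top_k_route(gate_logits, k)
    output = sum(w * experts[i](x) for i, w in routed)
    return output, [i for i, _ in routed]

# Eight toy "experts", each just a scalar function (illustrative only).
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

y, used = moe_layer(3.0, experts, gate_logits, k=2)
print(f"experts used: {used} of {len(experts)}")

# Figures quoted in the article: about 3.8B active parameters in a 26B model.
active_fraction = 3.8 / 26
print(f"active share of parameters: {active_fraction:.1%}")
```

Only the selected experts are evaluated for a given input, so the compute per step tracks the active parameter count rather than the total.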
Official benchmarks show that DiffusionGemma generates more than 1,100 tokens per second on a single NVIDIA H100 GPU and exceeds 700 tokens per second on a GeForce RTX 5090, surpassing comparable autoregressive models of the same size. Although Google notes that the standard Gemma 4 model remains the preferred production choice for quality, DiffusionGemma demonstrates an alternative high‑efficiency route for large‑language‑model generation.
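To put the throughput figures in concrete terms, the short calculation below converts them into wall-clock time for a single response. The 2,000-token response length is a hypothetical value picked for illustration, and the "implied baseline" line assumes the peak four-fold speedup applies to the H100 figure, which the article does not state.

```python
def seconds_to_generate(num_tokens, tokens_per_second):
    """Wall-clock time to emit num_tokens at a steady generation rate."""
    return num_tokens / tokens_per_second

# Lower-bound rates quoted in the article (tokens per second).
rates = {"NVIDIA H100": 1100, "GeForce RTX 5090": 700}
response_tokens = 2000  # hypothetical response length

timings = {gpu: seconds_to_generate(response_tokens, rate) for gpu, rate in rates.items()}
for gpu, secs in timings.items():
    print(f"{gpu}: {secs:.2f} s for {response_tokens} tokens")

# Assumption: if the full 4x speedup held on the H100, the autoregressive
# baseline would run at a quarter of the quoted rate.
implied_baseline_rate = rates["NVIDIA H100"] / 4
baseline_secs = seconds_to_generate(response_tokens, implied_baseline_rate)
print(f"implied baseline: {implied_baseline_rate:.0f} tokens/s, {baseline_secs:.2f} s")
```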
To lower the entry barrier for developers, HyperAI quickly released an easy‑to‑deploy notebook that runs on a single NVIDIA RTX Pro 6000 card. The tutorial guides users through cloning the notebook repository, selecting the RTX Pro 6000 and vLLM image, allocating resources, and accessing a Jupyter workspace where the model can be executed and its outputs inspected.
HyperAI Super Neural
Deconstructing the sophistication and universality of technology, covering cutting-edge AI for Science case studies.
