Google Releases DiffusionGemma 26B MoE—Text Generation Up to 4× Faster
DiffusionGemma, Google's new 26‑billion‑parameter Mixture‑of‑Experts model, replaces token‑by‑token autoregression with a diffusion‑style output head that generates whole text blocks, delivering up to four‑fold speed gains on consumer GPUs while offering bidirectional attention and self‑correction, albeit with lower quality than standard Gemma 4.
DiffusionGemma is an open‑source 26 B parameter Mixture‑of‑Experts (MoE) model released under the Apache 2.0 license. It builds on the Gemma 4 family and incorporates research from Gemini Diffusion, adding a diffusion‑style output head designed to maximise generation speed.
Generation speed and throughput
Instead of predicting one token at a time autoregressively, DiffusionGemma generates a full 256‑token block in a single forward pass. This moves the decoding bottleneck from memory bandwidth to raw compute, which can raise token throughput by up to four‑fold: more than 1,000 tokens/s on an NVIDIA H100 and around 700 tokens/s on an RTX 5090.
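The quoted throughputs can be turned into per-block latency with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a benchmark. The function names are made up for illustration, the quoted figures are treated as exact, and "up to four-fold" is read as exactly 4x, so the baseline it prints is only the autoregressive throughput implied by that reading.

```python
BLOCK_TOKENS = 256  # block size stated in the article


def seconds_per_block(tokens_per_second: float, block_tokens: int = BLOCK_TOKENS) -> float:
    """Wall-clock time to emit one block at a given sustained throughput."""
    return block_tokens / tokens_per_second


def implied_baseline(tokens_per_second: float, speedup: float = 4.0) -> float:
    """Autoregressive throughput implied if the full speedup applied (assumption)."""
    return tokens_per_second / speedup


# Throughput figures quoted in the article, treated as exact for illustration.
quoted = {"NVIDIA H100": 1000.0, "RTX 5090": 700.0}

for gpu, tps in quoted.items():
    print(
        f"{gpu}: {seconds_per_block(tps) * 1000:.0f} ms per block, "
        f"implied baseline ~{implied_baseline(tps):.0f} tokens/s"
    )
```

At these rates a whole 256-token block arrives in roughly a quarter to a third of a second, which is the latency range that matters for interactive use.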
Parameter activation and hardware requirements
The model has 26 B parameters in total, but only about 3.8 B are activated during inference. After quantisation, the model fits within 18 GB of VRAM, which puts it within reach of high‑end consumer GPUs.
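The memory claim can be sanity-checked with a rough weight-size estimate. The article does not say which quantisation format is used, so the bit widths below are assumptions, and the calculation covers weights only (no KV cache, activations, or runtime overhead). Note that in a typical MoE deployment all 26 B weights must be resident in memory even though only about 3.8 B are used for any one token.

```python
TOTAL_PARAMS = 26e9    # total parameters, from the article
ACTIVE_PARAMS = 3.8e9  # parameters active per token, from the article
VRAM_BUDGET_GB = 18.0  # VRAM figure quoted in the article


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


# Bit widths are illustrative assumptions, not the model's documented formats.
for bits in (16, 8, 4):
    size = weight_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if size <= VRAM_BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {size:5.1f} GB ({verdict} in {VRAM_BUDGET_GB:.0f} GB)")

print(f"active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Under these assumptions only the 4-bit case lands under 18 GB, leaving a few gigabytes of headroom for everything other than weights.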
Bidirectional attention and parallelism
Bidirectional attention lets each of the 256 generated tokens attend to every other token. This parallelism benefits non‑linear generation scenarios such as in‑line editing, code completion, amino‑acid sequence creation, or graph‑structured mathematical expressions. The model iteratively refines its output, allowing it to inspect the whole text block and correct errors across multiple refinement steps.
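To make the idea of block-wise refinement concrete, here is a toy sketch of confidence-based iterative unmasking, a decoding scheme common in masked-diffusion text models. The article does not describe DiffusionGemma's actual sampler, so this is not its algorithm. `toy_predict` is a hypothetical stand-in for the model: its "task" is to complete a palindrome, so each position is determined by a token that may lie to its right, which a strictly left-to-right decoder could not look at. The sketch only fills in masked positions step by step; it does not revise tokens once they are committed.

```python
MASK = "_"


def toy_predict(draft, i):
    """Hypothetical stand-in for the model: guess position i from the WHOLE draft."""
    mirror = draft[len(draft) - 1 - i]
    if mirror != MASK:
        return mirror, 1.0  # mirrored token is known, so copy it with high confidence
    return "a", 0.1         # nothing to go on yet: low-confidence filler guess


def refine(prompt, tokens_per_step=2, max_steps=16):
    """Fill masked positions over several passes, most confident guesses first."""
    draft = list(prompt)
    for _ in range(max_steps):
        masked = [i for i, tok in enumerate(draft) if tok == MASK]
        if not masked:
            break
        # Every masked position is predicted from the same draft, i.e. in parallel.
        guesses = {i: toy_predict(draft, i) for i in masked}
        # Commit only the most confident guesses; the rest stay masked for the next pass.
        best = sorted(masked, key=lambda i: -guesses[i][1])[:tokens_per_step]
        for i in best:
            draft[i] = guesses[i][0]
    return "".join(draft)


print(refine("___ecar"))
print(refine("r_____r"))
```

In the second call the first pass has nothing to condition on and commits two low-confidence guesses; on the next pass those guesses make their mirror positions high-confidence, so they are filled first. That ordering by confidence, rather than by position, is the point of the sketch.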
Quality trade‑off
Because speed is prioritised, overall output quality is lower than that of the standard Gemma 4 model, which remains the recommended choice for applications where the highest quality is essential.
Fine‑tuning example
Unsloth fine‑tuned DiffusionGemma to solve Sudoku puzzles. The task demonstrates that the model’s bidirectional attention makes strong token‑to‑token dependencies easier to handle than they are for conventional autoregressive models.
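A short calculation shows why Sudoku stresses left-to-right generation. The sketch below assumes the grid is serialised in row-major order with one cell per position; that encoding is an assumption for illustration, since the article does not describe how Unsloth formatted the puzzles. For each cell it counts how many of the cells that constrain it (same row, column, or 3x3 box) come later in the sequence.

```python
def peers(cell: int) -> set[int]:
    """Indices (0..80, row-major) of the cells sharing a row, column, or box with `cell`."""
    r, c = divmod(cell, 9)
    same_row = {r * 9 + k for k in range(9)}
    same_col = {k * 9 + c for k in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    same_box = {(br + i) * 9 + (bc + j) for i in range(3) for j in range(3)}
    return (same_row | same_col | same_box) - {cell}


def later_peers(cell: int) -> int:
    """How many constraining cells come AFTER `cell` in row-major order."""
    return sum(1 for p in peers(cell) if p > cell)


for name, cell in [("top-left", 0), ("centre", 40), ("bottom-right", 80)]:
    print(f"{name:>12}: {later_peers(cell)} of {len(peers(cell))} peers come later")
```

Under this encoding, every cell that constrains the top-left cell sits later in the sequence, so a model that writes cells strictly in order must commit to it before seeing any of them, whereas a model that attends over the whole block can weigh all of them at once.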
Use‑case considerations
The throughput advantage is most pronounced on a single accelerator with low‑to‑medium batch sizes, making the model valuable for real‑time, interactive AI applications that run locally and demand low latency. In high‑QPS cloud environments, where autoregressive models can be heavily batched, the parallel‑decoding benefit diminishes, and using DiffusionGemma may even increase serving costs.
Why diffusion‑style generation improves hardware utilisation
Traditional autoregressive models process one token at a time. For single‑user, local inference this leads to low GPU utilisation, because each step does very little arithmetic and the compute units spend most of their time waiting on memory rather than calculating. DiffusionGemma drafts the entire 256‑token block at once, giving the processor a larger compute workload per forward pass and raising utilisation, analogous to switching from a typewriter to a high‑speed printer.
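The utilisation argument can be illustrated with a simple roofline-style estimate. This is a sketch under loudly stated assumptions: the bandwidth and peak-compute figures are round illustrative numbers, not the specifications of any real GPU; 4-bit weights are assumed; and the model ignores attention cost, overlap losses, and the fact that an MoE routes different tokens to different experts, so a multi-token pass reads more than the per-token active weights.

```python
# Illustrative round numbers, NOT the specifications of any real GPU.
BANDWIDTH = 1e12       # bytes/s of memory bandwidth (assumed)
PEAK_FLOPS = 1e14      # FLOP/s of peak compute (assumed)
ACTIVE_PARAMS = 3.8e9  # active parameters per token, from the article
BYTES_PER_PARAM = 0.5  # 4-bit weights (assumed)


def utilisation(tokens_per_pass: int) -> float:
    """Share of peak compute used by one forward pass in a simple roofline model."""
    # Weights are read once per pass, however many tokens the pass handles.
    memory_s = ACTIVE_PARAMS * BYTES_PER_PARAM / BANDWIDTH
    # Roughly 2 FLOPs (multiply + add) per active weight per token.
    compute_s = tokens_per_pass * 2 * ACTIVE_PARAMS / PEAK_FLOPS
    # The pass takes as long as the slower of the two; the rest is idle compute.
    return compute_s / max(memory_s, compute_s)


for n in (1, 8, 64, 256):
    print(f"{n:>3} tokens per pass -> {utilisation(n):.0%} of peak compute")
```

With these made-up numbers a one-token pass uses only a few percent of the available compute, while a pass over a few dozen tokens or more is compute-bound. The same arithmetic explains the batching caveat for cloud serving: an autoregressive model handling many requests at once also processes many tokens per pass, so it reaches high utilisation without block decoding. Higher utilisation per pass does not translate one-to-one into speed-up, since a refinement-based decoder needs several passes per block.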
For the full announcement, see the Google blog post: https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/.
