Google Releases DiffusionGemma 26B MoE—Text Generation Up to 4× Faster

DiffusionGemma, Google's new 26‑billion‑parameter Mixture‑of‑Experts model, replaces token‑by‑token autoregression with a diffusion‑style output head that generates whole text blocks, delivering up to four‑fold speed gains on consumer GPUs while offering bidirectional attention and self‑correction, albeit with lower quality than standard Gemma 4.

Machine Heart
Machine Heart
Machine Heart
Google Releases DiffusionGemma 26B MoE—Text Generation Up to 4× Faster

DiffusionGemma is an open‑source 26 B parameter Mixture‑of‑Experts (MoE) model released under the Apache 2.0 license. It builds on the Gemma 4 family and incorporates research from Gemini Diffusion, adding a diffusion‑style output head designed to maximise generation speed.

Generation speed and throughput

Instead of autoregressive token‑by‑token prediction, DiffusionGemma generates a full 256‑token block in a single forward pass. By moving the decoding bottleneck from memory bandwidth to raw compute, token throughput can increase up to four‑fold: more than 1,000 tokens / s on an NVIDIA H100 and around 700 tokens / s on an RTX 5090.

Parameter activation and hardware requirements

The model has a total of 26 B parameters, but only about 3.8 B are activated during inference. After quantisation the model fits within the 18 GB VRAM limit of high‑end consumer GPUs.

Bidirectional attention and parallelism

Bidirectional attention lets each of the 256 generated tokens attend to every other token. This parallelism benefits non‑linear generation scenarios such as in‑line editing, code completion, amino‑acid sequence creation, or graph‑structured mathematical expressions. The model iteratively refines its output, allowing it to inspect the whole text block and correct errors across multiple refinement steps.

Quality trade‑off

Because speed is prioritised, overall output quality is lower than that of the standard Gemma 4 model, which remains the recommended choice for applications where the highest quality is essential.

Fine‑tuning example

Unsloth fine‑tuned DiffusionGemma to solve Sudoku puzzles. The task demonstrates that the model’s bidirectional attention makes strong token‑to‑token dependencies easier to handle compared with conventional autoregressive models.

Use‑case considerations

The throughput advantage is most pronounced on a single accelerator with low‑to‑medium batch sizes, making the model valuable for real‑time, interactive AI applications that run locally and demand low latency. In high‑QPS cloud environments where autoregressive models can be heavily batched, the parallel decoding benefit diminishes and may increase service costs.

Why diffusion‑style generation improves hardware utilisation

Traditional autoregressive models process one token at a time, leading to low GPU utilisation for single‑user, local inference because the hardware often waits for the next token. DiffusionGemma drafts the entire 256‑token block at once, giving the processor a larger compute workload per forward pass and raising utilisation, analogous to switching from a typewriter to a high‑speed printer.

For the full announcement, see the Google blog post: https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

GPU AccelerationMixture of Expertslocal inferenceDiffusionGemmabidirectional attentiontext diffusion
Machine Heart
Written by

Machine Heart

Professional AI media and industry service platform

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.