How DiffusionGemma Shifts the LLM Inference Bottleneck from Memory Bandwidth to Compute
DiffusionGemma is an experimental discrete text diffusion model built on the 26B MoE Gemma‑4 architecture. Rather than emitting one token per forward pass as standard autoregressive models do, it generates whole 256‑token blocks with bidirectional attention, which moves the inference bottleneck from memory bandwidth to GPU compute. This yields up to a four‑fold speedup on H100 and RTX 5090 GPUs, at the cost of lower output quality than standard autoregressive models.
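
The bottleneck shift described above can be illustrated with a rough roofline-style estimate. The sketch below is a simplification under stated assumptions, not a description of DiffusionGemma's actual implementation: it assumes about 2 FLOPs per parameter per token, assumes each weight is read from memory once per forward pass no matter how many tokens that pass processes, and ignores MoE routing, attention cost, the KV cache, and the number of denoising steps. The function names and the hardware figures (`PEAK_FLOPS`, `MEM_BANDWIDTH`) are hypothetical placeholders, not the specifications of any real GPU.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumption: ~2 FLOPs per parameter per token, and every weight is
    read once per pass, so the parameter count cancels out.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bottleneck(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float,
               bytes_per_param: float = 2.0) -> str:
    """Classify a forward pass as memory-bound or compute-bound."""
    # Ridge point of the roofline: the intensity at which the time spent
    # on arithmetic equals the time spent streaming weights from memory.
    ridge = peak_flops / mem_bandwidth
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute" if intensity >= ridge else "memory bandwidth"


# Hypothetical accelerator, chosen only for illustration.
PEAK_FLOPS = 400e12     # FLOP/s
MEM_BANDWIDTH = 2e12    # bytes/s

# 1 token per pass models autoregressive decoding; 256 models block generation.
for tokens in (1, 256):
    print(tokens, arithmetic_intensity(tokens),
          bottleneck(tokens, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions, a single-token pass does very little arithmetic per byte of weights loaded, so memory bandwidth sets the pace, while a 256‑token pass reuses each loaded weight across the whole block and becomes limited by compute instead. Where the crossover falls on real hardware depends on the actual peak throughput, bandwidth, and numeric precision.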
