How Unsloth Packs Google’s DiffusionGemma into 18 GB and Achieves 2000+ Tokens/s on a Single GPU

Unsloth has quantized Google’s DiffusionGemma into five GGUF variants, the smallest of which fits a 24 GB GPU, and a llama.cpp pull request adds a dedicated llama-diffusion-cli binary. The model exceeds 2000 tokens per second on an RTX 6000. This article covers the usage steps, the model-size trade-offs, and the limitations.

Old Zhang's AI Learning

Overview

Unsloth has quantized DiffusionGemma-26B-A4B-it into five GGUF variants. Support in llama.cpp comes from PR #24423, which adds a dedicated llama-diffusion-cli binary. On an RTX 6000 the model reaches more than 2000 tokens/s for a single request, roughly double the 1000 tokens/s achieved by vLLM on an H100.

Quantization variants

BF16 – 47 GB, full‑precision reference (not intended for everyday use).

Q8_0 – 25 GB, near‑lossless; fits a single 32 GB+ GPU (e.g., RTX 6000 Pro, V100 32G).

Q6_K – 21 GB, balanced middle ground.

Q5_K_M – 18 GB, suitable for memory‑constrained setups.

Q4_K_M – 16 GB, smallest; fits a single 24 GB GPU (4090/3090/RTX 6000).

“Fit” means the model can be loaded; Unsloth recommends total system memory (RAM + VRAM) of at least 18 GB to accommodate KV cache and canvas buffers.
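
As a rough illustration of the sizing logic above, the sketch below picks the largest variant whose file size plus a headroom margin fits a given VRAM budget. The file sizes come from the list above. The function name pick_quant and the 7 GB default headroom are assumptions for illustration; the headroom was chosen only so that the results agree with the pairings in this article (24 GB with Q4_K_M, 32 GB with Q8_0) and is not Unsloth guidance.

```python
# File sizes in GB, taken from the variant list above.
VARIANTS = {
    "BF16": 47,
    "Q8_0": 25,
    "Q6_K": 21,
    "Q5_K_M": 18,
    "Q4_K_M": 16,
}


def pick_quant(vram_gb, headroom_gb=7.0):
    """Return the largest variant that fits in vram_gb, or None.

    headroom_gb is an assumed margin for the KV cache and canvas
    buffers; tune it for your own workload.
    """
    fitting = [
        (size, name)
        for name, size in VARIANTS.items()
        if size + headroom_gb <= vram_gb
    ]
    return max(fitting)[1] if fitting else None


for vram in (16, 24, 32, 48):
    print(vram, "GB ->", pick_quant(vram))
```

Lowering the headroom makes the helper more optimistic: with a 2 GB margin, a 24 GB card would be matched with Q6_K instead.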

Method 1 – llama.cpp native route

This approach is for users who prefer command‑line control.

Compile the dedicated branch

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
gh pr checkout 24423
# CUDA build; on Apple Silicon omit -DGGML_CUDA=ON (Metal is enabled by default)
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release --target llama-diffusion-cli

The build produces the binary llama-diffusion-cli, not the standard llama-cli.

Download a GGUF file

pip install -U "huggingface_hub[cli]"
# --include takes a glob pattern; use "*Q8_0*" to fetch that variant instead
hf download unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --local-dir unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --include "*Q4_K_M*"
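
A note on the --include pattern: to my understanding, huggingface_hub filters repository files with shell-style glob matching, so the pattern has to match the whole filename, wildcards included. The sketch below uses Python's fnmatch module to show the difference. The file list is hypothetical; only the Q4_K_M filename appears in this article, and the Q8_0 name is formed by analogy.

```python
from fnmatch import fnmatch

# Hypothetical repository listing (see the note above).
files = [
    "diffusiongemma-26B-A4B-it-Q4_K_M.gguf",
    "diffusiongemma-26B-A4B-it-Q8_0.gguf",
    "README.md",
]


def select(pattern):
    """Return the files that a glob-style include pattern would keep."""
    return [name for name in files if fnmatch(name, pattern)]


print(select("Q4_K_M"))    # bare variant name, no wildcards
print(select("*Q4_K_M*"))  # variant name wrapped in wildcards
```

The bare name selects nothing because it is compared against the full filename, while the wildcard form selects the single Q4_K_M file.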

Run a conversation

./build/bin/llama-diffusion-cli \
  -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv -n 2048

Parameters: -ngl 99 offloads all layers to the GPU (use -ngl 0 for CPU-only). -cnv enables multi-turn conversation mode. -n 2048 sets the target token count, which also determines the number of diffusion blocks and the batch/context size.
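
To make the link between -n and the number of diffusion blocks concrete, here is a minimal sketch. It assumes that each block is one 256-token canvas (the canvas size quoted in the visualisation section) and that the block count is the token target divided by the canvas size, rounded up. The actual rule inside llama-diffusion-cli may differ.

```python
import math

CANVAS_TOKENS = 256  # canvas size quoted in this article


def diffusion_blocks(n_tokens, canvas=CANVAS_TOKENS):
    """Assumed rule: one block per canvas, rounding up for a partial block."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return math.ceil(n_tokens / canvas)


print(diffusion_blocks(2048))  # 2048 / 256 = 8
```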

The entropy‑bound sampler is enabled by default with temperature linearly decaying from 0.8 to 0.4, entropy cap 0.1, and a maximum of 48 denoising steps – the recommended configuration for DiffusionGemma.
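
The linear temperature decay can be sketched as follows. This is an illustration under stated assumptions, not the sampler's actual code: it assumes the schedule is indexed by denoising step, that step 0 uses 0.8 and the 48th step uses 0.4, and that every step in between is interpolated evenly. The function name temperature_at is hypothetical.

```python
def temperature_at(step, max_steps=48, t_start=0.8, t_end=0.4):
    """Linearly interpolate from t_start (step 0) to t_end (last step)."""
    if not 0 <= step < max_steps:
        raise ValueError("step out of range")
    if max_steps == 1:
        return t_start
    frac = step / (max_steps - 1)
    return t_start + (t_end - t_start) * frac


for s in (0, 12, 24, 47):
    print(s, round(temperature_at(s), 3))
```

Early steps sample more freely and later steps more conservatively, which is the usual motivation for a decaying temperature.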

Method 2 – Unsloth Studio one‑click route

Unsloth Studio bundles DiffusionGemma, removing the need to compile llama.cpp. It is an open‑source local AI Web UI (macOS, Windows, Linux).

Installation (choose one):

# macOS / Linux / WSL
curl -fsSL https://unsloth.ai/install.sh | sh
# Windows PowerShell
irm https://unsloth.ai/install.ps1 | iex

Start the UI:

unsloth studio -H 0.0.0.0 -p 8888

Open http://127.0.0.1:8888, set a password, go to the Studio Chat tab, search for “DiffusionGemma”, select a quantization version, and begin chatting. All diffusion sampling parameters are pre‑configured.

Real‑time diffusion visualisation

Add the --diffusion-visual flag to watch the 256‑token canvas denoise step‑by‑step:

./build/bin/llama-diffusion-cli \
  -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -cnv -n 2048 --diffusion-visual
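
For scripted runs, the invocation can be assembled in Python. This is a sketch that uses only the flags shown in this article; build_command is a hypothetical helper, and the binary path assumes the build directory from Method 1.

```python
import shlex

BINARY = "./build/bin/llama-diffusion-cli"


def build_command(model_path, n_tokens=2048, gpu_layers=99, visual=False):
    """Assemble the argument list using only flags shown in this article."""
    cmd = [
        BINARY,
        "-m", model_path,
        "-ngl", str(gpu_layers),
        "-cnv",
        "-n", str(n_tokens),
    ]
    if visual:
        cmd.append("--diffusion-visual")
    return cmd


model = "unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q4_K_M.gguf"
# Print the shell-quoted command; pass the list to subprocess.run() to launch it.
print(shlex.join(build_command(model, visual=True)))
```

Keeping the arguments as a list avoids shell-quoting problems when the model path contains spaces.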

Fine‑tuning capability

Unsloth provides a Colab notebook (requires an A100) that pre‑sets all diffusion‑specific parameters. The official demo fine‑tunes the model on a Sudoku dataset: the base model produces random outputs, while after SFT it reliably solves every puzzle.

Speed trade‑offs

High first-token latency: time to first token (TTFT) remains high because the model must denoise the entire 256-token canvas before emitting the first token.

Limited concurrency: each conversation maintains a canvas × vocab-size buffer, which consumes several times more VRAM than an autoregressive model needs. This suits single-user scenarios but not high-throughput services.
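
To get a feel for the size of a canvas × vocab buffer, the sketch below multiplies it out. The 256-token canvas is from this article. The vocabulary size of 262,144 and the 4-byte (fp32) element width are assumptions for illustration and may not match DiffusionGemma or the llama.cpp implementation.

```python
def logits_buffer_bytes(canvas_tokens, vocab_size, bytes_per_value=4):
    """Size in bytes of one dense canvas x vocab buffer."""
    return canvas_tokens * vocab_size * bytes_per_value


CANVAS = 256      # from this article
VOCAB = 262_144   # assumed vocabulary size

per_conversation = logits_buffer_bytes(CANVAS, VOCAB)
print(per_conversation / 2**20, "MiB per conversation")
print(8 * per_conversation / 2**30, "GiB for 8 concurrent conversations")
```

Because the buffer scales linearly with the number of concurrent conversations, the cost adds up quickly on a card whose memory is already mostly taken by the weights.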

Accuracy impact: MMLU Pro 77.6% vs 82.6% (Gemma 4), AIME 2026 69.1% vs 88.3%, Codeforces Elo 1429 vs 1718. That is a drop of 5.0 and 19.2 percentage points on the two accuracy benchmarks and of 289 Elo points (about 17%), in exchange for speed.
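
The size of each gap can be checked directly from the quoted scores. The sketch below computes the absolute drop and the drop relative to the Gemma 4 baseline; the scores are the ones quoted in this article, and the helper name drop is just for illustration.

```python
# (DiffusionGemma, Gemma 4) scores as quoted in this article.
SCORES = {
    "MMLU Pro (%)": (77.6, 82.6),
    "AIME 2026 (%)": (69.1, 88.3),
    "Codeforces Elo": (1429, 1718),
}


def drop(diffusion, baseline):
    """Return (absolute drop, relative drop in percent of the baseline)."""
    absolute = baseline - diffusion
    return absolute, 100 * absolute / baseline


for name, (diffusion, baseline) in SCORES.items():
    absolute, relative = drop(diffusion, baseline)
    print(f"{name}: -{absolute:.1f} absolute, -{relative:.1f}% relative")
```

The gap is smallest on MMLU Pro and largest on AIME, which is consistent with the advice to use the autoregressive model for competition-style reasoning.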

PR status: PR #24423 is still a draft marked “too large”; native llama.cpp support will be available only after the PR is merged.

Recommended scenarios

24 GB single‑GPU inference (4090/3090/RTX 6000) – use Q4_K_M.

Apple Silicon with large unified memory – use Q4_K_M or Q5_K_M (Metal support built‑in).

Private‑domain SFT – diffusion fine‑tuning pipeline available.

Experiencing diffusion visualisation – use --diffusion-visual.

High‑concurrency API services – not suitable; autoregressive models are preferable.

Olympiad-level reasoning or competition programming – run the autoregressive Gemma 4 instead.

Streaming chat or real‑time typing – TTFT is too slow for these use cases.
