How Unsloth Packs Google’s DiffusionGemma into 18 GB and Achieves 2000+ Tokens/s on a Single GPU
Unsloth quantizes Google’s DiffusionGemma into five GGUF variants, the smallest fitting a 24 GB GPU, adds a dedicated llama‑diffusion‑cli, and demonstrates over 2000 tokens per second on an RTX 6000, while outlining usage steps, model‑size trade‑offs, and limitations.
Overview
DiffusionGemma‑26B‑A4B‑it has been quantized into five GGUF variants. llama.cpp support comes from PR #24423, which adds a dedicated llama-diffusion-cli binary. On an RTX 6000 the model exceeds 2000 tokens/s for a single request, roughly double the 1000 tokens/s that vLLM achieves on an H100.
Quantization variants
BF16 – 47 GB, full‑precision reference (not intended for everyday use).
Q8_0 – 25 GB, near‑lossless; fits a single 32 GB+ GPU (e.g., RTX 6000 Pro, V100 32G).
Q6_K – 21 GB, balanced middle ground.
Q5_K_M – 18 GB, suitable for memory‑constrained setups.
Q4_K_M – 16 GB, smallest; fits a single 24 GB GPU (4090/3090/RTX 6000).
“Fit” means the model can be loaded; Unsloth recommends total system memory (RAM + VRAM) of at least 18 GB to accommodate KV cache and canvas buffers.
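The size list above can be turned into a small lookup. What follows is a minimal Python sketch, not part of Unsloth's tooling. The function name pick_variant and the 7 GB default headroom are illustrative assumptions; the headroom was chosen only so that the results agree with the pairings stated above (24 GB gives Q4_K_M, 32 GB gives Q8_0).

```python
# File sizes in GB, copied from the variant list in this article.
VARIANT_SIZES_GB = {
    "BF16": 47,
    "Q8_0": 25,
    "Q6_K": 21,
    "Q5_K_M": 18,
    "Q4_K_M": 16,
}


def pick_variant(memory_gb, headroom_gb=7.0):
    """Return the largest variant whose file fits in memory_gb minus headroom.

    headroom_gb is a hypothetical allowance for KV cache and canvas buffers;
    it is not a figure published by Unsloth.
    """
    budget = memory_gb - headroom_gb
    fitting = [(size, name) for name, size in VARIANT_SIZES_GB.items() if size <= budget]
    if not fitting:
        return None  # nothing fits under this budget
    return max(fitting)[1]


if __name__ == "__main__":
    for gb in (16, 24, 32, 48):
        print(gb, "GB ->", pick_variant(gb))
```

Treat the output as a starting point only: actual memory use depends on context length and the buffers described later in this article.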
Method 1 – llama.cpp native route
This approach is for users who prefer command‑line control.
Compile the dedicated branch
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
gh pr checkout 24423
# CUDA build (on Apple Silicon, drop -DGGML_CUDA=ON; Metal is enabled by default)
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release --target llama-diffusion-cli
The build produces the binary llama-diffusion-cli, not the standard llama-cli.
Download a GGUF file
pip install -U "huggingface_hub[cli]"
hf download unsloth/diffusiongemma-26B-A4B-it-GGUF \
--local-dir unsloth/diffusiongemma-26B-A4B-it-GGUF \
--include "*Q4_K_M*" # replace with "*Q8_0*" for that variant
Run a conversation
./build/bin/llama-diffusion-cli \
-m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
-ngl 99 -cnv -n 2048
Parameters: -ngl 99 offloads all layers to the GPU (use -ngl 0 for CPU‑only). -cnv enables multi‑turn conversation mode. -n 2048 sets the target token count; this also determines the number of diffusion blocks and the batch/context size.
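As a rough illustration of how -n relates to the canvas, the sketch below assumes that each diffusion block is one 256‑token canvas (the canvas size quoted in this article) and that the block count is the target length divided by the canvas size, rounded up. The rounding rule and the function name diffusion_blocks are assumptions, not something confirmed by the llama-diffusion-cli source.

```python
import math

CANVAS_TOKENS = 256  # canvas size quoted in this article


def diffusion_blocks(n_tokens, canvas_tokens=CANVAS_TOKENS):
    """Number of canvases needed to cover n_tokens (assumed ceiling division)."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return math.ceil(n_tokens / canvas_tokens)


if __name__ == "__main__":
    # With -n 2048 and a 256-token canvas: 2048 / 256 = 8 blocks.
    print(diffusion_blocks(2048))
```

Under this assumption, a value of -n that is not a multiple of 256 would still occupy a full final canvas.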
The entropy‑bound sampler is enabled by default with temperature linearly decaying from 0.8 to 0.4, entropy cap 0.1, and a maximum of 48 denoising steps – the recommended configuration for DiffusionGemma.
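The linear temperature decay can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the sampler's actual code: it assumes the temperature is interpolated per denoising step with both endpoints included, and the function name temperature_schedule is hypothetical. The entropy cap is not modeled here, and because 48 is a maximum, a real run may stop before the final step.

```python
def temperature_schedule(steps=48, t_start=0.8, t_end=0.4):
    """Linearly interpolate the temperature from t_start to t_end.

    Assumes one temperature per denoising step, endpoints included.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if steps == 1:
        return [t_start]
    span = t_end - t_start
    return [t_start + span * i / (steps - 1) for i in range(steps)]


if __name__ == "__main__":
    schedule = temperature_schedule()
    # Show the first, middle, and last temperatures of the schedule.
    for i in (0, len(schedule) // 2, len(schedule) - 1):
        print(f"step {i:2d}: temperature {schedule[i]:.3f}")
```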
Method 2 – Unsloth Studio one‑click route
Unsloth Studio bundles DiffusionGemma, removing the need to compile llama.cpp. It is an open‑source local AI Web UI (macOS, Windows, Linux).
Installation (choose one):
# macOS / Linux / WSL
curl -fsSL https://unsloth.ai/install.sh | sh
# Windows PowerShell
irm https://unsloth.ai/install.ps1 | iex
Start the UI:
unsloth studio -H 0.0.0.0 -p 8888
Open http://127.0.0.1:8888, set a password, go to the Studio Chat tab, search for “DiffusionGemma”, select a quantization version, and begin chatting. All diffusion sampling parameters are pre‑configured.
Real‑time diffusion visualisation
Add the --diffusion-visual flag to watch the 256‑token canvas denoise step‑by‑step:
./build/bin/llama-diffusion-cli \
-m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
-ngl 99 -cnv -n 2048 --diffusion-visual
Fine‑tuning capability
Unsloth provides a Colab notebook (requires an A100) with all diffusion‑specific parameters preset. The official demo fine‑tunes the model on a Sudoku dataset: the base model produces random outputs, whereas after supervised fine‑tuning (SFT) it is reported to solve every puzzle.
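A claim such as "solves every puzzle" is easy to verify mechanically. The checker below is a minimal sketch of how model outputs could be scored; it is not taken from the Unsloth notebook, and the function name is_valid_solution and the convention that 0 marks an empty cell are assumptions made for illustration.

```python
def is_valid_solution(grid, puzzle=None):
    """Check a 9x9 grid: each row, column, and 3x3 box holds 1..9 exactly once.

    If puzzle is given (0 marks an empty cell), the solution must also
    leave every pre-filled clue unchanged.
    """
    digits = set(range(1, 10))
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    rows = [set(row) for row in grid]
    cols = [set(grid[r][c] for r in range(9)) for c in range(9)]
    boxes = [
        set(grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3))
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    if any(group != digits for group in rows + cols + boxes):
        return False
    if puzzle is not None:
        return all(
            puzzle[r][c] in (0, grid[r][c]) for r in range(9) for c in range(9)
        )
    return True


if __name__ == "__main__":
    # A known-valid grid built from a shifting pattern, used as a smoke test.
    solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
    print(is_valid_solution(solved))
```

Running such a check over a held-out set of puzzles before and after fine‑tuning would give a solve rate that can be compared directly.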
Speed trade‑offs
High first‑token latency: time to first token (TTFT) remains high because the model must denoise the entire 256‑token canvas before emitting the first token.
Limited concurrency: each conversation maintains a canvas × vocab‑size buffer, consuming several times more VRAM than autoregressive models; suitable for single‑user scenarios but not high‑throughput services.
Accuracy impact: MMLU Pro 77.6 % vs 82.6 % (Gemma 4), AIME 2026 69.1 % vs 88.3 %, Codeforces ELO 1429 vs 1718. That is a drop of 5.0 points on MMLU Pro, 19.2 points on AIME 2026, and 289 ELO points (about 17 %) on Codeforces, in exchange for speed.
PR status: PR #24423 is still a draft marked “too large”; native llama.cpp support will be available only after the PR is merged.
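The concurrency point above can be made concrete with back‑of‑envelope arithmetic. The sketch below estimates the size of one canvas × vocab‑size buffer. The vocabulary size of 262,144 and the 4‑byte (float32) element size are placeholder assumptions, since the article does not state either value for DiffusionGemma; substitute the real figures from the model's configuration.

```python
def canvas_buffer_bytes(canvas_tokens=256, vocab_size=262_144, bytes_per_value=4):
    """Bytes needed for one canvas x vocab buffer (one value per pair).

    vocab_size and bytes_per_value are placeholder assumptions,
    not figures published for DiffusionGemma.
    """
    return canvas_tokens * vocab_size * bytes_per_value


def to_mib(num_bytes):
    """Convert bytes to mebibytes."""
    return num_bytes / (1024 ** 2)


if __name__ == "__main__":
    per_canvas = canvas_buffer_bytes()
    per_token = canvas_buffer_bytes(canvas_tokens=1)
    print(f"full canvas: {to_mib(per_canvas):.0f} MiB")
    print(f"single token row: {to_mib(per_token):.0f} MiB")
    print(f"ratio: {per_canvas // per_token}x")
```

Under these placeholder values, each concurrent conversation would need its own buffer of this size, which is why the cost grows quickly with the number of simultaneous users.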
Recommended scenarios
24 GB single‑GPU inference (4090/3090/RTX 6000) – use Q4_K_M.
Apple Silicon with large unified memory – use Q4_K_M or Q5_K_M (Metal support built‑in).
Private‑domain SFT – diffusion fine‑tuning pipeline available.
Experiencing diffusion visualisation – use --diffusion-visual.
High‑concurrency API services – not suitable; autoregressive models are preferable.
Olympiad‑level reasoning or competition programming – run Gemma 4 autoregressive version instead.
Streaming chat or real‑time typing – TTFT is too slow for these use cases.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
