Google Boosts Gemma 4 Inference Speed Up to 3× with MTP Drafter and Day‑0 vLLM Support
Google’s new Multi‑Token Prediction (MTP) drafter for Gemma 4 delivers up to three‑fold inference speedups across hardware and frameworks—validated by official benchmarks and independent DGX Spark tests—while preserving identical output quality, and is immediately usable via Hugging Face, vLLM, MLX, Ollama and edge‑device runtimes.
Gemma 4 overview
Gemma 4 is a family of open‑source large language models ranging from 2 B to 31 B parameters. It supports text, image, video, and audio modalities and achieves over 85 % on the MMLU‑Pro benchmark. In the first four weeks after release it recorded more than 60 million downloads.
Multi‑Token Prediction (MTP) speed gains
The Google blog provides a speed‑up chart showing up to three‑fold higher token‑per‑second throughput across various hardware, frameworks, and model sizes when using the MTP drafter.
Why MTP speeds up inference
Standard autoregressive LLM inference is limited by memory bandwidth, not compute: for every generated token, the GPU must stream all of the model's parameters from VRAM to the compute units, which therefore sit idle most of the time.
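This bandwidth ceiling can be estimated with back-of-the-envelope arithmetic. The sketch below assumes a GB10-class device with roughly 273 GB/s of memory bandwidth (an assumed figure, not one from the article) and the 31 B dense model in bf16; the resulting ceiling of a few tokens per second is consistent with the ~3.65 tokens/s batch-1 baseline reported in the DGX Spark benchmark later in this article.

```python
# Roofline-style estimate of batch-1 decode throughput for a
# memory-bandwidth-bound model: every decoded token must read all
# weights from memory once, so
#   tokens/s  <=  memory_bandwidth / model_size_in_bytes

params = 31e9            # 31 B parameters (Gemma 4 31B dense)
bytes_per_param = 2      # bf16 weights
bandwidth = 273e9        # assumed GB10 memory bandwidth, bytes/s

model_bytes = params * bytes_per_param     # ~62 GB of weights
tokens_per_s = bandwidth / model_bytes     # upper bound at batch = 1
print(f"theoretical ceiling: {tokens_per_s:.1f} tokens/s")
```

Speculative decoding sidesteps this ceiling by verifying several drafted tokens in a single pass over the weights, amortizing the memory traffic.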
MTP puts that idle compute to work: a lightweight drafter model speculatively predicts several upcoming tokens. The drafter reuses the target model's activations and KV cache, so no extra context computation is required.
1. Target model (e.g., Gemma 4 31B) + lightweight drafter
2. Drafter reuses activations and KV cache to predict several tokens in parallel
3. Target model verifies tokens, keeps correct spans and generates one extra token
4. Discard incorrect tokens and resume from the divergence point

In short: the drafter predicts 4–8 tokens → the target model checks them → correct tokens are kept, wrong ones are recomputed.

Speculative decoding made turnkey
Speculative decoding was introduced by Google in 2022 (Fast Inference from Transformers via Speculative Decoding). Gemma 4 adds two contributions:
Official drafter released for each Gemma 4 size, eliminating the need for users to train a drafter.
Apache 2.0 open‑source release, immediately compatible with Hugging Face, Kaggle, and Day‑0 support for major runtimes.
Supported frameworks and entry points
Hugging Face Transformers – https://huggingface.co/collections/google/gemma-4
MLX (Apple Silicon) – https://huggingface.co/collections/mlx-community/gemma-4-assistant-mtp
vLLM (Day‑0) – https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html
SGLang (Day‑0) – https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4
Ollama – ollama run gemma4:31b-coding-mtp-bf16
Google AI Edge Gallery – Android/iOS apps (App Store / Play Store)
Day‑0 vLLM integration
Docker image: docker pull vllm/vllm-openai:gemma4-0505-cu129
Full recipe: https://recipes.vllm.ai/Google/gemma-4-26B-A4B-it
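A typical launch might look like the sketch below, based on vLLM's generic `--speculative-config` interface rather than the linked recipe; the model and drafter names, the `"mtp"` method string, and the token count are assumptions to verify against the official recipe before use.

```shell
# Serve Gemma 4 31B with its MTP drafter via vLLM's OpenAI-compatible
# server. All names and JSON fields below are illustrative; consult
# the official vLLM recipe for the exact configuration.
vllm serve google/gemma-4-31b-it \
  --speculative-config '{"model": "google/gemma-4-31b-it-assistant",
                         "method": "mtp",
                         "num_speculative_tokens": 4}'
```

Once running, any OpenAI-compatible client can point at the server unchanged; speculative decoding is transparent to callers.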
Independent DGX Spark benchmark
On an NVIDIA DGX Spark (GB10) the Gemma 4 31B model was tested with and without MTP:
concurrency = 1: 3.65 → 6.37 tokens/s (1.74×)
concurrency = 4: 14.34 → 23.59 tokens/s (1.65×)
concurrency = 8: 14.37 → 24.18 tokens/s (1.68×)
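The speedup factors quoted above follow directly from the measured throughputs; a quick sanity check of the arithmetic:

```python
# Sanity-check the quoted speedup factors from the DGX Spark run:
# concurrency -> (baseline tokens/s, MTP tokens/s).
runs = {1: (3.65, 6.37), 4: (14.34, 23.59), 8: (14.37, 24.18)}
speedups = {c: mtp / base for c, (base, mtp) in runs.items()}

# Matches the article's 1.74x / 1.65x / 1.68x within rounding.
assert all(abs(speedups[c] - x) < 0.01
           for c, x in [(1, 1.74), (4, 1.65), (8, 1.68)])
```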
Google's claim was up to 2×; the independent test shows the speedup is real rather than inflated marketing.
DGX Spark (GB10)
+ gemma-4-31b-it
+ gemma-4-31b-it-assistant # MTP drafter
+ vLLM (PR 41745 compiled)

Key practical details
On Apple Silicon, the 26 B MoE model at batch = 1 is constrained by expert-routing overhead; raising concurrency to 4–8 yields roughly a 2.2× speedup.
Both the 26 B MoE and 31 B dense models run on consumer‑grade GPUs.
Smaller E2B/E4B models reduce CPU wake‑up latency and battery draw on mobile devices.
Zero quality loss: every emitted token is verified by the target model, so the final output is identical to non-MTP inference.
Release timeline
Early April – full‑size multimodal Gemma 4 models released.
May 5 – MTP drafter released to accelerate the same models.
Conclusion
With the MTP drafter, Gemma 4 delivers a 2–3× speed increase with unchanged output quality, making the models practical for latency-sensitive, edge, and low-GPU-count deployments.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
