Google Boosts Gemma 4 Inference Speed Up to 3× with MTP Drafter and Day‑0 vLLM Support

Google’s new Multi‑Token Prediction (MTP) drafter for Gemma 4 delivers up to three‑fold inference speedups across hardware and frameworks—validated by official benchmarks and independent DGX Spark tests—while preserving identical output quality, and is immediately usable via Hugging Face, vLLM, MLX, Ollama and edge‑device runtimes.

Gemma 4 overview

Gemma 4 is a family of open‑source large language models ranging from 2 B to 31 B parameters. It supports text, image, video, and audio modalities and achieves over 85 % on the MMLU‑Pro benchmark. In the first four weeks after release it recorded more than 60 million downloads.

Multi‑Token Prediction (MTP) speed gains

The Google blog provides a speed‑up chart showing up to three‑fold higher token‑per‑second throughput across various hardware, frameworks, and model sizes when using the MTP drafter.

Gemma 4 MTP drafter speed‑ups across hardware

Why MTP speeds up inference

Standard LLM inference is limited by memory bandwidth, not compute: for each generated token the GPU must stream billions of parameters from VRAM to its compute units, which sit largely idle while the weights move.
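
A quick back‑of‑the‑envelope calculation makes the bound concrete: at batch 1, every decoded token streams the full weight set once, so throughput is capped near memory bandwidth divided by model size. The sketch below assumes bf16 weights and an approximate bandwidth figure for the DGX Spark's GB10 (our assumption, not a number from the article):

```python
# Back-of-envelope decode ceiling at batch 1; the bandwidth figure is an
# assumed spec for illustration, not taken from the article.
params = 31e9                  # Gemma 4 31B dense
bytes_per_param = 2            # bf16 weights
weight_bytes = params * bytes_per_param     # ~62 GB streamed per token
bandwidth_bytes_per_s = 273e9               # assumed DGX Spark (GB10) bandwidth

ceiling = bandwidth_bytes_per_s / weight_bytes
print(f"~{ceiling:.1f} tokens/s upper bound at batch 1")   # ≈ 4.4 tokens/s
```

That ceiling lines up with the ~3.65 tokens/s baseline measured on the DGX Spark later in this article; MTP can beat it because a single weight pass then verifies several tokens at once.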

MTP uses idle compute to pre‑predict multiple tokens with a lightweight drafter model. The drafter reuses the target model’s KV cache, so no extra context computation is required.

1. Target model (e.g., Gemma 4 31B) + lightweight drafter
2. Drafter reuses activations and KV cache to predict several tokens in parallel
3. Target model verifies tokens, keeps correct spans and generates one extra token
4. Discard incorrect tokens and resume from the divergence point
Draft model predicts 4–8 tokens → target model checks them → correct tokens kept, wrong ones recomputed (see the code sketch below)
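
A minimal, runnable sketch of that draft‑and‑verify loop, assuming greedy decoding; the toy "models" are hypothetical stand‑ins for illustration, not the Gemma 4 API:

```python
# Runnable sketch of the draft-and-verify loop; toy stand-ins, not a real API.

def speculative_step(propose, next_tokens, tokens, k=6):
    """One draft-and-verify round; returns the extended token sequence."""
    # 1. Drafter cheaply proposes k candidate tokens.
    draft = propose(tokens, k)

    # 2. One target forward pass over the whole draft span: preds[i] is the
    #    target's own greedy choice for the token following position i.
    preds = next_tokens(tokens + draft)

    # 3. Accept the longest draft prefix matching the target's choices.
    n, accepted = len(tokens), 0
    while accepted < k and draft[accepted] == preds[n - 1 + accepted]:
        accepted += 1

    # 4. The same pass yields one "free" target token after the accepted
    #    span, so every round advances by at least one token.
    return tokens + draft[:accepted] + [preds[n - 1 + accepted]]

# Toy stand-ins: the "target" maps each token t to (2t + 1) % 101; the
# "drafter" follows the same rule but guesses wrong every third position.
target_rule = lambda t: (2 * t + 1) % 101
next_tokens = lambda seq: [target_rule(t) for t in seq]

def propose(tokens, k):
    out, last = [], tokens[-1]
    for i in range(k):
        nxt = target_rule(last)
        if (len(tokens) + i) % 3 == 0:      # occasional wrong guess
            nxt = (nxt + 1) % 101
        out.append(nxt)
        last = nxt
    return out

tokens = [7]
for _ in range(4):                           # four draft-and-verify rounds
    tokens = speculative_step(propose, next_tokens, tokens)
print(tokens)
```

Because every kept token matches what the target itself would have produced, the accelerated output is token‑for‑token identical to running the target alone, which is the "zero quality loss" property noted below.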

Speculative decoding made turnkey

Speculative decoding was introduced by Google in 2022 (Fast Inference from Transformers via Speculative Decoding). Gemma 4 adds two contributions:

An official drafter is released for each Gemma 4 size, so users no longer need to train their own.

An Apache 2.0 open‑source release, immediately available on Hugging Face and Kaggle, with Day‑0 support across major runtimes.

Supported frameworks and entry points

Hugging Face Transformers – https://huggingface.co/collections/google/gemma-4 (usage example after this list)

MLX (Apple Silicon) – https://huggingface.co/collections/mlx-community/gemma-4-assistant-mtp

vLLM (Day‑0) – https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html

SGLang (Day‑0) – https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4

Ollama – ollama run gemma4:31b-coding-mtp-bf16

Google AI Edge Gallery – Android/iOS apps (App Store / Play Store)
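
In Hugging Face Transformers, for instance, a drafter plugs into standard assisted generation via the `assistant_model` argument of `generate()`. The checkpoint names below are illustrative, borrowed from the names cited in the DGX Spark benchmark later in this article:

```python
# Hedged example of assisted generation (speculative decoding) in Transformers.
# Checkpoint names are assumptions; adjust to the actual Gemma 4 collection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31b-it")
target = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31b-it", torch_dtype=torch.bfloat16, device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31b-it-assistant", torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Explain speculative decoding briefly.", return_tensors="pt").to(target.device)
# Passing assistant_model switches generate() into draft-and-verify mode.
out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```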

Day‑0 vLLM integration

Docker image: docker pull vllm/vllm-openai:gemma4-0505-cu129

Full recipe: https://recipes.vllm.ai/Google/gemma-4-26B-A4B-it
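
Through vLLM's Python API, serving with the drafter might look like the sketch below; `speculative_config` is vLLM's generic speculative‑decoding interface, and the exact keys for Gemma 4 may differ from the official recipe linked above:

```python
# Illustrative sketch only; consult the vLLM recipe above for the exact knobs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-31b-it",
    speculative_config={
        "model": "google/gemma-4-31b-it-assistant",  # MTP drafter
        "num_speculative_tokens": 6,                 # draft length per round
    },
)
outputs = llm.generate(
    ["Why does speculative decoding leave output quality unchanged?"],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```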


Independent DGX Spark benchmark

On an NVIDIA DGX Spark (GB10) the Gemma 4 31B model was tested with and without MTP:

concurrency = 1: 3.65 → 6.37 tokens/s (1.74×)

concurrency = 4: 14.34 → 23.59 tokens/s (1.65×)

concurrency = 8: 14.37 → 24.18 tokens/s (1.68×)

Google claimed up to 2×; the independent test confirms the speedups are real.
Test setup:
DGX Spark (GB10)
+ gemma-4-31b-it
+ gemma-4-31b-it-assistant   # MTP drafter
+ vLLM (built from PR 41745)
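
To reproduce a comparison like this, a minimal throughput probe against a locally served vLLM endpoint could look as follows; the endpoint, model name, and prompt are assumptions, since the article does not describe the benchmark harness:

```python
# Minimal tokens/s probe against a local vLLM OpenAI-compatible server.
# Run once with and once without the drafter enabled, then compare.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.time()
resp = client.completions.create(
    model="google/gemma-4-31b-it",
    prompt="Write a short essay on memory-bandwidth-bound inference.",
    max_tokens=256,
    temperature=0.0,
)
elapsed = time.time() - start
print(resp.usage.completion_tokens / elapsed, "tokens/s")
```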

Key practical details

On Apple Silicon, the 26 B MoE model runs into expert‑routing overhead at batch = 1; increasing concurrency to 4–8 yields approximately a 2.2× speedup.

Both the 26 B MoE and 31 B dense models run on consumer‑grade GPUs.

Smaller E2B/E4B models reduce CPU wake‑up latency and battery draw on mobile devices.

Zero quality loss: every emitted token is verified by the target model, so the output is identical to non‑MTP inference.

Release timeline

Early April – full‑size multimodal Gemma 4 models released.

May 5 – MTP drafter released to accelerate the same models.

Conclusion

With the MTP drafter, Gemma 4 delivers a 2–3× speed increase with unchanged output quality, making the models practical for latency‑sensitive, edge, and low‑GPU‑count deployments.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Speculative Decoding · vLLM · LLM Inference · multi‑modal · Apple Silicon · Gemma 4 · MTP drafter
Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
