Deploy Gemma 4 Locally: Ollama, llama.cpp, MLX, vLLM + TurboQuant Optimization

The article reviews the four Gemma 4 model variants, analyzes their architecture and benchmark results versus Qwen3.5, and provides step‑by‑step instructions for local deployment using Ollama, llama.cpp, MLX and vLLM, while highlighting TurboQuant memory and weight compression techniques.


Gemma 4 ships in four model sizes—31B Dense, 26B MoE, 4B (E4B) and 2.3B (E2B)—each targeting different hardware: desktop workstations, single‑card H100, mobile/Jetson/Raspberry Pi, and edge devices. The "E" models use Per‑Layer Embeddings (PLE) to maximize effective parameter efficiency.

Unified capabilities

Multimodal input: all models support image and video; the smaller models also accept audio and speech.

Extended context: 256K tokens for the large models, 128K for the small ones.

Agent workflow: native function calls, structured JSON output, system instructions.

140+ languages: natively trained on more than 140 languages.

Code generation: high‑quality offline code assistance.

Benchmark performance

Google’s official leaderboard places Gemma 4 31B third on the Arena AI text track and the 26B MoE sixth, noting they outperform models 20× larger. Third‑party Artificial Analysis scores the 31B version 85.7% on GPQA Diamond, second only to Qwen 3.5 27B (85.8%). Token efficiency is also better: the 31B uses about 1.2M output tokens versus 1.5–1.6M for the Qwen 3.5 variants.

Head‑to‑head with Qwen 3.5 27B

An item‑by‑item comparison shows Qwen 3.5 leading on most metrics, yet Gemma 4 31B matches its Elo score on the Arena AI leaderboard, indicating comparable human‑preference quality despite the differing benchmark numbers.

Architecture analysis

According to AI blogger Sebastian Raschka, Gemma 4 retains a classic Pre/Post‑norm layout with a 5:1 mixed‑attention scheme (local sliding‑window + global full attention) and Grouped Query Attention (GQA). Performance gains stem mainly from improved training data and methods rather than architectural changes.

Local deployment guide

Ollama (0.20+) supports all four variants:

ollama run gemma4:e2b   # 2B effective, edge
ollama run gemma4:e4b   # 4B effective, mobile
ollama run gemma4:26b   # 26B MoE (4B active)
ollama run gemma4:31b   # 31B Dense
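
To sanity‑check the agent features once a model is pulled, Ollama's /api/chat endpoint accepts a tools array. The request below is a minimal sketch that assumes the gemma4:26b tag from above and a made‑up get_weather function definition:

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b",
  "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Look up current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "stream": false
}'

If the model decides to call the tool, the response carries a message.tool_calls entry with the chosen function name and arguments instead of plain text.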

llama.cpp (install latest via Homebrew):

brew install llama.cpp --HEAD
llama-server -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M
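
llama-server exposes an OpenAI‑compatible HTTP API on port 8080 by default, so a quick smoke test against the model loaded above can be as simple as this sketch:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain in two sentences what a KV cache does."}],
    "max_tokens": 128
  }'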

MLX (Mac) adds full‑suite support, including the visual and audio models. Install with:

uv pip install -U mlx-vlm

TurboQuant KV‑cache compression reduces memory usage dramatically (e.g., 13.3 GB → 4.9 GB for the 31B) while preserving quality.

Run a compressed model on Mac:

uv run mlx_vlm.generate --model google/gemma-4-31b-it --kv-bits 3.5 --kv-quant-scheme turboquant
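
Because mlx-vlm also covers the vision models, an image prompt is just extra flags on the same command. The sketch below keeps the article's invocation style and assumes the standard mlx-vlm options (--prompt, --image, --max-tokens), with chart.png as a placeholder path:

uv run mlx_vlm.generate --model google/gemma-4-31b-it \
  --prompt "Describe the trend in this chart." \
  --image chart.png --max-tokens 256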

Unsloth quantized builds let the 2B–4B models run in about 6 GB of RAM, and the larger models in about 18 GB.

vLLM offers native multimodal support and the full 256K context across GPUs and TPUs. A sample tool‑call benchmark shows Gemma 4 31B and Qwen 3.5 27B both achieving a perfect 15/15 score.
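
Serving with vLLM is a single command. The sketch below assumes the same google/gemma-4-31b-it checkpoint used in the MLX example and a two‑GPU machine, using vLLM's standard flags:

vllm serve google/gemma-4-31b-it \
  --max-model-len 262144 \
  --tensor-parallel-size 2

The resulting server speaks the same OpenAI‑compatible chat API as the llama-server example above, on port 8000 by default.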

Real‑world performance

26B MoE on a single RTX 4090: decode 162 token/s, pre‑fill 8,400 token/s, full 262K context, 19.5 GB VRAM.

Dual‑GPU (RTX 4090 + RTX 3090) with Q8_0‑quantized 31B Dense: pre‑fill 9,024 token/s, and 2,537 token/s at the full 262K context.

TurboQuant+ weight compression shrinks 31B from 30.4 GB to 18.9 GB.

NVIDIA optimizations

Google and NVIDIA co‑optimized Gemma 4 for RTX GPUs, DGX Spark, and Jetson Orin Nano. Benchmarks using llama‑bench (Q4_K_M, batch 1, input 4096, output 128) show strong throughput on RTX 5090 and Apple M3 Ultra.
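
Those conditions map directly onto llama-bench flags. The sketch below assumes a locally downloaded Q4_K_M GGUF (the filename is a placeholder) and offloads all layers to the GPU:

llama-bench -m gemma-4-26b-a4b-it-Q4_K_M.gguf -p 4096 -n 128 -ngl 99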

Community feedback

Developer Simon Willison reported that 2B, 4B, and 26B MoE run fine in LM Studio, but the 31B Dense loops on "---\n". He also noted audio input is not yet supported in LM Studio or Ollama.

Summary of strengths and weaknesses

Strengths:

Apache 2.0 license enables unrestricted commercial use.

Exceptional parameter efficiency—31B rivals much larger models.

MoE version offers high performance on a single 4090.

Native multimodal, tool‑call, and long‑context support ready for agent development.

Edge models run on phones and Raspberry Pi with ~6 GB RAM.

Broad ecosystem: Ollama, llama.cpp, vLLM, and MLX all offer Day‑1 support.

TurboQuant+ cuts 31B weight to 18.9 GB and reduces KV cache by 63%.

Weaknesses:

Compared to Qwen 3.5 27B, Gemma 4 lags slightly on many benchmark items.

The small models’ tool‑call ability trails that of similarly sized Qwen models.

31B Dense still has early bugs in some inference frameworks.

Audio input currently only available via Google AI Studio.

Recommendations

Choose Gemma 4 for commercial open‑source deployments thanks to its permissive license.

Prefer the 26B MoE for local use—fast and memory‑efficient.

Use the dense 31B when maximum quality is required.

Mac developers should adopt MLX for the best experience.

Edge developers should target E2B/E4B for 6 GB‑RAM agent workloads.

Tags: Local Deployment, Ollama, llama.cpp, AI benchmarking, MLX, TurboQuant, Gemma 4
Written by Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
