Deploy Gemma 4 Locally: Ollama, llama.cpp, MLX, vLLM + TurboQuant Optimization
The article reviews the four Gemma 4 model variants, analyzes their architecture and benchmark results versus Qwen3.5, and provides step‑by‑step instructions for local deployment using Ollama, llama.cpp, MLX and vLLM, while highlighting TurboQuant memory and weight compression techniques.
Gemma 4 releases four model sizes—31B Dense, 26B MoE, 4B (E4B) and 2.3B (E2B)—each targeting different hardware: desktop workstations, single‑card H100, mobile/Jetson/Raspberry Pi and edge devices. The "E" models use Per‑Layer Embeddings (PLE) to maximize effective parameter efficiency.
Unified capabilities
Multimodal input: all models support image and video; smaller models also accept audio and speech.
Extended context: 256K tokens for the large models, 128K for the small ones.
Agent workflow: native function calls, structured JSON output, system instructions.
140+ languages: native training covers over 140 languages.
Code generation: high-quality offline code assistance.
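The agent-workflow features above map onto the now-common OpenAI-style tool-call request shape. A minimal sketch in Python of what a client would send to an OpenAI-compatible endpoint (the `get_weather` function, its schema, and the model tag are illustrative assumptions, not from Gemma documentation):

```python
import json

# Hypothetical tool definition in the common OpenAI-style schema;
# the function name and parameters are illustrative only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Request body a client would POST to an OpenAI-compatible server
# (such as one exposed by Ollama or vLLM) to exercise function calling.
payload = {
    "model": "gemma4:26b",
    "messages": [{"role": "user", "content": "Weather in Berlin?"}],
    "tools": tools,
}

print(json.dumps(payload, indent=2))
```

The structured JSON output and system-instruction features use the same request body, via a `response_format` or `system`-role message respectively.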
Benchmark performance
Google’s official leaderboard places Gemma 4 31B third on the Arena AI text track and the 26B MoE sixth, noting that both outperform models 20× their size. Third‑party Artificial Analysis scores the 31B at 85.7% on GPQA Diamond, second only to Qwen 3.5 27B (85.8%). Token efficiency is also better: the 31B uses ~1.2 M output tokens versus 1.5‑1.6 M for the Qwen 3.5 variants.
Head‑to‑head with Qwen 3.5 27B
Item‑by‑item comparison shows Qwen 3.5 leads on most metrics, yet Gemma 4 31B matches its Elo score on the Arena AI leaderboard, indicating comparable human‑preference quality despite differing benchmark numbers.
Architecture analysis
According to AI blogger Sebastian Raschka, Gemma 4 retains a classic Pre/Post‑norm layout with a 5:1 mixed‑attention scheme (local sliding‑window + global full attention) and Grouped Query Attention (GQA). Performance gains stem mainly from improved training data and methods rather than architectural changes.
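The 5:1 mixed-attention scheme can be pictured as a layer schedule: five local sliding-window layers, then one global full-attention layer, repeated through the stack. A minimal sketch (the 48-layer depth in the example is an assumption for illustration, not a published figure):

```python
def attention_schedule(num_layers: int, ratio: int = 5) -> list[str]:
    """Per-layer attention types for a (ratio):1 local/global mix.

    Every (ratio + 1)-th layer uses global full attention; all other
    layers use local sliding-window attention.
    """
    return [
        "global" if (i + 1) % (ratio + 1) == 0 else "local"
        for i in range(num_layers)
    ]

# Example: a hypothetical 48-layer stack with the 5:1 mix.
schedule = attention_schedule(48)
print(schedule.count("local"), schedule.count("global"))  # 40 local, 8 global
```

The appeal of the mix is that only the sparse global layers pay full-context attention cost, while most layers attend over a fixed window.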
Local deployment guide
Ollama (0.20+) supports all four variants:
ollama run gemma4:e2b # 2B effective, edge
ollama run gemma4:e4b # 4B effective, mobile
ollama run gemma4:26b # 26B MoE (4B active)
ollama run gemma4:31b # 31B Dense
llama.cpp (install the latest build via Homebrew):
brew install llama.cpp --HEAD
llama-server -hf ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M
MLX (Mac) adds full‑suite support, including the visual and audio models. Install with:
uv pip install -U mlx-vlm
TurboQuant KV‑cache compression reduces memory usage dramatically (e.g., 13.3 GB → 4.9 GB for the 31B model) while preserving quality.
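As a back-of-the-envelope illustration of why lowering KV-cache precision helps, here is a simple estimator. It ignores the per-block scale and zero-point metadata that real schemes such as TurboQuant store, so it gives a lower bound and real compressed sizes (like the 4.9 GB cited above) come out larger than the raw bit ratio suggests:

```python
def kv_cache_gb(fp16_gb: float, kv_bits: float) -> float:
    """Scale a 16-bit KV-cache footprint to a lower per-entry bit width.

    Ignores quantization metadata (scales, zero points), so this is a
    lower bound on the compressed size.
    """
    return fp16_gb * kv_bits / 16.0

# E.g. a 13.3 GB fp16 cache quantized to 3.5 bits per entry:
print(round(kv_cache_gb(13.3, 3.5), 2))  # ~2.91 GB before metadata overhead
```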
Run a compressed model on Mac:
uv run mlx_vlm.generate --model google/gemma-4-31b-it --kv-bits 3.5 --kv-quant-scheme turboquant
Unsloth quantized builds enable the 2‑4B models to run with ~6 GB RAM, and the larger models with ~18 GB.
vLLM offers native multimodal support and 256K context across GPUs and TPUs. Example tool‑call benchmark shows Gemma 4 31B and Qwen 3.5 27B both achieve a perfect 15/15 score.
Real‑world performance
26B MoE on a single RTX 4090: decode 162 token/s, pre‑fill 8,400 token/s, full 262K context, 19.5 GB VRAM.
Dual‑GPU (RTX 4090 + RTX 3090) Q8_0‑quantized 31B Dense: pre‑fill 9,024 token/s, dropping to 2,537 token/s at the full 262K context.
TurboQuant+ weight compression shrinks 31B from 30.4 GB to 18.9 GB.
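The weight-compression figures above can be sanity-checked by converting file size back into effective bits per parameter. A small helper (treating GB as 10^9 bytes, since the article does not state which convention it uses):

```python
def bits_per_param(size_gb: float, n_params_billion: float) -> float:
    """Effective bits per parameter for a checkpoint of the given size."""
    return size_gb * 1e9 * 8 / (n_params_billion * 1e9)

# 18.9 GB for a 31B-parameter model:
print(round(bits_per_param(18.9, 31), 2))  # ~4.88 bits/param
```

Roughly 4.9 bits per weight sits between the common Q4 and Q5 quantization levels, which is consistent with the quality-preserving claim.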
NVIDIA optimizations
Google and NVIDIA co‑optimized Gemma 4 for RTX GPUs, DGX Spark, and Jetson Orin Nano. Benchmarks using llama‑bench (Q4_K_M, batch 1, input 4096, output 128) show strong throughput on RTX 5090 and Apple M3 Ultra.
Community feedback
Developer Simon Willison reported that 2B, 4B, and 26B MoE run fine in LM Studio, but the 31B Dense loops on "---\n". He also noted audio input is not yet supported in LM Studio or Ollama.
Summary of strengths and weaknesses
Apache 2.0 license enables unrestricted commercial use.
Exceptional parameter efficiency—31B rivals much larger models.
MoE version offers high performance on a single 4090.
Native multimodal, tool‑call, and long‑context support ready for agent development.
Edge models run on phones and Raspberry Pi with ~6 GB RAM.
Broad ecosystem: Ollama, llama.cpp, vLLM, and MLX all offer Day‑1 support.
TurboQuant+ cuts 31B weight to 18.9 GB and reduces KV cache by 63%.
Compared to Qwen 3.5 27B, Gemma 4 lags slightly on many benchmark items.
Small models’ tool‑call ability trails similarly sized Qwen models.
31B Dense still has early bugs in some inference frameworks.
Audio input currently only available via Google AI Studio.
Recommendations
Choose Gemma 4 for commercial open‑source deployments thanks to its permissive license.
Prefer the 26B MoE for local use—fast and memory‑efficient.
Use the dense 31B when maximum quality is required.
Mac developers should adopt MLX for the best experience.
Edge developers should target E2B/E4B for 6 GB‑RAM agent workloads.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
