Gemma 4: Native Multimodal Model That Packs Large‑Model Performance into a Small Footprint
Google DeepMind's Gemma 4 family introduces four open-source models, including a 31B dense variant and a 26B MoE variant with 256K context, that deliver multimodal capabilities, native tool use, and benchmark results rivaling much larger models while running on a single H100 GPU.
Google DeepMind has released Gemma 4, an open-source multimodal model family comprising four variants: E2B (2.3B effective parameters), E4B (4.5B), a dense 31B model, and a 26B MoE model (A4B) with 4B active parameters. Both the 31B dense and the 26B A4B models support a 256K context window and can run on a single H100 GPU.
Architecturally, Gemma 4 mirrors Gemma 3, retaining the distinctive pre-norm/post-norm hybrid and a 5:1 attention mix (five sliding-window local layers for every global layer). It uses classic Grouped Query Attention (GQA) and a 262K-token vocabulary, and it doubles the maximum context length from 128K to 256K.
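To make the 5:1 mix concrete, here is a minimal sketch (Python) of how such a layer schedule could be laid out; the layer count and window size are illustrative assumptions, not confirmed Gemma 4 values:

# Sketch of a 5:1 local/global attention schedule.
# NUM_LAYERS and WINDOW are illustrative assumptions, not released values.
NUM_LAYERS = 48
PATTERN = 6    # every 6th layer uses global attention (5 local : 1 global)
WINDOW = 1024  # sliding-window span for the local layers

layer_types = [
    "global" if (i + 1) % PATTERN == 0 else f"local(window={WINDOW})"
    for i in range(NUM_LAYERS)
]
# Each block of six layers is five sliding-window layers followed by one global layer.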
Technical highlights include:
256K context window – one of the largest among open-source models, enabling whole-code-base or ultra-long-document processing in a single pass.
Native multimodal support – vision and audio are built‑in; E2B and E4B also handle local audio, making the models suitable for on‑device OCR, chart understanding, or speech interaction.
Native tool use – the models can invoke functions, emit structured JSON, and execute native system commands, providing genuine agent-level capabilities (see the sketch after this list).
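To illustrate what native tool use looks like in practice, here is a minimal sketch using the Hugging Face transformers chat-template API, which serializes a tool schema so the model can reply with a structured JSON call. The model ID mirrors the vLLM command below; get_weather is a hypothetical example tool, and the snippet assumes the released chat template accepts a tools argument, as recent transformers templates do:

# Minimal tool-use sketch with the transformers chat-template API.
# get_weather is a hypothetical example tool; the model ID mirrors the
# vLLM command later in this article.
from transformers import AutoTokenizer

def get_weather(city: str) -> str:
    """
    Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return "sunny, 22 C"  # stub for illustration

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")
messages = [{"role": "user", "content": "What is the weather in Zurich?"}]

# The template embeds the tool schema in the prompt; the model is expected
# to answer with a structured JSON function call for the runtime to execute.
prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)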
This is the first Gemma series release that truly supports multimodality beyond text and images, extending to video and, for the smaller models, audio.
In terms of benchmarks, the 31B version ranks third among open-source models on the Arena leaderboard, while the 26B MoE version sits sixth. On the GPQA Diamond scientific-reasoning benchmark, Gemma 4 31B scores 85.7%, just 0.1 percentage points behind Qwen 3.5 27B, while using only about 1.2M output tokens compared to Qwen's 1.5M, indicating higher token efficiency.
Hardware support is solid: the 31B model's bfloat16 weights fit on a single 80GB H100, and a quantized version can run on consumer-grade GPUs. The smaller E2B and E4B models have been optimized for offline execution on Pixel phones and Jetson devices, with latency low enough to be barely noticeable in interactive use.
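A quick back-of-the-envelope check on the memory claim (a rough sketch that counts only the weights and ignores KV cache and activations, which need additional headroom):

# Rough weight-memory estimate for the dense 31B model in bfloat16.
params = 31e9          # parameter count
bytes_per_param = 2    # bfloat16 uses 2 bytes per parameter
weight_gb = params * bytes_per_param / 1e9
print(f"{weight_gb:.0f} GB of weights")  # ~62 GB, within an 80 GB H100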
Ecosystem support arrived quickly. Libraries such as transformers, llama.cpp, MLX, transformers.js, and Mistral.rs added Gemma 4 compatibility. Hugging Face’s TRL was updated for multimodal tool‑call integration, and vLLM can launch the model with a single Docker command:
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:gemma4 \
--model google/gemma-4-31B-it

The release also switches the license to the more permissive Apache 2.0, allowing broader commercial use, and the model weights have been uploaded to Hugging Face.
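Once the container is up, vLLM exposes an OpenAI-compatible endpoint on port 8000. A minimal client sketch, using the default /v1 route and a placeholder API key since the command above does not set one:

# Query the vLLM OpenAI-compatible server started by the Docker command above.
# Port 8000 and the /v1 route are vLLM defaults; the api_key is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": "Summarize the main points of Apache 2.0 licensing."}],
)
print(response.choices[0].message.content)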