Google Pushes Full Throttle: Run Gemma 4 Large Models Locally with MTP Acceleration
Google’s Gemma 4 QAT release shrinks the mobile‑optimized E2B model to under 1 GB and lets the 26B‑parameter MoE model run on a 16 GB MacBook, while preserving quality through Quantization‑Aware Training and offering a full toolchain for local deployment.
Overview
Google has extended the Gemma 4 series with Quantization‑Aware Training (QAT) checkpoints: the E2B variant fits inside 1 GB, and the 26B MoE model runs on a laptop with only 16 GB of RAM, opening the door to true local deployment of large language models.
Key Highlights
E2B 1 GB runtime: By removing per‑layer embeddings, the pure‑text version occupies less than 1 GB, making inference feasible on mobile devices.
26B MoE on a notebook: Previously requiring a workstation, the 26B‑A4B model now runs on a 16 GB MacBook (Unsloth measured 15 GB usage).
Quality largely retained: QAT reduces memory while keeping accuracy higher than post‑training quantization (PTQ).
Mobile‑Specific Quantization Scheme
Google designed a dedicated quantization schema for mobile, addressing four dimensions:
Static activations: Scale parameters are pre‑computed during training, eliminating a runtime computation step on the device.
Channel‑wise quantization: Aligns compressed data structures with mobile accelerator hardware, allowing native execution without a slow compatibility path.
Targeted 2‑bit quantization: Aggressively quantizes token‑generation layers to 2‑bit while preserving higher precision in core inference layers.
Embedding and KV‑cache optimization: Separately compresses the vocabulary and short‑term memory to prevent memory blow‑up in long conversations.
The principle is to “save wherever possible while keeping the essential parts untouched,” resulting in fast inference on phone chips.
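The channel‑wise idea above can be illustrated with a small sketch. It is not Google's mobile schema, whose details the article does not give. It assumes plain symmetric integer quantization of a weight matrix with one scale per row (channel), computed once ahead of time, and every function name here is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def channel_scales(weights: Matrix, bits: int) -> List[float]:
    """Compute one symmetric scale per row (channel), ahead of inference."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 1 for 2-bit
    scales = []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Guard against an all-zero row, which would give a zero scale.
        scales.append(max_abs / qmax if max_abs > 0 else 1.0)
    return scales


def quantize(weights: Matrix, scales: List[float], bits: int) -> List[List[int]]:
    """Round each weight to the nearest integer level and clamp to the bit range."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [
        [max(qmin, min(qmax, round(w / s))) for w in row]
        for row, s in zip(weights, scales)
    ]


def dequantize(q: List[List[int]], scales: List[float]) -> Matrix:
    """Map integer levels back to approximate float weights."""
    return [[v * s for v in row] for row, s in zip(q, scales)]


# Two channels whose magnitudes differ by a factor of 8.
W = [[3.5, -1.0, 0.5], [0.4375, -0.125, 0.0625]]

per_channel = channel_scales(W, bits=4)
print(quantize(W, per_channel, bits=4))   # [[7, -2, 1], [7, -2, 1]]

# A single shared scale (taken from the largest row) flattens the small channel.
shared = [per_channel[0]] * len(W)
print(quantize(W, shared, bits=4)[1])     # [1, 0, 0]
```

With a scale per channel, the small‑magnitude row keeps its full integer range; with one shared scale, two of its three weights round to zero. The same arithmetic shows why 2‑bit quantization (only the levels -2 to 1 in this sketch) is so much more aggressive than 4‑bit.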
Installation
The quickest way to try the models is with Ollama:
# Edge‑small models (laptop/phone)
ollama run gemma4:e2b
ollama run gemma4:e4b
# Workstation inference
ollama run gemma4:12b
ollama run gemma4:26b
ollama run gemma4:31b
LM Studio also supports immediate download, and the GGUF format can be fed directly to llama.cpp or vLLM. Non‑quantized checkpoints can be converted to the desired format.
Recommended Runtime Parameters
temperature = 1.0
top_p = 0.95
top_k = 64
Context length depends on the model:
E2B, E4B: 128 K tokens
12B, 26B‑A4B, 31B: 256 K tokens
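As a hedged sketch of how the sampling parameters and context limits above might be passed to a local Ollama server, the helper below builds a request body for Ollama's /api/generate endpoint. The model tags are the ones used in this article, the helper name and the default context size are hypothetical, and treating 1 K as 1024 tokens is an assumption.

```python
import json

# Context limits from the article, in tokens (assuming 1 K = 1024).
CONTEXT_LIMITS = {
    "gemma4:e2b": 128 * 1024,
    "gemma4:e4b": 128 * 1024,
    "gemma4:12b": 256 * 1024,
    "gemma4:26b": 256 * 1024,
    "gemma4:31b": 256 * 1024,
}


def build_generate_request(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Build a JSON-serializable body with the recommended sampling settings."""
    limit = CONTEXT_LIMITS[model]
    if num_ctx > limit:
        raise ValueError(f"{model} supports at most {limit} tokens of context")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 1.0,
            "top_p": 0.95,
            "top_k": 64,
            "num_ctx": num_ctx,  # larger windows cost more memory
        },
    }


body = build_generate_request("gemma4:e2b", "Summarize QAT in one sentence.")
print(json.dumps(body, indent=2))
```

The body can then be sent as a POST to a running server, which by default listens at http://localhost:11434/api/generate. Requesting the full context window increases memory use beyond the figures quoted for the weights alone, so a smaller num_ctx is a sensible starting point on a 16 GB machine.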
Hardware requirements (RAM + VRAM) reported by Unsloth:
Gemma 4 E2B QAT – 3 GB
Gemma 4 E4B QAT – 5 GB
Gemma 4 12B QAT – 7 GB
Gemma 4 26B‑A4B QAT – 15 GB
Gemma 4 31B QAT – 18 GB
Practical implications:
A 16 GB MacBook can run the 26B MoE model.
A 24 GB GPU can handle the 31B dense model.
A 16 GB Mac mini can comfortably run the 12B model with headroom.
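The figures above lend themselves to a quick lookup. The sketch below picks the most memory‑hungry model that still fits a given budget, using only the Unsloth numbers quoted in this article; the function name and the optional reserve for the operating system and context are hypothetical.

```python
from typing import Optional

# Combined RAM + VRAM requirement in GB, as reported by Unsloth in the article.
UNSLOTH_MEMORY_GB = {
    "E2B": 3,
    "E4B": 5,
    "12B": 7,
    "26B-A4B": 15,
    "31B": 18,
}


def largest_model_that_fits(total_gb: float, reserve_gb: float = 0.0) -> Optional[str]:
    """Return the model with the highest memory requirement within the budget."""
    budget = total_gb - reserve_gb
    fitting = [(need, name) for name, need in UNSLOTH_MEMORY_GB.items() if need <= budget]
    return max(fitting)[1] if fitting else None


print(largest_model_that_fits(16))                # 26B-A4B
print(largest_model_that_fits(24))                # 31B
print(largest_model_that_fits(16, reserve_gb=4))  # 12B
```

On a 16 GB machine the 26B‑A4B model leaves only about 1 GB for everything else, which is why reserving a few gigabytes points to the 12B model instead.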
Toolchain Overview
llama.cpp / GGUF users: Pull the Q4_0 GGUF checkpoint.
Desktop GUI: LM Studio, Ollama.
Edge deployment: LiteRT‑LM (Google’s lightweight runtime).
Browser inference: Transformers.js + ONNX.
Service deployment: vLLM, SGLang.
Apple Silicon: MLX community build.
Fine‑tuning: Hugging Face Transformers + Unsloth.
Unsloth’s Additional Optimizations
Unsloth found that converting the official QAT BF16 checkpoint directly to llama.cpp’s Q4_0 format loses accuracy, because llama.cpp stores its quantization scales in F16 while QAT training uses BF16 scales.
For the 26B‑A4B model, naïve conversion yields a top‑1 accuracy of 70.2 %, whereas Unsloth’s dynamic quantization alignment reaches 85.6 % (a 15.4‑point gain) while reducing file size by 200 MB.
Size reductions across models are roughly 72 % compared to the original BF16 checkpoints:
E2B: 2.62 GB vs 9.31 GB (71.86 % saved)
E4B: 4.22 GB vs 15.1 GB (72.05 % saved)
12B: 6.72 GB vs 23.8 GB (71.76 % saved)
26B‑A4B: 14.2 GB vs 50.5 GB (71.88 % saved)
31B: 17.3 GB vs 61.4 GB (71.82 % saved)
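The savings percentages above follow directly from the two sizes in each row. A minimal check, using only the figures listed here (the helper name is hypothetical):

```python
# (quantized size, original BF16 size) in GB, as listed in the article.
SIZES_GB = {
    "E2B": (2.62, 9.31),
    "E4B": (4.22, 15.1),
    "12B": (6.72, 23.8),
    "26B-A4B": (14.2, 50.5),
    "31B": (17.3, 61.4),
}


def percent_saved(quantized_gb: float, original_gb: float) -> float:
    """Share of the original size removed by quantization, in percent."""
    return round((1 - quantized_gb / original_gb) * 100, 2)


for name, (quantized, original) in SIZES_GB.items():
    print(f"{name}: {percent_saved(quantized, original)} % saved")
```

Each row reproduces the stated value to two decimal places, so a quantized checkpoint takes up roughly 28 % of its BF16 original.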
Unsloth labels its refined format UD‑Q4_K_XL, which offers better quality than the plain Q4_0.
Takeaways
Local LLM enthusiasts should upgrade to the new QAT models, especially on 16 GB machines where 26B‑level inference becomes feasible.
Mobile developers can achieve sub‑1 GB memory footprints with the mobile‑specific QAT and LiteRT‑LM.
Do not directly convert QAT weights to llama.cpp Q4_0; use the official GGUF or Unsloth’s UD series to avoid accuracy loss.
The mobile‑only format is currently tied to LiteRT‑LM; other runtimes still rely on the generic Q4_0 version.
E2B/E4B excel in low‑memory scenarios but lag behind larger models in raw performance; choose models wisely rather than being swayed solely by the 1 GB claim.
Overall, Gemma 4’s QAT update is the most substantial improvement to date, turning “small models can run” into “large models can run on your computer,” especially when combined with prior MTP and 12B updates.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
