Google Pushes Full Throttle: Run Gemma 4 Large Models Locally with MTP Acceleration

Google’s Gemma 4 QAT release shrinks the mobile‑optimized E2B model to under 1 GB and lets the 26B‑parameter MoE model run on a 16 GB MacBook, while preserving quality through Quantization‑Aware Training and offering a full toolchain for local deployment.

Old Zhang's AI Learning

Overview

Google has accelerated the Gemma 4 series with a Quantization‑Aware Training (QAT) checkpoint that fits inside 1 GB for the E2B variant and runs a 26B MoE model on a laptop with only 16 GB of RAM, opening the door for true local deployment of large language models.

Key Highlights

E2B 1 GB runtime : By removing per‑layer embeddings, the pure‑text version occupies less than 1 GB, making inference feasible on mobile devices.

26B MoE on a notebook : Previously requiring a workstation, the 26B‑A4B model now runs on a 16 GB MacBook (Unsloth measured 15 GB usage).

Quality largely retained : QAT reduces memory while keeping accuracy higher than post‑training quantization (PTQ).

Mobile‑Specific Quantization Scheme

Google designed a dedicated quantization scheme for mobile that addresses four areas:

Static activations : Scale parameters are pre‑computed during training, eliminating a runtime computation step on the device.

Channel‑wise quantization : Aligns compressed data structures with mobile accelerator hardware, allowing native execution without a slow compatibility path.

Targeted 2‑bit quantization : Aggressively quantizes token‑generation layers to 2‑bit while preserving higher precision in core inference layers.

Embedding and KV‑cache optimization : Separately compresses the vocabulary and short‑term memory to prevent memory blow‑up in long conversations.

The principle is to “save wherever possible while keeping the essential parts untouched,” resulting in fast inference on phone chips.
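To make the channel‑wise idea concrete, here is a minimal sketch of symmetric per‑channel quantization in plain Python. It is not Google's mobile scheme: the function names, the 4‑bit default, and the choice of one scale per weight row are assumptions made only for illustration.

```python
def quantize_per_channel(weights, bits=4):
    """Symmetric quantization with one scale per row (treated as one output channel)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed integers
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Avoid dividing by zero for an all-zero channel.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Round to the nearest integer level and clamp to the signed range.
        quantized.append([max(-qmax - 1, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Map integer levels back to approximate float weights."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Two channels with very different magnitudes (hypothetical values).
weights = [[3.5, -1.0, 0.0], [70.0, 30.0, -70.0]]
q, s = quantize_per_channel(weights)
print(q, s)
```

With a single scale shared by the whole tensor, the small‑magnitude first row would collapse almost entirely to zero; giving each channel its own scale keeps both rows usable at the same bit width.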

Installation

The quickest way to try the models is with Ollama:

# Edge‑small models (laptop/phone)
ollama run gemma4:e2b
ollama run gemma4:e4b

# Workstation inference
ollama run gemma4:12b
ollama run gemma4:26b
ollama run gemma4:31b

LM Studio also supports immediate download, and the GGUF format can be fed directly to llama.cpp or vLLM. Non‑quantized checkpoints can be converted to the desired format.

Recommended Runtime Parameters

temperature = 1.0
top_p = 0.95
top_k = 64

Context length depends on the model:

E2B, E4B: 128 K tokens

12B, 26B‑A4B, 31B: 256 K tokens
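As a sketch of how these settings might be passed to a locally running Ollama server, the snippet below builds a request for Ollama's `/api/generate` endpoint using its `options` field. The default URL, the conservative `num_ctx` default of 8192, and the reading of "128 K" and "256 K" as multiples of 1024 are assumptions; the model tags are the ones listed in the installation section.

```python
import json
import urllib.request

# Sampling parameters recommended above.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}

# Maximum context per model tag, assuming 1 K = 1024 tokens.
MAX_CONTEXT = {
    "gemma4:e2b": 128 * 1024,
    "gemma4:e4b": 128 * 1024,
    "gemma4:12b": 256 * 1024,
    "gemma4:26b": 256 * 1024,
    "gemma4:31b": 256 * 1024,
}


def build_request(model, prompt, num_ctx=8192):
    """Build the JSON payload; reject context sizes beyond the model's limit."""
    if num_ctx > MAX_CONTEXT[model]:
        raise ValueError(f"{model} supports at most {MAX_CONTEXT[model]} tokens")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {**SAMPLING, "num_ctx": num_ctx},
    }


def generate(payload, url="http://localhost:11434/api/generate"):
    """Send the payload to a running Ollama server and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `generate(build_request("gemma4:e2b", "Hello"))` requires the model to be pulled and the server to be running. A larger `num_ctx` raises memory use, so the full context length may not fit within the memory figures quoted in this article.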

Hardware requirements (RAM + VRAM) reported by Unsloth:

Gemma 4 E2B QAT – 3 GB

Gemma 4 E4B QAT – 5 GB

Gemma 4 12B QAT – 7 GB

Gemma 4 26B‑A4B QAT – 15 GB

Gemma 4 31B QAT – 18 GB

Practical implications:

A 16 GB MacBook can run the 26B MoE model.

A 24 GB GPU can handle the 31B dense model.

A 16 GB Mac mini can comfortably run the 12B model with headroom.
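The figures above can be turned into a small lookup that picks the largest model for a given memory budget. This is only a sketch: the function name and the optional `reserve_gb` headroom parameter are hypothetical, and the numbers are the Unsloth figures quoted above, not independent measurements.

```python
# Combined RAM + VRAM requirement in GB, as reported by Unsloth (see list above).
REQUIRED_GB = {
    "E2B": 3,
    "E4B": 5,
    "12B": 7,
    "26B-A4B": 15,
    "31B": 18,
}


def largest_model_that_fits(total_gb, reserve_gb=0):
    """Return the most demanding model that fits in total_gb minus reserve_gb, or None."""
    budget = total_gb - reserve_gb
    candidates = [name for name, need in REQUIRED_GB.items() if need <= budget]
    if not candidates:
        return None
    return max(candidates, key=lambda name: REQUIRED_GB[name])


print(largest_model_that_fits(16))                 # 16 GB MacBook
print(largest_model_that_fits(24))                 # 24 GB GPU
print(largest_model_that_fits(16, reserve_gb=4))   # leave 4 GB for the OS and apps
```

Reserving headroom matters in practice: on a 16 GB machine the 26B‑A4B model leaves only about 1 GB for everything else, which is why the 12B model is the more comfortable choice there.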

Toolchain Overview

llama.cpp / GGUF users : Pull the Q4_0 GGUF checkpoint.

Desktop GUI : LM Studio, Ollama.

Edge deployment : LiteRT‑LM (Google’s lightweight runtime).

Browser inference : Transformers.js + ONNX.

Service deployment : vLLM, SGLang.

Apple Silicon : MLX community build.

Fine‑tuning : Hugging Face Transformers + Unsloth.

Unsloth’s Additional Optimizations

Unsloth discovered that directly converting the official QAT BF16 checkpoint to llama.cpp’s Q4_0 format drops accuracy, because llama.cpp stores the quantization scale in F16 while QAT training uses a BF16 scale.
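The snippet below illustrates only the underlying number‑format difference, not Unsloth's alignment procedure or llama.cpp's converter: the same value rounds to different results in F16 (10 mantissa bits) and BF16 (7 mantissa bits). The helper names and the sample scale of 0.1 are hypothetical.

```python
import struct


def round_to_f16(x):
    """Round x to the nearest IEEE 754 half-precision (F16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x):
    """Round x to the nearest bfloat16 value.

    bfloat16 is the top 16 bits of a float32, so we round the float32 bit
    pattern to nearest-even at bit 16. NaN and infinity are not handled here.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


scale = 0.1  # hypothetical quantization scale, chosen only for illustration
print("F16 :", round_to_f16(scale))
print("BF16:", round_to_bf16(scale))
```

The two printed values differ, so a scale that was tuned in one format does not automatically match a scale derived in the other. How much this affects model accuracy is what Unsloth's measurements below address.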

For the 26B‑A4B model, naïve conversion yields a top‑1 accuracy of 70.2 %, whereas Unsloth’s dynamic quantization alignment reaches 85.6 % (a 15.4‑point gain) while reducing file size by 200 MB.

Size reductions across models are roughly 72 % compared to the original BF16 checkpoints:

E2B: 2.62 GB vs 9.31 GB (71.86 % saved)

E4B: 4.22 GB vs 15.1 GB (72.05 % saved)

12B: 6.72 GB vs 23.8 GB (71.76 % saved)

26B‑A4B: 14.2 GB vs 50.5 GB (71.88 % saved)

31B: 17.3 GB vs 61.4 GB (71.82 % saved)
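The savings percentages follow directly from the two sizes, as 1 minus the ratio of quantized to original size. A short check, using only the figures listed above (the function name is an illustrative choice):

```python
# (quantized size in GB, original BF16 size in GB), taken from the list above.
SIZES_GB = {
    "E2B": (2.62, 9.31),
    "E4B": (4.22, 15.1),
    "12B": (6.72, 23.8),
    "26B-A4B": (14.2, 50.5),
    "31B": (17.3, 61.4),
}


def percent_saved(quantized_gb, original_gb):
    """Share of the original size that quantization removes, in percent."""
    return (1 - quantized_gb / original_gb) * 100


for name, (quantized, original) in SIZES_GB.items():
    print(f"{name}: {percent_saved(quantized, original):.2f} % saved")
```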

Unsloth labels its refined format UD‑Q4_K_XL, which offers better quality than the plain Q4_0.

Takeaways

Local LLM enthusiasts should upgrade to the new QAT models, especially on 16 GB machines where 26B‑level inference becomes feasible.

Mobile developers can achieve sub‑1 GB memory footprints with the mobile‑specific QAT and LiteRT‑LM.

Do not directly convert QAT weights to llama.cpp Q4_0; use the official GGUF or Unsloth’s UD series to avoid accuracy loss.

The mobile‑only format is currently tied to LiteRT‑LM; other runtimes still rely on the generic Q4_0 version.

E2B/E4B excel in low‑memory scenarios but lag behind larger models in raw performance; choose models wisely rather than being swayed solely by the 1 GB claim.

Overall, Gemma 4’s QAT update is the most substantial improvement to date, turning “small models can run” into “large models can run on your computer,” especially when combined with prior MTP and 12B updates.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
