Can Gemma 4 on a MacBook Pro or NVIDIA Blackwell Replace Cloud LLMs? A Hands‑On Performance Study
The author benchmarks Gemma 4 locally on a 24 GB M4 Pro MacBook Pro (llama.cpp) and on a Dell GB10 with an NVIDIA Blackwell GPU (Ollama), comparing token speed, tool-call reliability, and task completion against cloud GPT‑5.4. The Mac is faster per token, but the Blackwell system achieves higher first-pass success with fewer retries, and the jump from Gemma 3 to Gemma 4 dramatically improves the viability of local agentic coding.
Why I Need It
Three reasons drive the shift to a local model: cost, because heavy daily use of Codex CLI incurs substantial API fees; privacy, since some codebases must stay on‑premise; and flexibility, as cloud APIs can be rate‑limited, experience downtime, or change pricing. The author also notes that earlier Gemma versions failed at tool calls (only 6.6% success on the tau2‑bench benchmark), making them unusable for agentic coding.
How to Build a Usable Runtime
MacBook Pro. The author first tried Ollama, but version 0.20.3 streamed responses into the wrong field (tool_calls) and suffered a Flash-Attention freeze on prompts longer than ~500 tokens. Switching to llama.cpp worked, but required six carefully chosen command-line parameters:
llama-server \
-m /path/to/gemma-4-26B-A4B-it-Q4_K_M.gguf \
--port 1234 -ngl 99 -c 32768 -np 1 --jinja \
-ctk q8_0 -ctv q8_0

Key flags: -np 1 limits the server to a single slot to keep KV-cache memory in check; -ctk q8_0 -ctv q8_0 quantise the KV cache, shrinking it from 940 MB to 499 MB; --jinja enables Gemma 4’s tool-call chat template; and loading the GGUF with -m avoids the silent download of a 1.1 GB visual projector that -hf triggers, which would overflow memory.
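A quick back-of-envelope check on those cache numbers (a sketch, assuming the default f16 KV cache and ggml’s q8_0 block layout of 32 int8 values plus one f16 scale per block):

# q8_0 packs blocks of 32 int8 values plus one f16 scale:
# (32*8 + 16) / 32 = 8.5 effective bits per value, versus 16 bits for f16.
f16_cache_mb = 940                              # reported cache size at 32,768 context
q8_0_bits_per_value = (32 * 8 + 16) / 32        # 8.5
print(round(f16_cache_mb * q8_0_bits_per_value / 16))  # ~499 MB, matching the reported figure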
The Codex CLI config also needed web_search = "disabled" because the CLI sends a web_search_preview tool type that llama.cpp does not recognise. The author reached a working configuration by adjusting one parameter at a time while consulting GitHub issues.
GB10 (Dell Pro Max). The initial plan to use vLLM failed on an ABI mismatch: vLLM 0.19.0 was built against PyTorch 2.10.0, but the Blackwell platform only supports PyTorch 2.11.0+cu128, causing an ImportError. After compiling a CUDA-enabled llama.cpp from source, the author ultimately used Ollama v0.20.5, which ran successfully. The workflow involved pulling the model over an SSH tunnel (ollama pull gemma4:31b) and invoking it with codex --oss -m gemma4:31b, achieving a single successful text generation and tool call.
Benchmark
The same task—writing a Python function parse_csv_summary with error handling and running its tests—was executed on all three configurations using codex exec --full-auto. This is a practical spot check rather than a statistically rigorous benchmark.
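The article does not reproduce the prompt or a reference solution, so, purely to make the task concrete, here is a minimal sketch of the kind of function being asked for; the signature, return shape, and error handling are assumptions rather than the author’s spec.

import csv
from pathlib import Path


def parse_csv_summary(file_path: str) -> dict:
    """Summarise a CSV file: column names, row count, and empty cells per column."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such CSV file: {file_path}")
    try:
        with path.open(newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh)
            if reader.fieldnames is None:
                raise ValueError(f"Empty CSV file: {file_path}")
            rows = list(reader)
    except UnicodeDecodeError as exc:
        raise ValueError(f"Not valid UTF-8 text: {file_path}") from exc
    # Count blank or missing values per column.
    empty_cells = {name: 0 for name in reader.fieldnames}
    for row in rows:
        for name in reader.fieldnames:
            if not (row.get(name) or "").strip():
                empty_cells[name] += 1
    return {
        "columns": list(reader.fieldnames),
        "row_count": len(rows),
        "empty_cells": empty_cells,
    }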
Results:
Cloud GPT‑5.4 produced typed code with robust exception handling; all five tests passed in 65 seconds.
GB10’s 31 B dense model generated untyped but functionally correct code; after three tool calls, all tests passed in 7 minutes.
Mac’s 26 B MoE model emitted redundant code and required ten tool calls and multiple retries; it took 4 minutes 42 seconds, with several syntax errors (e.g., misspelled file_path as filerypt, stray spaces in encoding=' 'utf-8', and incorrect fileint(file_path)).
Performance Data and Why Mac Speed Exceeded Expectations
Running llama-bench with identical context lengths showed the Mac generating tokens 5.1× faster than the GB10, despite both having 273 GB/s LPDDR5X memory bandwidth. The MoE architecture explains the gap: the 31 B dense model reads all 31.2 B parameters per token (≈17.4 GB), while the 26 B MoE model activates only 3.8 B parameters (≈1.9 GB) after Q4 quantisation. The Mac therefore processes 1.9 GB per token at 52 tok/s, whereas the GB10 processes 17.4 GB per token at 10 tok/s.
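As a sanity check on that arithmetic (a rough sketch assuming generation is purely memory-bandwidth-bound, which it never quite is in practice):

bandwidth_gb_s = 273             # LPDDR5X bandwidth, identical on both machines
moe_gb_per_token = 1.9           # ~3.8 B active parameters at Q4 (Mac, MoE)
dense_gb_per_token = 17.4        # ~31.2 B parameters at Q4 (GB10, dense)
print(bandwidth_gb_s / moe_gb_per_token)    # ~143 tok/s ceiling; 52 tok/s measured
print(bandwidth_gb_s / dense_gb_per_token)  # ~15.7 tok/s ceiling; 10 tok/s measured

Neither machine reaches its theoretical ceiling, but the roughly 9× gap between the two ceilings is what the observed 5.1× generation gap reflects.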
Prompt processing told a different story: with an 8 K context the Mac achieved 531 tok/s versus 548 tok/s on the GB10, essentially parity, suggesting that sparse activation’s bandwidth advantage largely disappears in the compute-bound prefill phase.
What Changed My Mind
Initially the author assumed token‑generation speed would dominate the experience. Although the Mac was 5.1× faster per token, the overall task completed only 30 % sooner (4 min 42 s vs. 6 min 59 s) because the Mac required many more retries and tool calls. The cloud model completed all attempts in 65 seconds with perfect first‑pass reliability, highlighting that reliability can outweigh raw speed.
The quality jump from Gemma 3 (6.6 % tool‑call success) to Gemma 4 (86.4 %) made local agentic coding feasible. The author notes that the observed performance depends on the specific Q4_K_M quantisation and 24 GB memory constraints; results may differ on larger Apple Silicon machines or with higher quantisation levels.
If You Plan to Try
Practical tips:
On Apple Silicon, use llama.cpp with --jinja rather than Ollama (a quick tool-call smoke test follows this list).
Set web_search = "disabled" in the Codex CLI config.
Specify the GGUF file with -m, not -hf, to avoid unwanted downloads.
Use a context window of 32,768 tokens (required by the system prompt) and quantise the KV cache with -ctk q8_0 -ctv q8_0.
On the GB10, run codex --oss -m gemma4:31b and forward port 11434 via SSH if the device is remote.
Increase stream_idle_timeout_ms to at least 1,800,000 to prevent premature termination of long tool‑call cycles (the Mac averages 1 min 39 s per cycle).
Lock the llama.cpp build version; different builds can cause up to a 3.3× performance regression.
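Before pointing Codex CLI at the server, it can save time to confirm that tool calls come back in the tool_calls field rather than buried in content. A minimal smoke test against llama-server’s OpenAI-compatible chat-completions endpoint (assumes the server was started as in the command above, on port 1234; the get_weather tool is a made-up example, not part of the article):

import json
import urllib.request

payload = {
    "model": "gemma-4-26B-A4B-it",  # llama-server serves a single model; the name is informational
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

req = urllib.request.Request(
    "http://localhost:1234/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    message = json.load(resp)["choices"][0]["message"]

# With --jinja and a tool-call-capable template, tool_calls should be populated.
print(message.get("tool_calls") or message.get("content"))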
Benchmarks were run on 12 April 2026 using Codex CLI v0.120.0. Mac environment: 24 GB M4 Pro MacBook Pro, llama.cpp ggml 0.9.11 (build 8680), model gemma‑4‑26B‑A4B‑it Q4_K_M. GB10 environment: Dell Pro Max GB10, 128 GB memory, NVIDIA Blackwell, Ollama v0.20.5, model gemma‑4‑31B‑it Q4_K_M. Cloud baseline: GPT‑5.4 with high inference complexity.