Ollama 0.19 Boosts Apple Silicon LLM Inference with MLX Engine and NVFP4
Ollama 0.19 replaces its inference backend with Apple’s MLX framework, adopts NVIDIA’s NVFP4 4‑bit quantization, and introduces three cache upgrades for smoother agent interactions. Together the changes deliver up to a 93% decode‑speed increase on M5 chips while keeping accuracy comparable to cloud‑based deployments.
MLX Engine Replacement
Ollama 0.19 replaces the llama.cpp backend with Apple’s MLX framework, which runs on the Unified Memory Architecture of Apple Silicon so that CPU and GPU share the same memory pool, eliminating data copies.
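For a sense of what this path looks like outside of Ollama, the same unified‑memory execution can be driven directly from Python with the mlx-lm package. The snippet below is a minimal sketch rather than Ollama’s internal code; it assumes mlx-lm is installed and reuses the 4‑bit community checkpoint referenced in the benchmark section later in this article.
# Minimal mlx-lm sketch: model weights live in unified memory, so the GPU
# reads them in place without a separate host-to-device copy.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-8B-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory on Apple Silicon in one sentence.",
    max_tokens=100,
)
print(text)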
Official benchmarks on an M5 chip with the Qwen3.5‑35B‑A3B model show:
Prefill throughput: 1,810 tokens/s (previous 1,154 tokens/s) – a 57 % gain.
Decode throughput: 112 tokens/s (previous 58 tokens/s) – a 93 % gain.
With int4 quantization, prefill reaches 1,851 tokens/s and decode 134 tokens/s.
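The int4 figures refer to MLX’s native 4‑bit quantization. For reference, a Hugging Face checkpoint can be converted to a 4‑bit MLX model with mlx-lm’s convert helper; the sketch below uses an example 8B source model, not the 35B model from the official benchmark.
# Convert a Hugging Face checkpoint into a 4-bit (int4) MLX model.
from mlx_lm import convert

convert(
    "Qwen/Qwen3-8B",           # example source checkpoint, not the benchmark model
    mlx_path="qwen3-8b-4bit",  # output directory for the quantized weights
    quantize=True,             # 4-bit group quantization by default
)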
NVFP4 4‑bit Quantization
NVFP4 is NVIDIA’s 4‑bit floating‑point format, introduced with the Blackwell GPU architecture. It reduces quantization error with a two‑level scaling scheme: fine‑grained micro‑block scaling, in which every 16 values share an FP8 scaling factor, plus a tensor‑level FP32 scale.
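The effect of this two‑level scheme is easy to sketch in numpy. The code below is a toy simulation for intuition only, not NVIDIA’s implementation: block scales are kept in full precision instead of being rounded to FP8, and FP4 rounding is emulated with the E2M1 value grid.
# Toy simulation of NVFP4-style two-level block scaling (for intuition only).
# FP4 (E2M1) can represent the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0
FP8_E4M3_MAX = 448.0

def quantize_nvfp4_like(x, block=16):
    x = x.reshape(-1, block)
    # Level 1: a single FP32 scale for the whole tensor, chosen so that the
    # per-block scales stay inside the FP8 E4M3 range.
    tensor_scale = max(np.abs(x).max() / (FP4_MAX * FP8_E4M3_MAX), 1e-12)
    # Level 2: one scale per 16-value micro-block (FP8 in the real format,
    # kept in full precision in this toy version).
    block_scale = np.maximum(np.abs(x).max(axis=1, keepdims=True)
                             / (FP4_MAX * tensor_scale), 1e-8)
    scaled = x / (block_scale * tensor_scale)
    # Round each scaled value to the nearest representable FP4 value.
    idx = np.abs(scaled[..., None] - np.sign(scaled[..., None]) * FP4_GRID).argmin(-1)
    return np.sign(scaled) * FP4_GRID[idx], block_scale, tensor_scale

w = np.random.randn(4096).astype(np.float32)
q, block_scale, tensor_scale = quantize_nvfp4_like(w)
w_hat = (q * block_scale * tensor_scale).reshape(w.shape)
print("mean absolute quantization error:", np.abs(w - w_hat).mean())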
Accuracy comparison (higher is better):
MMLU‑PRO: FP8 85 % vs NVFP4 84 % (‑1 %).
GPQA Diamond: FP8 81 % vs NVFP4 80 % (‑1 %).
Math‑500: FP8 98 % vs NVFP4 98 % (0 %).
AIME 2024: FP8 89 % vs NVFP4 91 % (+2 %).
These results indicate that local inference with NVFP4 produces quality comparable to cloud deployments that use TensorRT‑LLM or vLLM.
Cache System Upgrades for Agent Workloads
Lower memory usage: cross‑session cache reuse allows multiple coding‑agent sessions to share system‑prompt caches.
Smart checkpoints: cache snapshots are saved at key prompt positions, increasing the probability of cache hits on subsequent requests.
Intelligent eviction: caches that contain shared prefixes are preserved even when older branches are cleared, extending their lifetime.
These changes raise cache‑hit rates for coding agents and multi‑turn dialogs, reducing overall response latency.
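To make these three behaviors concrete, here is a minimal, hypothetical sketch of prefix‑based cache reuse with checkpointing and hit‑aware eviction. It illustrates the general idea only; Ollama’s actual cache implementation differs.
# Hypothetical sketch of prefix-cache reuse with checkpoints; not Ollama's code.
from dataclasses import dataclass

@dataclass
class CacheEntry:
    tokens: tuple      # prompt-token prefix this snapshot covers
    kv_state: object   # opaque KV-cache snapshot for that prefix
    hits: int = 0

class PrefixCache:
    def __init__(self, max_entries=8):
        self.entries = []
        self.max_entries = max_entries

    def lookup(self, tokens):
        """Return the checkpoint sharing the longest prefix with the new prompt, if any."""
        best = None
        for entry in self.entries:
            n = len(entry.tokens)
            if tuple(tokens[:n]) == entry.tokens and (best is None or n > len(best.tokens)):
                best = entry
        if best is not None:
            best.hits += 1   # shared prefixes accumulate hits and survive eviction longer
        return best

    def checkpoint(self, tokens, kv_state):
        """Snapshot the cache at a key prompt position, e.g. the end of a system prompt."""
        self.entries.append(CacheEntry(tuple(tokens), kv_state))
        if len(self.entries) > self.max_entries:
            # Evict the least-used entry instead of the oldest one.
            self.entries.remove(min(self.entries, key=lambda e: e.hits))
Two coding‑agent sessions that share a system prompt would both hit the checkpoint taken at the end of that prompt, so only each session’s unique suffix needs to be prefilled.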
Getting Started
Download the preview build from https://ollama.com/download.
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4
ollama run qwen3.5:35b-a3b-coding-nvfp4
The suffix nvfp4 indicates the NVFP4 quantization format.
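Once the model has been pulled, it can also be queried programmatically through Ollama’s local REST API. The sketch below uses the standard /api/generate endpoint with the NVFP4 model tag from the commands above and assumes the Ollama server is running on its default port (11434).
# Query the locally served NVFP4 model through Ollama's REST API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3.5:35b-a3b-coding-nvfp4",
        "prompt": "Write a Python function that reverses a linked list.",
        "stream": False,   # return one JSON object instead of a token stream
    },
    timeout=300,
)
print(resp.json()["response"])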
Community Benchmark: M5 Max vs M4 Max
The open‑source inference‑speed‑tests tool (built on mlx‑lm) was used to compare two 16‑inch machines, an M5 Max and an M4 Max, each with 128 GB of memory and a 40‑core GPU.
Short prompts (512‑token limit): M5 Max is 14‑42 % faster in prompt processing and 14‑17 % faster in generation throughput.
Long prompts (~21 K tokens): generation speed is similar, but prompt processing on M5 Max is 2‑3× faster, dramatically improving first‑token latency for agent scenarios.
Running the benchmark tool
# Clone the repository
git clone https://github.com/itsmostafa/inference-speed-tests
cd inference-speed-tests
uv sync
# Run a single benchmark
uv run src/main.py mlx-community/Qwen3-8B-4bit -n 1
# Compare multiple models
uv run src/main.py mlx-community/Qwen3-8B-4bit mlx-community/Qwen3-14B-4bit
# Long‑text stress test
uv run src/main.py mlx-community/Qwen3-8B-4bit \
    --dataset cnn_dailymail --dataset-config 3.0.0 --dataset-field article
Results are saved under the results/ directory and include prompt TPS, generation TPS, TTFT, peak memory, total time, etc.
Summary
MLX engine replacement yields 57-93% throughput improvements on Apple Silicon.
NVFP4 4‑bit quantization keeps accuracy loss minimal, aligning local results with cloud deployments.
Cache system overhaul makes coding agents and multi‑turn dialogs noticeably smoother.