Ollama 0.19 Boosts Apple Silicon LLM Inference with MLX Engine and NVFP4

Ollama 0.19 replaces its inference backend with Apple’s MLX framework and adopts NVIDIA’s NVFP4 4‑bit quantization, delivering up to a 93% decode‑throughput increase on M5 chips while keeping accuracy comparable to cloud deployments. The release also adds three cache upgrades for smoother agent interactions.


MLX Engine Replacement

Ollama 0.19 replaces the llama.cpp backend with Apple’s MLX framework. MLX builds on Apple Silicon’s unified memory architecture, in which the CPU and GPU share a single memory pool, eliminating data copies between them.
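
A minimal MLX sketch (separate from Ollama; assumes pip install mlx on an Apple Silicon Mac) illustrates the idea: the same array can be consumed by both GPU and CPU streams with no explicit transfer.

import mlx.core as mx  # Apple's MLX array framework

a = mx.random.normal((4096, 4096))  # allocated once in unified memory

gpu_out = mx.matmul(a, a)            # runs on the default GPU stream
with mx.stream(mx.cpu):              # switch execution to the CPU...
    cpu_out = mx.matmul(a, a)        # ...same buffer, no host/device copy
mx.eval(gpu_out, cpu_out)            # force the lazy computations to run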

Official benchmarks on an M5 chip with the Qwen3.5‑35B‑A3B model show:

Prefill throughput: 1,810 tokens/s (previously 1,154 tokens/s) – a 57% gain.

Decode throughput: 112 tokens/s (previously 58 tokens/s) – a 93% gain.

With int4 quantization, prefill reaches 1,851 tokens/s and decode 134 tokens/s.
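
For reference, the percentage gains are just the throughput ratios:

# gain = new_throughput / old_throughput - 1
prefill_gain = 1810 / 1154 - 1   # ≈ 0.57 → a 57% gain
decode_gain = 112 / 58 - 1       # ≈ 0.93 → a 93% gain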

NVFP4 4‑bit Quantization

NVFP4 is NVIDIA’s 4‑bit floating‑point format, introduced with the Blackwell GPU architecture. It reduces quantization error with a two‑level scaling scheme: fine‑grained micro‑block scaling, in which every 16 values share an FP8 scaling factor, plus a single tensor‑level FP32 scale.
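
A toy NumPy sketch of the two‑level scaling idea (the real NVFP4 bit layouts, rounding modes, and FP8 scale encoding differ; the grid below is the set of magnitudes an FP4 E2M1 value can represent):

import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block, tensor_scale):
    """Fake-quantize one 16-value micro-block with a shared block scale."""
    assert block.size == 16
    # Level 1: per-block scale maps the block max onto FP4's max (6.0).
    # Real NVFP4 stores this scale in FP8 (E4M3); here it stays float.
    block_scale = max(np.abs(block).max() / (6.0 * tensor_scale), 1e-12)
    scaled = block / (block_scale * tensor_scale)
    # Snap each value to the nearest representable FP4 magnitude.
    nearest = E2M1[np.abs(np.abs(scaled)[:, None] - E2M1).argmin(axis=1)]
    return np.sign(scaled) * nearest * block_scale * tensor_scale

x = np.random.randn(16).astype(np.float32)
# Level 2: a single tensor-wide FP32 scale (identity in this tiny example).
print(np.abs(x - quantize_block(x, 1.0)).max())  # worst-case block error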

Accuracy comparison (higher is better):

MMLU‑PRO: FP8 85% vs NVFP4 84% (-1 pt).

GPQA Diamond: FP8 81% vs NVFP4 80% (-1 pt).

Math‑500: FP8 98% vs NVFP4 98% (0 pt).

AIME 2024: FP8 89% vs NVFP4 91% (+2 pt).

These results indicate that local inference with NVFP4 produces quality comparable to cloud deployments that use TensorRT‑LLM or vLLM.

Cache System Upgrades for Agent Workloads

Lower memory usage: cross‑session cache reuse allows multiple coding‑agent sessions to share system‑prompt caches.

Smart checkpoints: cache snapshots are saved at key prompt positions, increasing the probability of cache hits on subsequent requests.

Intelligent eviction: caches that contain shared prefixes are preserved even when older branches are cleared, extending their lifetime.

These changes raise cache‑hit rates for coding agents and multi‑turn dialogs, reducing overall response latency.
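
As a concept sketch (not Ollama’s actual implementation; all names here are hypothetical), a prefix cache with checkpoints and refcount‑based eviction might look like this:

class PrefixCache:
    """Toy prompt-prefix cache with checkpoints and prefix-aware eviction."""

    def __init__(self):
        # Map: token prefix (tuple) -> number of sessions still sharing it.
        self.checkpoints = {}

    def save_checkpoint(self, tokens):
        """Snapshot the cache state at a key prompt position."""
        key = tuple(tokens)
        self.checkpoints[key] = self.checkpoints.get(key, 0) + 1

    def longest_hit(self, tokens):
        """Find the longest checkpointed prefix of a new request."""
        tokens = tuple(tokens)
        hits = [p for p in self.checkpoints if tokens[:len(p)] == p]
        return max(hits, key=len, default=())

    def evict_branch(self, tokens):
        """Drop one session's branch; shared prefixes survive."""
        key = tuple(tokens)
        if self.checkpoints.get(key, 0) > 1:
            self.checkpoints[key] -= 1   # another session still needs it
        else:
            self.checkpoints.pop(key, None)

# Two coding-agent sessions share one system prompt.
cache = PrefixCache()
system = ["<sys>", "You", "are", "a", "coding", "agent"]
cache.save_checkpoint(system)   # session A
cache.save_checkpoint(system)   # session B
cache.evict_branch(system)      # session A ends...
print(cache.longest_hit(system + ["user", "turn"]))  # ...B still hits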

Getting Started

Download the preview build from https://ollama.com/download.

ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4
ollama run qwen3.5:35b-a3b-coding-nvfp4

The suffix nvfp4 indicates the NVFP4 quantization format.
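
Once pulled, the model can also be called programmatically. A minimal example using the official ollama Python client (pip install ollama) with the tag from the commands above:

import ollama  # official Ollama Python client

# Stream a completion from the NVFP4-quantized model.
stream = ollama.chat(
    model="qwen3.5:35b-a3b-coding-nvfp4",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)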

Community Benchmark: M5 Max vs M4 Max

The open‑source inference‑speed‑tests tool (based on mlx‑lm) was used to compare two 16‑inch MacBook Pros, each with 128 GB of memory and a 40‑core GPU.

Short prompts (512‑token limit): the M5 Max is 14–42% faster at prompt processing and 14–17% faster in generation throughput.

Long prompts (~21K tokens): generation speed is similar, but prompt processing on the M5 Max is 2–3× faster, dramatically improving first‑token latency for agent scenarios.

Running the benchmark tool

# Clone the repository
git clone https://github.com/itsmostafa/inference-speed-tests
cd inference-speed-tests
uv sync

# Run a single benchmark
uv run src/main.py mlx-community/Qwen3-8B-4bit -n 1

# Compare multiple models
uv run src/main.py mlx-community/Qwen3-8B-4bit mlx-community/Qwen3-14B-4bit

# Long‑text stress test
uv run src/main.py mlx-community/Qwen3-8B-4bit \
    --dataset cnn_dailymail --dataset-config 3.0.0 --dataset-field article

Results are saved under the results/ directory and include prompt TPS, generation TPS, time‑to‑first‑token (TTFT), peak memory, total time, and other metrics.
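
A small post‑processing sketch; the file layout and key names under results/ are assumptions, so adjust them to whatever the tool actually writes:

import json
import pathlib

# Hypothetical layout: one JSON file per run under results/.
for path in sorted(pathlib.Path("results").glob("*.json")):
    run = json.loads(path.read_text())
    # Keys below are illustrative guesses, not the tool's documented schema.
    print(path.name, run.get("generation_tps"), run.get("ttft"))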

Summary

MLX engine replacement yields double‑digit speed improvements on Apple Silicon.

NVFP4 4‑bit quantization keeps accuracy loss minimal, aligning local results with cloud deployments.

Cache system overhaul makes coding agents and multi‑turn dialogs noticeably smoother.

