Testing the Hot oMLX on Mac: Claude‑Opus‑4.6 Distilled and Qwen3.5‑9B Performance Review

The article evaluates oMLX, a Mac‑only LLM runtime built on Apple Silicon and MLX, by walking through installation, UI features, memory usage, single‑request speed, benchmark results for Claude‑Opus‑4.6 and Qwen3.5‑9B, continuous batch processing gains, Claude Code optimizations, multi‑model support, and the failure to run a 27B model.


Overview of oMLX

oMLX runs on Apple Silicon using the MLX library and targets users who want a local LLM experience on macOS. It is positioned as an alternative to LM‑Studio, promising faster performance and lower memory consumption.

Key Highlights

Polished UI with menu bar, dashboard, and chat page.

SSD-backed KV cache, hot-cache settings, MCP support, one-click integration with AI coding agents, and OpenAI/Anthropic-compatible interfaces optimized for Claude Code.

Single‑request generation speed ≈ 20 tokens/s; peak RAM usage ≈ 5.7 GB.

Unable to run the Qwen3.5‑27B‑Claude‑4.6‑Opus‑Distilled‑MLX‑4bit model; LM‑Studio can load it but execution stalls.

Installation and Configuration

After installing oMLX, open the Preferences panel to set the model directory and server port. The author moved the model folder to an external hard drive. The UI provides one‑click start/stop of the server, access to the management dashboard, and a chat interface.

Model management allows browsing available models and checking compatibility with the host. The author experienced very slow download speeds with the built‑in downloader and switched to ModelScope, noting that the downloader defaults to fetching all precision variants instead of the desired Q4 version.
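
Since the built-in downloader defaulted to pulling every precision variant, one workaround (a sketch, not the author's exact steps; the repo id and paths below are placeholders) is to fetch just the desired quantization manually and point oMLX's model directory at the result. The author used ModelScope, whose SDK offers a similar snapshot download; the Hugging Face Hub client shown here exposes a pattern-filtered download I can vouch for.

```python
# Sketch: manually fetch only the 4-bit files and drop them into oMLX's model folder.
# The repo id and local path are placeholders; ModelScope's snapshot_download
# works in a very similar way if that mirror is faster for you.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mlx-community/Qwen3.5-9B-4bit",    # hypothetical 4-bit repo id
    allow_patterns=["*.json", "*.safetensors", "*.model", "*.txt"],  # only what MLX needs
    local_dir="/Volumes/External/mlx-models/Qwen3.5-9B-4bit",  # the configured model directory
)
```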

Official Benchmark Results

Model load time: 2.4 seconds

Prompt ingestion: 86.5 tokens/s

Generation speed: 15.7 tokens/s

Peak RAM usage: 15.6 GB

Quantization: 4.501 bits per weight

Final size: 14 GB (3 shards)

Single‑Request Tests

TTFT (time-to-first-token) for a 32 K-token request was high, and overall throughput was only 11 tokens/s. Increasing the input length to 4096 tokens raised TTFT from 4.8 s to 18.8 s, while throughput held at roughly 19.8 tokens/s and peak memory grew from 5.66 GB to 6.40 GB.
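
To reproduce the TTFT and throughput measurements against a local oMLX instance, a minimal sketch over the OpenAI-compatible endpoint (listed later in the article as http://localhost:8000/v1) might look like the following; the model id is a placeholder.

```python
# Sketch: measure time-to-first-token (TTFT) and generation throughput
# against oMLX's OpenAI-compatible endpoint. The model id is a placeholder.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="qwen3.5-9b-4bit",  # hypothetical id; use whatever name oMLX serves
    messages=[{"role": "user", "content": "Explain continuous batching in one paragraph."}],
    stream=True,
    max_tokens=512,
)
for chunk in stream:
    if not chunk.choices:
        continue  # some servers send a trailing usage-only chunk
    delta = chunk.choices[0].delta.content or ""
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1

end = time.perf_counter()
print(f"TTFT: {first_token_at - start:.2f} s, "
      f"~{n_chunks / (end - first_token_at):.1f} tok/s "
      "(chunk count approximates token count)")
```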

Running 2–4 concurrent requests increased total throughput noticeably, but scaling to 8 concurrent streams approached the platform's limit and incurred large latency penalties.
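
For the concurrency side, a small asyncio harness against the same endpoint is one way to see how aggregate throughput scales from 1 to 8 streams; again the model id is a placeholder, and the token counts assume the server reports usage.

```python
# Sketch: fire N concurrent requests at oMLX and report aggregate throughput.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model="qwen3.5-9b-4bit",  # hypothetical model id
        messages=[{"role": "user", "content": f"Write a haiku about request {i}."}],
        max_tokens=128,
    )
    # Assumes the server returns usage counts; otherwise count tokens client-side.
    return resp.usage.completion_tokens

async def run(concurrency: int) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(i) for i in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"{concurrency} streams: {sum(tokens) / elapsed:.1f} tok/s total")

for n in (1, 2, 4, 8):
    asyncio.run(run(n))
```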

Continuous Batch Processing

Based on the mlx‑lm BatchGenerator, benchmarks on an M3 Ultra (512 GB) with Qwen3.5‑122B‑A10B‑4bit showed:

Prompt processing speed (single request, 8 k context): 941 tok/s

Token generation speed: 54.0 tok/s

8× continuous batch throughput: 190.2 tok/s (3.36× improvement)

Peak memory usage: 73 GB

For the Qwen3‑Coder‑Next‑8bit model (8 k context):

Prompt processing speed: 2009 tok/s

8× batch throughput: 243.3 tok/s (4.14× improvement)

Peak memory usage: 85 GB
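
The continuous-batching numbers come from mlx-lm's BatchGenerator, whose interface the article does not show. As a point of reference, the single-request baseline those speedups are measured against can be reproduced with the basic mlx-lm load/generate API; this is a sketch, and the model id is a placeholder.

```python
# Sketch: single-request baseline with mlx-lm (not the BatchGenerator path).
# The model id below is a placeholder for whatever is installed locally.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-Coder-Next-8bit")  # hypothetical id
text = generate(
    model,
    tokenizer,
    prompt="Summarize continuous batching in two sentences.",
    max_tokens=256,
    verbose=True,  # prints prompt and generation tok/s, comparable to the figures above
)
```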

Claude Code Optimizations

oMLX supports context scaling for smaller-context models, compressing the context automatically when appropriate, and provides SSE keep-alive to avoid read timeouts during long prefill phases.

Two officially recommended approaches are offered:

Use context scaling to trigger automatic compression at the right moment.

Enable SSE keep‑alive to reduce the risk of timeout during prolonged pre‑fill.

Additional supported features:

OpenAI-compatible endpoint: http://localhost:8000/v1

Anthropic-compatible endpoint: POST /v1/messages

Tool calling

MCP integration
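
As a quick smoke test of the Anthropic-compatible surface listed above, a plain HTTP POST to /v1/messages is enough. The sketch below assumes the server accepts a standard Anthropic-style request body; the model id and key value are placeholders, and a local server typically ignores the key.

```python
# Sketch: call oMLX's Anthropic-compatible route directly with an HTTP POST.
# Header requirements may differ for a local server; the model id is a placeholder.
import requests

resp = requests.post(
    "http://localhost:8000/v1/messages",
    headers={
        "content-type": "application/json",
        "x-api-key": "not-needed",        # local servers typically ignore the key
        "anthropic-version": "2023-06-01",
    },
    json={
        "model": "qwen3.5-9b-4bit",       # hypothetical model id
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Say hello from the local model."}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["content"][0]["text"])
```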

Multi‑Model Service

oMLX can host multiple model types within a single service:

Text LLM

Vision LLM (VLM)

OCR model

Embedding model

Reranker

27B Model Issue

The author attempted numerous configuration changes but could not get the 27 B model to run, concluding that a Mac with at least 32 GB unified memory is required. LM‑Studio can load the model but execution stalls, causing the machine to freeze.
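
A back-of-the-envelope estimate makes the 32 GB recommendation plausible: a 27B model at roughly 4.5 bits per weight needs about 15 GB for the weights alone, before any KV cache or the rest of macOS, which leaves little headroom on a 16 GB machine. The figures below are assumptions for illustration, not measurements.

```python
# Back-of-the-envelope: unified memory needed for a 27B model at ~4.5 bits/weight.
params_b = 27           # billions of parameters
bits_per_weight = 4.5   # typical 4-bit MLX quantization including overhead (assumed)
kv_cache_gb = 2.0       # assumed allowance for a long context; varies with model/config

weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
print(f"Weights: ~{weights_gb:.1f} GB, with KV cache: ~{weights_gb + kv_cache_gb:.1f} GB")
# ~15 GB of weights alone leaves little headroom on a 16 GB Mac, hence the 32 GB advice.
```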

Conclusion

oMLX offers a visually appealing, feature-rich environment for running LLMs locally on Apple Silicon, with competitive single-request speeds and notable gains from continuous batching. However, memory constraints prevented running the very large (27B) model, and some download and configuration quirks remain.

Tags: benchmark, Mac, local LLM, Apple Silicon, MLX, Claude Opus, Qwen3.5, oMLX
Written by Old Zhang's AI Learning, an AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
