Hands‑On with oMLX on Mac: A Performance Review of Claude‑Opus‑4.6‑Distilled and Qwen3.5‑9B
The article evaluates oMLX, a Mac‑only LLM runtime built on Apple Silicon and MLX, by walking through installation, UI features, memory usage, single‑request speed, benchmark results for Claude‑Opus‑4.6 and Qwen3.5‑9B, continuous batch processing gains, Claude Code optimizations, multi‑model support, and the failure to run a 27B model.
Overview of oMLX
oMLX runs on Apple Silicon using the MLX library and targets users who want a local LLM experience on macOS. It is positioned as an alternative to LM‑Studio, promising faster performance and lower memory consumption.
Key Highlights
Polished UI with menu bar, dashboard, and chat page.
Underlying SSD KV cache, hot‑cache settings, MCP support, one‑click integration with AI coding agents, and OpenAI/Anthropic compatible interfaces optimized for Claude Code.
Single‑request generation speed ≈ 20 tokens/s; peak RAM usage ≈ 5.7 GB.
Unable to run the Qwen3.5‑27B‑Claude‑4.6‑Opus‑Distilled‑MLX‑4bit model; LM‑Studio can load it but execution stalls.
Installation and Configuration
After installing oMLX, open the Preferences panel to set the model directory and server port. The author moved the model folder to an external hard drive. The UI provides one‑click start/stop of the server, access to the management dashboard, and a chat interface.
Model management allows browsing available models and checking compatibility with the host. The author experienced very slow download speeds with the built‑in downloader and switched to ModelScope, noting that the downloader defaults to fetching all precision variants instead of the desired Q4 version.
Official Benchmark Results
Model load time: 2.4 seconds
Prompt ingestion: 86.5 tokens/s
Generation speed: 15.7 tokens/s
Peak RAM usage: 15.6 GB
Bit‑rate: 4.501 bits/weight
Final size: 14 GB (3 shards)
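As a sanity check, the reported bit‑rate and on‑disk size are mutually consistent. A small back‑of‑the‑envelope helper (hypothetical, not part of oMLX) relating the two:

```python
# Back-of-the-envelope check (not part of oMLX's tooling): relate an
# on-disk model size and its bits-per-weight figure to an approximate
# parameter count.
def params_from_size(size_bytes: float, bits_per_weight: float) -> float:
    """Approximate parameter count: total bits divided by bits/weight."""
    return size_bytes * 8 / bits_per_weight

# 14 GB at 4.501 bits/weight works out to roughly 25 billion parameters.
print(round(params_from_size(14e9, 4.501) / 1e9))
```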
Single‑Request Tests
For a 32 K‑token request, TTFT (time‑to‑first‑token) was high and overall throughput was only 11 tokens/s. Growing the input to 4096 tokens raised TTFT from 4.8 s to 18.8 s, while generation throughput held around 19.8 tokens/s and peak memory grew from 5.66 GB to 6.40 GB.
Concurrent requests of 2–4 increased total throughput noticeably, but scaling to 8 concurrent streams approached the platform limit and caused large latency penalties.
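Metrics like these can be computed from per‑token arrival timestamps collected while streaming a response. A minimal sketch of the bookkeeping (the timing convention is an assumption, not necessarily oMLX's exact methodology):

```python
# Sketch: compute TTFT and decode throughput from token arrival
# timestamps (e.g. collected with time.monotonic() while streaming
# from an OpenAI-compatible endpoint). Timing convention is assumed:
# TTFT spans request start to first token; decode throughput counts
# only tokens produced after the first one.
def stream_metrics(token_times: list[float], t_request: float) -> dict:
    """token_times: monotonic timestamps, one per received token."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - t_request
    decode_window = token_times[-1] - token_times[0]
    tok_s = (len(token_times) - 1) / decode_window if decode_window > 0 else 0.0
    return {"ttft_s": round(ttft, 2), "decode_tok_s": round(tok_s, 1)}
```

For example, a first token at 4.8 s followed by tokens every 0.1 s yields a TTFT of 4.8 s and a decode rate of 10 tokens/s.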
Continuous Batch Processing
Based on the mlx‑lm BatchGenerator, benchmarks on an M3 Ultra (512 GB) with Qwen3.5‑122B‑A10B‑4bit showed:
Prompt processing speed (single request, 8 k context): 941 tok/s
Token generation speed: 54.0 tok/s
8× continuous batch throughput: 190.2 tok/s (3.36× improvement)
Peak memory usage: 73 GB
For the Qwen3‑Coder‑Next‑8bit model (8 k context):
Prompt processing speed: 2009 tok/s
8× batch throughput: 243.3 tok/s (4.14× improvement)
Peak memory usage: 85 GB
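Aggregate throughput figures like these can be reproduced client‑side by firing N simultaneous requests and dividing total completion tokens by wall‑clock time. A sketch against the OpenAI‑compatible endpoint (the URL and model id are assumptions, placeholders for your local setup):

```python
# Sketch: a minimal concurrency harness for measuring aggregate decode
# throughput against an OpenAI-compatible server. ENDPOINT and the
# model id below are assumptions, not oMLX-documented values.
import concurrent.futures
import json
import time
import urllib.request

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed default port

def one_request(prompt: str) -> int:
    """Send one non-streaming chat completion; return tokens generated."""
    body = json.dumps({
        "model": "local-model",  # placeholder model id
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(ENDPOINT, body, {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

def throughput(total_tokens: int, elapsed_s: float) -> float:
    """Aggregate tokens per second across all concurrent streams."""
    return total_tokens / elapsed_s

def aggregate_tok_s(prompts: list[str]) -> float:
    """Run all prompts concurrently and report combined throughput."""
    t0 = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor(len(prompts)) as pool:
        total_tokens = sum(pool.map(one_request, prompts))
    return throughput(total_tokens, time.monotonic() - t0)
```

Note that this measures end‑to‑end wall time including prefill, so it will understate pure decode throughput for long prompts.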
Claude Code Optimizations
oMLX offers two official optimizations aimed at Claude Code:
Context scaling for smaller‑context models, which triggers automatic compression at the right moment.
SSE keep‑alive, which reduces the risk of read timeouts during prolonged pre‑fill phases.
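On the client side, those keep‑alives arrive as SSE comment lines. Per the Server‑Sent Events spec, any line beginning with ':' is a comment and must be ignored rather than treated as payload or a stalled stream. A minimal line handler:

```python
# Sketch: handle SSE keep-alive comments during a long prefill. Lines
# starting with ':' are SSE comments (commonly used as keep-alive
# pings); blank lines separate events. Only 'data:' lines carry payload.
def parse_sse_line(line: str):
    """Return the data payload of an SSE line, or None otherwise."""
    line = line.rstrip("\r\n")
    if not line or line.startswith(":"):
        return None  # keep-alive comment or event separator: keep reading
    if line.startswith("data:"):
        return line[len("data:"):].lstrip(" ")
    return None  # other fields (event:, id:, retry:) ignored in this sketch
```

A client that resets its read‑timeout clock whenever any line arrives, including comments, will survive a multi‑minute prefill.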
Additional supported features:
OpenAI‑compatible endpoint: http://localhost:8000/v1
Anthropic‑compatible endpoint: POST /v1/messages
Tool calling
MCP integration
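The two wire formats differ slightly in their required fields. A sketch of the request bodies (payload shapes follow the public OpenAI and Anthropic APIs; the model name is a placeholder):

```python
# Sketch: minimal request bodies for the two compatible interfaces.
# Shapes follow the public OpenAI and Anthropic APIs; "qwen3.5-9b" is
# a placeholder model id, not a confirmed oMLX identifier.
import json

def openai_chat_body(prompt: str) -> str:
    """Body for POST http://localhost:8000/v1/chat/completions."""
    return json.dumps({
        "model": "qwen3.5-9b",  # placeholder
        "messages": [{"role": "user", "content": prompt}],
    })

def anthropic_messages_body(prompt: str) -> str:
    """Body for POST /v1/messages; the Anthropic API requires max_tokens."""
    return json.dumps({
        "model": "qwen3.5-9b",  # placeholder
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    })
```

The practical upshot: tools built against either vendor API (including Claude Code, which speaks the Anthropic format) can point at the local server without code changes.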
Multi‑Model Service
oMLX can host multiple model types within a single service:
Text LLM
Vision LLM (VLM)
OCR model
Embedding model
Reranker
27B Model Issue
The author attempted numerous configuration changes but could not get the 27B model to run, concluding that a Mac with at least 32 GB of unified memory is required. LM‑Studio could load the model, but generation stalled and froze the machine.
Conclusion
oMLX offers a visually appealing, feature‑rich environment for running LLMs locally on Apple silicon, with competitive single‑request speeds and notable gains from continuous batching. However, memory constraints rule out very large models (27B on this machine), and some download and configuration quirks remain.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
