Turning Your Mac into a Private AI Workstation with Cider and Mano‑P

The article analyzes how Ollama's shift to Apple’s MLX framework unlocks major speed gains on M5‑class Macs, then introduces the open‑source Cider inference accelerator and Mano‑P visual agent, detailing their quantization modes, benchmark results, hardware constraints, and how together they enable fast, offline private AI on macOS.

Machine Heart

Ollama switches to MLX on macOS

In March 2026 Ollama announced that its Mac inference engine would replace llama.cpp with Apple’s MLX framework. On Apple Silicon, especially M5 chips, the change raises prefilling speed by more than 57 %, nearly doubles generation speed, and cuts time to first token (TTFT) to about one‑quarter of its previous value. One developer reported a 93 % increase in decoding speed.

Why MLX is faster on Apple Silicon

Apple Silicon uses a unified memory architecture in which the CPU and GPU share the same physical memory, so tensors do not have to be copied between separate host and device memory. MLX is built to exploit this architecture, which gives it a low‑level advantage that frameworks designed around separate host and device memory do not have.

From the M5 generation onward each GPU core embeds a Neural Accelerator matrix‑multiply unit that can be accessed via Metal 4’s TensorOps API, providing dedicated AI‑inference acceleration.

Cider: quantization acceleration for MLX

W8A8: weights and activations are both quantized to INT8, so the Apple GPU can run the matrix multiplication directly in INT8 via TensorOps; the result is then de‑quantized to FP16 for output.
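The W8A8 flow described above (quantize, integer multiply, de‑quantize) can be sketched in plain Python. This is only an illustration of the arithmetic, assuming symmetric per‑tensor scaling; the article does not say which scaling scheme Cider uses, and the function names here are hypothetical, not Cider's API.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization (an assumption): returns (codes, scale)."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def w8a8_linear(activations, weights):
    """Compute y = W x with INT8 weights and INT8 activations.

    weights is a list of rows, one row per output channel.
    """
    n_in = len(activations)
    flat_w = [w for row in weights for w in row]
    w_codes, w_scale = quantize_int8(flat_w)
    x_codes, x_scale = quantize_int8(activations)
    out = []
    for r in range(len(weights)):
        row = w_codes[r * n_in:(r + 1) * n_in]
        acc = sum(wq * xq for wq, xq in zip(row, x_codes))  # integer accumulate
        out.append(acc * w_scale * x_scale)  # de-quantize back to floating point
    return out


def fp_linear(activations, weights):
    """Floating-point reference for comparison."""
    return [sum(w * x for w, x in zip(row, activations)) for row in weights]


weights = [[0.5, -1.0, 0.25], [1.5, 0.75, -0.5]]
x = [1.0, -2.0, 0.5]
print(fp_linear(x, weights))
print(w8a8_linear(x, weights))
```

On a GPU the three stages would be fused into one kernel; here they are separate steps so the rounding error introduced by each stage is easy to inspect.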

W4A8: extends W8A8 by compressing weights further to INT4, which halves weight memory relative to INT8, while keeping INT8 activations. Both modes use a fused kernel that merges quantization, multiplication, and de‑quantization into a single GPU dispatch.
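To see why INT4 weights take half the memory of INT8 weights, consider packing two signed 4‑bit codes into each byte. The sketch below is a minimal illustration under assumed conventions (symmetric range of -7 to 7, low nibble first); the article does not describe Cider's actual storage layout, and these helper names are hypothetical.

```python
def quantize_int4(values):
    """Symmetric quantization to signed 4-bit codes in [-7, 7] (an assumption)."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 7.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def pack_int4(codes):
    """Pack pairs of signed 4-bit codes into single bytes, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    packed = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed codes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1 (two's complement).
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


codes, scale = quantize_int4([0.9, -0.3, 0.0, 0.45, -0.9, 0.1, 0.6, -0.6])
packed = pack_int4(codes)
print(len(codes), "weights stored in", len(packed), "bytes")
```

Eight weights occupy four bytes here, versus eight bytes as INT8, which is the halving the W4A8 description refers to.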

W8A8/W4A8 activation quantization is stable on Apple M5 Pro but unsupported on M1‑M4.

Single‑operator benchmarks on a 10240 × 2560 matrix on M5 Pro show speedups over native MLX W8A16 of 1.82× (seq len 1024), 1.84× (4096) and 1.86× (8192).

In an end‑to‑end VLM prefill test with Qwen3‑VL‑2B, W8A8 accelerates prefilling by 57 %–61 %.

Accuracy loss is minimal: Qwen3‑8B perplexity rises from 9.726 (FP16) to 9.756 (W8A8), an increase of 0.03, while prefilling time drops from 179.9 s to 123.5 s (≈45 % faster in throughput terms, or about 31 % less time).

The trade‑off is memory. After convert_model(model), W8A8 keeps both the original FP16 weights and the INT8 weights in memory, roughly doubling memory usage. On 16 GB devices this can cause paging; the authors recommend at least 32 GB of RAM for optimal performance.
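The Qwen3‑8B figures above can be read in two ways, as a reduction in elapsed time or as an increase in throughput, and the two percentages differ. The short sketch below works through both from the numbers quoted in the article; the helper names are illustrative only.

```python
def time_reduction_pct(before_s, after_s):
    """Percentage of wall-clock time saved."""
    return (before_s - after_s) / before_s * 100.0


def speedup_pct(before_s, after_s):
    """Percentage increase in throughput (work done per second)."""
    return (before_s / after_s - 1.0) * 100.0


# Qwen3-8B prefill times quoted in the article (FP16 vs W8A8).
fp16_s, w8a8_s = 179.9, 123.5
print(f"time saved: {time_reduction_pct(fp16_s, w8a8_s):.1f} %")   # about 31 %
print(f"throughput gain: {speedup_pct(fp16_s, w8a8_s):.1f} %")     # about 46 %

# Perplexity change quoted in the article.
ppl_fp16, ppl_w8a8 = 9.726, 9.756
print(f"perplexity delta: {ppl_w8a8 - ppl_fp16:.3f}")
print(f"relative change: {(ppl_w8a8 / ppl_fp16 - 1.0) * 100.0:.2f} %")
```

The "faster" percentages in this article match the throughput reading, so they are larger than the corresponding share of time saved.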

Experimental ANE + GPU heterogeneous module

An experimental module splits linear‑layer computation, assigning ~65 % of output channels to the Apple Neural Engine and the remainder to the GPU. On M4, this yields an additional 3 %–17 % speedup for Qwen3‑VL‑2B prefilling.
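Splitting a linear layer by output channels works because each output channel depends only on its own weight row, so the two partial results can be computed independently and concatenated. The sketch below shows that idea in plain Python with the 65 % ratio from the article; it runs both parts on the CPU, and the function names are hypothetical. How Cider actually dispatches work to the ANE and GPU is not described in the article.

```python
def linear(weight_rows, x):
    """y[i] = dot(weight_rows[i], x), one row per output channel."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


def split_linear(weight_rows, x, ane_fraction=0.65):
    """Compute the same layer in two independent slices of output channels.

    In a real heterogeneous setup the first slice would go to the ANE and
    the second to the GPU, running concurrently. Here both run sequentially.
    """
    n_ane = round(len(weight_rows) * ane_fraction)
    ane_part = linear(weight_rows[:n_ane], x)   # stand-in for the ANE share
    gpu_part = linear(weight_rows[n_ane:], x)   # stand-in for the GPU share
    return ane_part + gpu_part, n_ane


# Toy layer: 20 output channels, 4 input features.
rows = [[(r + 1) * 0.1, -0.2 * c_off, 0.05 * r, 1.0]
        for r, c_off in zip(range(20), range(20))]
x = [1.0, 2.0, -1.0, 0.5]

y_split, n_ane = split_linear(rows, x)
print(n_ane, "channels in the first slice,", len(rows) - n_ane, "in the second")
print(y_split == linear(rows, x))
```

The split ratio only pays off when both devices finish at about the same time, which is presumably why the reported gain varies between 3 % and 17 %.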

Mano‑P: visual GUI agent

Mano‑P is an open‑source GUI‑agent model (Apache 2.0) that perceives screen content visually and interacts with any graphical interface without relying on CDP or HTML parsing.

OSWorld benchmark: Mano‑P 1.0‑72B achieves 58.2 % success, 13 points ahead of the runner‑up.

WebRetriever Protocol I: scores 41.7, surpassing Gemini 2.5 Pro (40.9) and Claude 4.5 (31.3).

On Apple M4 Pro, a 4 B quantized model reaches 476 tokens/s prefill and 76 tokens/s decode with 4.3 GB peak memory.
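As a rough guide to what these throughput figures mean in practice, the sketch below estimates response latency from the quoted prefill and decode rates. It assumes both rates stay constant regardless of sequence length, which is a simplification, and the prompt and output sizes are hypothetical values chosen for illustration.

```python
PREFILL_TOKENS_PER_S = 476.0  # quoted for the 4B quantized model on M4 Pro
DECODE_TOKENS_PER_S = 76.0    # quoted for the same configuration


def estimate_latency_s(prompt_tokens, output_tokens):
    """Return (prefill seconds, decode seconds, total seconds)."""
    prefill_s = prompt_tokens / PREFILL_TOKENS_PER_S
    decode_s = output_tokens / DECODE_TOKENS_PER_S
    return prefill_s, decode_s, prefill_s + decode_s


# Hypothetical request: a 2000-token prompt and a 100-token action output.
prefill_s, decode_s, total_s = estimate_latency_s(2000, 100)
print(f"prefill {prefill_s:.2f} s + decode {decode_s:.2f} s = {total_s:.2f} s")
```

For screenshot-heavy agent prompts, prefill dominates the total, which is why prefill acceleration matters for this workload.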

When combined with Cider’s W8A8 activation quantization on M5 Pro, Mano‑P 1.0‑4B’s prefill time improves from 2.839 s to 2.519 s (≈12.7 % faster).

In a 100‑task CUA benchmark (Mano‑AFK pipeline) on a MacBook Pro M5 (16 GB), Mano‑P’s accuracy drops from 58.0 % (W8A16) to 54.0 % (W8A8) due to the memory‑doubling effect of W8A8.

Private AI paradigm

Coupling Cider’s speed with Mano‑P’s visual capabilities enables fully offline AI execution: data, inference and model capabilities remain on the user’s device, eliminating cloud calls and preserving privacy.

Repository URLs:

https://github.com/Mininglamp-AI/cider

https://github.com/Mininglamp-AI/Mano-P
