Old Zhang's AI Learning
Apr 12, 2026 · Artificial Intelligence

How to Deploy MiniMax-M2.7 Quantized Models Locally on macOS and Linux

This guide explains the 22 GGUF quantized versions of MiniMax-M2.7 released by Unsloth, compares their accuracy and size, recommends the UD‑Q4_K_XL model for the best quality‑to‑size trade‑off, and provides step‑by‑step instructions for local deployment via Unsloth Studio, llama.cpp, an API server, or the native MLX route, along with important pitfalls and performance‑tuning tips.

Dynamic 2.0 · Local Deployment · MLX
0 likes · 14 min read
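As a quick illustration of the size side of that trade-off (the figures below are generic back-of-envelope numbers, not the article's measurements): a GGUF file's size is roughly parameter count × effective bits per weight ÷ 8, plus a small amount of metadata overhead.

```python
def approx_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough GGUF file size in GB: params * bits / 8 bytes, ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9

# Hypothetical 7B-parameter model at approximate effective bits per weight
# for common llama.cpp quantization levels:
for name, bits in [("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"{name}: ~{approx_gguf_size_gb(7e9, bits):.1f} GB")
```

The same arithmetic explains why a 4-bit-class quant like UD‑Q4_K_XL is usually the sweet spot: it roughly halves the file size versus 8-bit while losing far less accuracy than 2-bit.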
Old Zhang's AI Learning
Mar 19, 2026 · Artificial Intelligence

Testing the Trending oMLX on Mac: Claude‑Opus‑4.6 Distilled and Qwen3.5‑9B Performance Review

The article evaluates oMLX, a Mac‑only LLM runtime built on Apple Silicon and MLX, walking through installation, UI features, memory usage, single‑request speed, benchmark results for Claude‑Opus‑4.6 and Qwen3.5‑9B, continuous‑batching throughput gains, Claude Code optimizations, multi‑model support, and an unsuccessful attempt to run a 27B model.

Apple Silicon · Claude Opus · MLX
0 likes · 9 min read
AI Engineering
Feb 27, 2026 · Artificial Intelligence

jina-grep: Adding Semantic Search Capabilities to Grep on Apple Silicon

Jina-grep is an open-source CLI that adds fast, MLX-powered semantic search to grep on Apple Silicon, offering three modes, sub-millisecond latency, high token throughput, and easy installation, making local code and log searching more accurate than keyword matching.

Apple Silicon · CLI · MLX
0 likes · 4 min read
AI Engineering
Feb 15, 2026 · Artificial Intelligence

Qwen3‑ASR Runs Natively on Apple Silicon via MLX for Full‑Speed Speech Recognition

A developer has re‑implemented the state‑of‑the‑art Qwen3‑ASR model in MLX, enabling native execution on Apple M1‑M4 chips with real‑time factors as low as 0.08, 4‑bit quantization speedups of 4.7×, multilingual support for 52 languages, and features such as word‑level timestamps and streaming transcription.

Apple Silicon · MLX · Quantization
0 likes · 5 min read
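For readers unfamiliar with real-time factor (RTF): it is processing time divided by audio duration, so the quoted RTF of 0.08 means a clip is transcribed in roughly 8% of its playback length. A quick sanity check of that figure:

```python
def processing_time_s(audio_seconds: float, rtf: float) -> float:
    """Real-time factor (RTF) = processing_time / audio_duration,
    so processing_time = audio_duration * RTF."""
    return audio_seconds * rtf

# At RTF 0.08, a 60-second clip takes about 4.8 s to transcribe.
print(round(processing_time_s(60, 0.08), 2))  # 4.8
```

An RTF below 1.0 is what makes streaming transcription feasible, since the model keeps up with (and here far outpaces) incoming audio.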
AI Engineering
Jan 7, 2026 · Artificial Intelligence

Unsloth-MLX: Fine‑Tune LLMs on Mac and Seamlessly Move Code to Cloud GPUs

Unsloth‑MLX leverages Apple’s MLX framework to let Mac users with Apple Silicon fine‑tune large language models locally with a single import change, offering zero‑cost migration to cloud GPUs, supporting SFT, DPO, ORPO, and GRPO training, and export to Hugging Face or GGUF formats.

Apple Silicon · GPU Cloud · LLM Fine-Tuning
0 likes · 4 min read
Architect's Alchemy Furnace
May 7, 2025 · Artificial Intelligence

Which LLM Inference Engine Reigns Supreme? A Deep Dive into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

This article provides a comprehensive comparison of popular large‑language‑model inference engines, including Transformers, vLLM, Llama.cpp, SGLang, MLX, and Ollama, detailing their core features, performance characteristics, hardware compatibility, concurrency support, and ideal use cases, plus practical installation guidance for Xinference.

Inference · LLM · MLX
0 likes · 17 min read