How to Run MiniMax‑M2.7 on Mac: Comparing Two Quantization Paths
This article explains why standard uniform quantization fails for the 228‑billion‑parameter MiniMax‑M2.7 MoE model on macOS, then compares two practical paths: JANGTQ + MLX Studio, whose 2‑bit mixed precision reaches 91.5 % MMLU in a 56.5 GB footprint, and LM Studio + GGUF, which is easier to set up but needs at least 138 GB of RAM and delivers lower accuracy.
Overview
MiniMax-M2.7 is a 228.7 B‑parameter Mixture‑of‑Experts (MoE) language model with a 192K context window. Each token activates roughly 10 B parameters. Reported benchmark scores include SWE‑Pro 56.22 % and MLE Bench Lite 66.6 %.
Why standard MLX uniform quantization fails
Uniform quantization of the entire model in the MLX ecosystem reduces MMLU accuracy to about 25 % because the router gate is also quantized, causing tokens to be routed to incorrect experts.
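The failure mode is easy to reproduce in miniature: a router picks the expert with the highest gate logit, and coarse quantization of the gate weights perturbs those logits enough to flip the argmax whenever two experts score closely. The toy sketch below (random weights, made-up sizes, no relation to MiniMax's actual router) shows how often 2‑bit uniform quantization reroutes tokens:

```python
import random

random.seed(0)
HIDDEN, EXPERTS, TOKENS = 16, 8, 500

def uniform_quant(xs, bits=2):
    # Map each float onto one of 2**bits evenly spaced levels between min and max.
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((x - lo) / step) * step for x in xs]

# Toy router: one weight vector per expert; a real router is a learned layer.
w  = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(EXPERTS)]
wq = [uniform_quant(row) for row in w]

def route(token, weights):
    # The routed expert is the one with the highest gate logit (dot product).
    scores = [sum(t * v for t, v in zip(token, row)) for row in weights]
    return scores.index(max(scores))

tokens = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(TOKENS)]
mismatch = sum(route(t, w) != route(t, wq) for t in tokens) / TOKENS
print(f"{mismatch:.0%} of tokens routed to a different expert after 2-bit quantization")
```

Even in this tiny setup a large share of tokens land on the wrong expert, which is why whole-model MMLU collapses to near-random.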
Path 1 – JANGTQ + MLX Studio (recommended)
JANGTQ (JANG TurboQuant) is a mixed‑precision quantization scheme that keeps the router gate, attention layers, and shared experts at 8‑bit or fp16 while compressing the expert MLP (≈98 % of parameters) with a 2‑bit codebook and Hadamard rotation.
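The precision split described above can be sketched as a simple per-layer policy. The layer names below are hypothetical placeholders, not MiniMax‑M2.7's actual module paths, and the dispatch logic is only an illustration of the scheme's shape:

```python
def jangtq_bits(layer_name: str) -> int:
    """Bit-width a JANGTQ-style mixed-precision scheme might assign to a layer.

    Router gate stays full precision; attention and shared experts get 8-bit;
    the expert MLPs (~98% of parameters) get the 2-bit codebook."""
    if "router" in layer_name or "gate" in layer_name:
        return 16
    if "attn" in layer_name or "shared_expert" in layer_name:
        return 8
    if "experts" in layer_name:
        return 2
    return 16  # embeddings, norms, anything unclassified

# Illustrative module paths only:
for name in ["layers.0.router.gate", "layers.0.attn.q_proj",
             "layers.0.experts.5.mlp.up_proj", "embed_tokens"]:
    print(f"{name} -> {jangtq_bits(name)}-bit")
```

The point of the split is that the few parameters that decide routing are also the ones most sensitive to quantization error, so they are the wrong place to save bits.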
Installation and inference example
pip install jang-tools

from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate
model_path = snapshot_download("JANGQ-AI/MiniMax-M2.7-JANGTQ")
model, tokenizer = load_jangtq_model(model_path)
messages = [{"role": "user", "content": "Explain photosynthesis in 5 sentences"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
out = generate(model, tokenizer, prompt, max_tokens=600, verbose=True)
# Strip reasoning chain
if "</think>" in out:
    out = out.split("</think>")[-1].strip()
print(out)
Hardware requirements and performance
Minimum RAM: 64 GB (96 GB recommended) on Apple Silicon.
Disk footprint: 56.5 GB.
MMLU (200‑question) ≈ 91.5 %.
Speed on M3 Ultra: ~44 tokens/s.
Performance per Apple Silicon model:
M3 Ultra / M2 Ultra (96 GB RAM) – ~44 tok/s.
M4 Max (96 GB RAM) – ~35‑40 tok/s.
M4 Pro (64 GB RAM) – ~25‑30 tok/s (tight).
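A back-of-envelope check shows why the footprint lands near 56.5 GB: the 2‑bit expert weights dominate. The calculation below is only approximate; real formats add codebooks, scales, and metadata, and a naive 8‑bit cost for the remaining ~2 % of parameters would add a further ~4.6 GB, so the published figure implies somewhat tighter packing than this sketch:

```python
# Rough disk-footprint estimate for the JANGTQ split (figures from the article).
total_params = 228.7e9   # total MiniMax-M2.7 parameters
expert_frac = 0.98       # share of parameters in expert MLPs (2-bit)

expert_gb = total_params * expert_frac * 2 / 8 / 1e9        # 2 bits/param
rest_gb = total_params * (1 - expert_frac) * 8 / 8 / 1e9    # naive 8 bits/param
print(f"experts ~{expert_gb:.1f} GB, non-expert (naive 8-bit) ~{rest_gb:.1f} GB")
```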
Path 2 – LM Studio + GGUF (simpler)
LM Studio ships a pre‑quantized GGUF version of MiniMax‑M2.7 built with llama.cpp b8778. The GGUF file is available from lmstudio-community/MiniMax-M2.7-GGUF.
Default generation parameters:
Temperature = 1.0 (required).
Top K = 40.
Top P = 0.95.
Steps
Download and install LM Studio from https://lmstudio.ai/download.
Search for minimax/minimax-m2.7 and select the GGUF version.
Set the parameters above.
Start a chat.
LM Studio reports a minimum system memory requirement of 138 GB. On a 96 GB Mac the model can run but MMLU drops to roughly 64‑65 % for the 4‑bit version.
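If you drive LM Studio's local server programmatically instead of through the chat UI, the parameters above map onto an OpenAI-style request body. A sketch follows; the model id mirrors the search string above, the default endpoint is LM Studio's `http://localhost:1234/v1/chat/completions`, and `top_k` is a common local-server extension rather than a core OpenAI field:

```python
import json

# Request body for LM Studio's OpenAI-compatible chat endpoint.
# POST this to http://localhost:1234/v1/chat/completions with a running server.
payload = {
    "model": "minimax/minimax-m2.7",
    "messages": [{"role": "user", "content": "Explain photosynthesis in 5 sentences"}],
    "temperature": 1.0,   # required; 0 makes the reasoning chain loop
    "top_p": 0.95,
    "top_k": 40,          # extension field accepted by local servers, not core OpenAI
    "max_tokens": 8192,   # generous budget for the always-on reasoning chain
}
print(json.dumps(payload, indent=2))
```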
Comparison
Disk usage: JANGTQ 56.5 GB vs GGUF ≈108 GB.
Minimum RAM: JANGTQ 64 GB vs GGUF 138 GB.
MMLU quality: JANGTQ 91.5 % vs GGUF ~64‑65 % (4‑bit).
Speed on M3 Ultra: JANGTQ ~44 tok/s; GGUF not yet measured.
Ease of use: JANGTQ requires installing jang-tools; GGUF works out of the box.
Ecosystem compatibility: JANGTQ integrates with the MLX ecosystem; GGUF exposes an OpenAI‑compatible API.
Key settings reminder
Temperature must be set to 1.0 – a temperature of 0 causes the always‑reasoning chain to loop indefinitely inside <think> tags.
max_tokens ≥ 8192 – the always‑reasoning mode needs sufficient token budget.
System RAM must exceed the model file size – otherwise the model swaps to disk, causing a drastic speed drop.
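The RAM rule above is easy to check before downloading anything. A minimal sketch, using POSIX `os.sysconf` (macOS/Linux only) and an assumed ~10 % headroom for the KV cache and the OS:

```python
import os

def ram_headroom_ok(model_gb: float, safety_factor: float = 1.1) -> bool:
    """True if total system RAM exceeds the model size with some headroom.

    If RAM is below the model file size, weights spill into swap and
    generation slows drastically; the 1.1 factor is an assumption."""
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    return ram_gb > model_gb * safety_factor

print("56.5 GB JANGTQ fits:", ram_headroom_ok(56.5))
```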
Conclusion
For local deployment of MiniMax‑M2.7 on Apple Silicon, JANGTQ + MLX Studio offers the smallest footprint (56.5 GB) and the highest quality (2‑bit quantization achieving 91.5 % MMLU). LM Studio provides a more user‑friendly, out‑of‑the‑box experience but requires substantially more memory and yields lower accuracy.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.