How to Run MiniMax‑M2.7 on Mac: Comparing Two Quantization Paths

This article explains why standard uniform quantization fails for the 228.7 B‑parameter MiniMax‑M2.7 MoE model on macOS, and compares two practical solutions: JANGTQ + MLX Studio, whose 2‑bit mixed precision reaches 91.5 % MMLU in a 56.5 GB footprint, and LM Studio + GGUF, which is easier to set up but lists a 138 GB minimum RAM requirement and yields lower accuracy.


Overview

MiniMax-M2.7 is a 228.7 B‑parameter Mixture‑of‑Experts (MoE) language model with a 192K context window. Each token activates roughly 10 B parameters. Reported benchmark scores include SWE‑Pro 56.22 % and MLE Bench Lite 66.6 %.

Why standard MLX uniform quantization fails

Quantizing the entire model uniformly in the MLX ecosystem drops MMLU accuracy to about 25 %: the router gate is quantized along with everything else, so tokens are routed to the wrong experts.
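The failure mode is easy to reproduce in isolation. The toy sketch below (plain NumPy, unrelated to the actual model or quantizer) quantizes a random router gate to 2‑bit precision and shows the top‑2 expert selection shifting:

import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=256)            # one token's hidden state
gate = rng.normal(size=(8, 256)) * 0.02  # toy router gate for 8 experts

def fake_quantize(w, bits=2):
    # crude symmetric quantizer, just to inject 2-bit-scale error
    levels = 2 ** (bits - 1) - 1         # values in {-1, 0, 1} for 2-bit
    scale = np.abs(w).max() / levels
    return np.clip(np.round(w / scale), -levels, levels) * scale

def top2(logits):
    return set(np.argsort(logits)[-2:])

full = top2(gate @ hidden)
quant = top2(fake_quantize(gate) @ hidden)
print("fp experts:", full, "| 2-bit experts:", quant, "| changed:", full != quant)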

Path 1 – JANGTQ + MLX Studio (recommended)

JANGTQ (JANG TurboQuant) is a mixed‑precision quantization scheme that keeps the router gate, attention layers, and shared experts at 8‑bit or fp16 while compressing the expert MLP (≈98 % of parameters) with a 2‑bit codebook and Hadamard rotation.
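JANGTQ's 2‑bit codebook and Hadamard rotation are not reproduced here, but the selective‑precision idea can be illustrated with MLX's nn.quantize and a class_predicate that skips the protected layers. The layer‑name patterns in KEEP_HIGH_PRECISION below are assumptions for illustration, not MiniMax's actual module names:

import mlx.nn as nn

KEEP_HIGH_PRECISION = ("router", "gate", "attn", "shared_expert")

def quantize_experts_only(model):
    # Quantize only the expert-MLP Linear layers to 2-bit; any module
    # whose path matches KEEP_HIGH_PRECISION keeps its original precision.
    nn.quantize(
        model,
        group_size=64,
        bits=2,
        class_predicate=lambda path, module: (
            isinstance(module, nn.Linear)
            and not any(key in path for key in KEEP_HIGH_PRECISION)
        ),
    )
    return model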

[Figure: MiniMax-M2.7 architecture]

Installation and inference example

Install the loader from the shell, then run the rest in Python:

pip install jang-tools

from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

# Download the pre-quantized JANGTQ checkpoint from Hugging Face
model_path = snapshot_download("JANGQ-AI/MiniMax-M2.7-JANGTQ")
model, tokenizer = load_jangtq_model(model_path)

messages = [{"role": "user", "content": "Explain photosynthesis in 5 sentences"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
out = generate(model, tokenizer, prompt, max_tokens=600, verbose=True)

# The model always emits a reasoning chain; keep only the final answer
if "</think>" in out:
    out = out.split("</think>")[-1].strip()
print(out)

Hardware requirements and performance

Minimum RAM: 64 GB (96 GB recommended) on Apple Silicon.

Disk footprint: 56.5 GB (sanity-checked below).

MMLU (200-question subset): ≈ 91.5 %.

Throughput by Apple Silicon model:

M3 Ultra / M2 Ultra (96 GB RAM): ~44 tok/s.

M4 Max (96 GB RAM): ~35-40 tok/s.

M4 Pro (64 GB RAM): ~25-30 tok/s (memory headroom is tight).
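A quick sanity check on the 56.5 GB footprint: storing essentially all 228.7 B weights at 2 bits lands in the same ballpark. This back-of-envelope ignores codebook overhead and the small high-precision fraction, on the assumption that they roughly offset:

# Back-of-envelope: on-disk size at 2 bits per weight
params = 228.7e9
print(params * 2 / 8 / 1e9)  # ≈ 57.2 GB, close to the reported 56.5 GB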

Path 2 – LM Studio + GGUF (simpler)

LM Studio ships a pre‑quantized GGUF version of MiniMax‑M2.7 built with llama.cpp b8778. The GGUF file is available from lmstudio-community/MiniMax-M2.7-GGUF.
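If you prefer scripting the download instead of using the LM Studio UI, the same repository can be fetched with huggingface_hub. A minimal sketch; the *.gguf filename pattern is an assumption about the repo layout:

from huggingface_hub import snapshot_download

# Fetch only the GGUF weight files from the LM Studio community repo
path = snapshot_download(
    "lmstudio-community/MiniMax-M2.7-GGUF",
    allow_patterns=["*.gguf"],
)
print(path)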

Default generation parameters:

Temperature = 1.0 (required).

Top K = 40.

Top P = 0.95.

Steps

1. Download and install LM Studio from https://lmstudio.ai/download.

2. Search for minimax/minimax-m2.7 and select the GGUF version.

3. Set the parameters above.

4. Start a chat.

LM Studio reports a minimum system memory requirement of 138 GB. On a 96 GB Mac the 4-bit version can still run, but its MMLU score drops to roughly 64-65 %.
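Once a model is loaded, LM Studio also serves an OpenAI-compatible local API (noted in the comparison below). A minimal sketch with the parameters above, assuming LM Studio's default port 1234, that the server accepts the non-standard top_k field, and the model id from the search step:

import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "minimax/minimax-m2.7",      # assumed model id
        "messages": [{"role": "user", "content": "Explain photosynthesis in 5 sentences"}],
        "temperature": 1.0,   # required; 0 makes the reasoning chain loop
        "top_p": 0.95,
        "top_k": 40,          # LM Studio extension to the OpenAI schema (assumed)
        "max_tokens": 8192,   # reasoning mode needs a generous budget
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])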

Comparison

Disk usage: JANGTQ 56.5 GB vs. GGUF ≈ 108 GB.

Minimum RAM: JANGTQ 64 GB vs. GGUF 138 GB.

MMLU quality: JANGTQ 91.5 % vs. GGUF ~64-65 % (4-bit).

Speed on M3 Ultra: JANGTQ ~44 tok/s; GGUF not yet measured.

Ease of use: JANGTQ requires installing jang-tools; GGUF works out of the box.

Ecosystem compatibility: JANGTQ integrates with the MLX ecosystem; GGUF exposes an OpenAI-compatible API.

[Figure: Path comparison]

Key settings reminder

Temperature must be 1.0: a temperature of 0 causes the always-on reasoning chain to loop indefinitely inside <think> tags.

max_tokens ≥ 8192: the always-on reasoning mode needs a sufficient token budget.

System RAM must exceed the model file size: otherwise the model swaps to disk and throughput collapses.
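Applied to the Path 1 example, these settings look like the sketch below, reusing model, tokenizer, and prompt from that code. make_sampler and the sampler keyword follow recent mlx_lm versions; treat the exact names as assumptions if your version differs:

from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

# temp=1.0 avoids the infinite <think> loop seen at temperature 0
sampler = make_sampler(temp=1.0)

out = generate(
    model, tokenizer, prompt,
    max_tokens=8192,  # generous budget for the always-on reasoning chain
    sampler=sampler,
)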

Conclusion

For local deployment of MiniMax‑M2.7 on Apple Silicon, JANGTQ + MLX Studio offers the smallest footprint (56.5 GB) and the highest quality (2‑bit quantization achieving 91.5 % MMLU). LM Studio provides a more user‑friendly, out‑of‑the‑box experience but requires substantially more memory and yields lower accuracy.
