How to Run MiniMax‑M2.7 on Mac: Comparing Two Quantization Paths
This article explains why standard uniform quantization fails for the 228‑billion‑parameter MiniMax‑M2.7 MoE model on macOS, then compares two practical paths: JANGTQ + MLX Studio, whose 2‑bit mixed precision reaches 91.5 % MMLU in a 56.5 GB footprint, and LM Studio + GGUF, which is easier to set up but needs at least 138 GB of RAM and delivers lower accuracy.
Overview
MiniMax-M2.7 is a 228.7 B‑parameter Mixture‑of‑Experts (MoE) language model with a 192K context window. Each token activates roughly 10 B parameters. Reported benchmark scores include SWE‑Pro 56.22 % and MLE Bench Lite 66.6 %.
Why standard MLX uniform quantization fails
Uniform quantization of the entire model in the MLX ecosystem reduces MMLU accuracy to about 25 % because the router gate is also quantized, causing tokens to be routed to incorrect experts.
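The failure mode is easy to reproduce in miniature: a router picks the expert with the highest gate logit, and coarse quantization of the gate weights perturbs those logits enough to flip the argmax whenever two experts score closely. The toy sketch below (random weights, made-up sizes, no relation to MiniMax's actual router) shows how often 2‑bit uniform quantization reroutes tokens:

```python
import random

random.seed(0)
HIDDEN, EXPERTS, TOKENS = 16, 8, 500

def uniform_quant(xs, bits=2):
    # Map each float onto one of 2**bits evenly spaced levels between min and max.
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((x - lo) / step) * step for x in xs]

# Toy router: one weight vector per expert; a real router is a learned layer.
w  = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(EXPERTS)]
wq = [uniform_quant(row) for row in w]

def route(token, weights):
    # The routed expert is the one with the highest gate logit (dot product).
    scores = [sum(t * v for t, v in zip(token, row)) for row in weights]
    return scores.index(max(scores))

tokens = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(TOKENS)]
mismatch = sum(route(t, w) != route(t, wq) for t in tokens) / TOKENS
print(f"{mismatch:.0%} of tokens routed to a different expert after 2-bit quantization")
```

Even in this tiny setup a large share of tokens land on the wrong expert, which is why whole-model MMLU collapses to near-random.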
Path 1 – JANGTQ + MLX Studio (recommended)
JANGTQ (JANG TurboQuant) is a mixed‑precision quantization scheme that keeps the router gate, attention layers, and shared experts at 8‑bit or fp16 while compressing the expert MLP (≈98 % of parameters) with a 2‑bit codebook and Hadamard rotation.
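The precision split described above can be sketched as a simple per-layer policy. The layer names below are hypothetical placeholders, not MiniMax‑M2.7's actual module paths, and the dispatch logic is only an illustration of the scheme's shape:

```python
def jangtq_bits(layer_name: str) -> int:
    """Bit-width a JANGTQ-style mixed-precision scheme might assign to a layer.

    Router gate stays full precision; attention and shared experts get 8-bit;
    the expert MLPs (~98% of parameters) get the 2-bit codebook."""
    if "router" in layer_name or "gate" in layer_name:
        return 16
    if "attn" in layer_name or "shared_expert" in layer_name:
        return 8
    if "experts" in layer_name:
        return 2
    return 16  # embeddings, norms, anything unclassified

# Illustrative module paths only:
for name in ["layers.0.router.gate", "layers.0.attn.q_proj",
             "layers.0.experts.5.mlp.up_proj", "embed_tokens"]:
    print(f"{name} -> {jangtq_bits(name)}-bit")
```

The point of the split is that the few parameters that decide routing are also the ones most sensitive to quantization error, so they are the wrong place to save bits.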
Installation and inference example
pip install jang-tools

from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate
model_path = snapshot_download("JANGQ-AI/MiniMax-M2.7-JANGTQ")
model, tokenizer = load_jangtq_model(model_path)
messages = [{"role": "user", "content": "Explain photosynthesis in 5 sentences"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
out = generate(model, tokenizer, prompt, max_tokens=600, verbose=True)
# Strip reasoning chain
if "</think>" in out:
    out = out.split("</think>")[-1].strip()
print(out)
Hardware requirements and performance
Minimum RAM: 64 GB (96 GB recommended) on Apple Silicon.
Disk footprint: 56.5 GB.
MMLU (200‑question) ≈ 91.5 %.
Speed on M3 Ultra: ~44 tokens/s.
Performance per Apple Silicon model:
M3 Ultra / M2 Ultra (96 GB RAM) – ~44 tok/s.
M4 Max (96 GB RAM) – ~35‑40 tok/s.
M4 Pro (64 GB RAM) – ~25‑30 tok/s (tight).
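A back-of-envelope check shows why the footprint lands near 56.5 GB: the 2‑bit expert weights dominate. The calculation below is only approximate; real formats add codebooks, scales, and metadata, and a naive 8‑bit cost for the remaining ~2 % of parameters would add a further ~4.6 GB, so the published figure implies somewhat tighter packing than this sketch:

```python
# Rough disk-footprint estimate for the JANGTQ split (figures from the article).
total_params = 228.7e9   # total MiniMax-M2.7 parameters
expert_frac = 0.98       # share of parameters in expert MLPs (2-bit)

expert_gb = total_params * expert_frac * 2 / 8 / 1e9        # 2 bits/param
rest_gb = total_params * (1 - expert_frac) * 8 / 8 / 1e9    # naive 8 bits/param
print(f"experts ~{expert_gb:.1f} GB, non-expert (naive 8-bit) ~{rest_gb:.1f} GB")
```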
Path 2 – LM Studio + GGUF (simpler)
LM Studio ships a pre‑quantized GGUF version of MiniMax‑M2.7 built with llama.cpp b8778. The GGUF file is available from lmstudio-community/MiniMax-M2.7-GGUF.
Default generation parameters:
Temperature = 1.0 (required).
Top K = 40.
Top P = 0.95.
Steps
Download and install LM Studio from https://lmstudio.ai/download.
Search for minimax/minimax-m2.7 and select the GGUF version.
Set the parameters above.
Start a chat.
LM Studio reports a minimum system memory requirement of 138 GB. On a 96 GB Mac the model can run but MMLU drops to roughly 64‑65 % for the 4‑bit version.
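If you drive LM Studio's local server programmatically instead of through the chat UI, the parameters above map onto an OpenAI-style request body. A sketch follows; the model id mirrors the search string above, the default endpoint is LM Studio's `http://localhost:1234/v1/chat/completions`, and `top_k` is a common local-server extension rather than a core OpenAI field:

```python
import json

# Request body for LM Studio's OpenAI-compatible chat endpoint.
# POST this to http://localhost:1234/v1/chat/completions with a running server.
payload = {
    "model": "minimax/minimax-m2.7",
    "messages": [{"role": "user", "content": "Explain photosynthesis in 5 sentences"}],
    "temperature": 1.0,   # required; 0 makes the reasoning chain loop
    "top_p": 0.95,
    "top_k": 40,          # extension field accepted by local servers, not core OpenAI
    "max_tokens": 8192,   # generous budget for the always-on reasoning chain
}
print(json.dumps(payload, indent=2))
```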
Comparison
Disk usage: JANGTQ 56.5 GB vs GGUF ≈108 GB.
Minimum RAM: JANGTQ 64 GB vs GGUF 138 GB.
MMLU quality: JANGTQ 91.5 % vs GGUF ~64‑65 % (4‑bit).
Speed on M3 Ultra: JANGTQ ~44 tok/s; GGUF not yet measured.
Ease of use: JANGTQ requires installing jang-tools; GGUF works out of the box.
Ecosystem compatibility: JANGTQ integrates with the MLX ecosystem; GGUF exposes an OpenAI‑compatible API.
Key settings reminder
Temperature must be set to 1.0 – a temperature of 0 causes the always‑reasoning chain to loop indefinitely inside <think> tags.
max_tokens ≥ 8192 – the always‑reasoning mode needs sufficient token budget.
System RAM must exceed the model file size – otherwise the model swaps to disk, causing a drastic speed drop.
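The RAM rule above is easy to check before downloading anything. A minimal sketch, using POSIX `os.sysconf` (macOS/Linux only) and an assumed ~10 % headroom for the KV cache and the OS:

```python
import os

def ram_headroom_ok(model_gb: float, safety_factor: float = 1.1) -> bool:
    """True if total system RAM exceeds the model size with some headroom.

    If RAM is below the model file size, weights spill into swap and
    generation slows drastically; the 1.1 factor is an assumption."""
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    return ram_gb > model_gb * safety_factor

print("56.5 GB JANGTQ fits:", ram_headroom_ok(56.5))
```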
Conclusion
For local deployment of MiniMax‑M2.7 on Apple Silicon, JANGTQ + MLX Studio offers the smallest footprint (56.5 GB) and the highest quality (2‑bit quantization achieving 91.5 % MMLU). LM Studio provides a more user‑friendly, out‑of‑the‑box experience but requires substantially more memory and yields lower accuracy.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.