Deploy the Open‑Source MiniMax‑M2.7 Model Locally: Step‑by‑Step Guide
MiniMax‑M2.7 is a newly open‑sourced 230‑billion‑parameter MoE model with self‑evolution, professional software‑engineering, and agent capabilities. It can be deployed locally with Ollama, vLLM, SGLang, or Docker on 4–8 H200 GPUs. This guide covers hardware requirements, performance gains, and the model's tool‑calling and Thinking features.
Model Highlights
Self‑evolution: after more than 100 autonomous optimization rounds, performance improves by roughly 30%.
Software engineering: SWE‑Pro score of 56.22%, with code ability claimed comparable to GPT‑5.3‑Codex; production‑grade incident recovery in under 3 minutes.
Office abilities: GDPval‑AA ELO of 1495 (highest among open‑source models); high‑fidelity multi‑turn editing of Word/Excel/PPT documents.
Native Agent Teams: supports multi‑agent collaboration with stable roles and autonomous decision‑making.
Deployment Ecosystem
Base GPU memory requirement ~230 GB. Two H200 GPUs are marginal; official recommendation is at least four H200 GPUs.
Quantized versions are under urgent development; so far only the folder structure has been prepared. Based on prior Unsloth releases, compression to a few dozen GB is expected.
Ollama
Ollama 0.19 includes minimax-m2.7:cloud for free cloud inference, useful when the 230‑billion‑parameter MoE model exceeds local GPU memory.
# Use with OpenClaw
ollama launch openclaw --model minimax-m2.7:cloud
# Direct chat
ollama run minimax-m2.7:cloud
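Ollama also exposes an OpenAI‑compatible endpoint (default http://localhost:11434/v1), so the cloud model can be driven from any OpenAI client. A minimal sketch, assuming a default local Ollama install:
from openai import OpenAI

# Ollama serves an OpenAI-compatible API on port 11434 by default;
# the api_key value is ignored by Ollama but required by the client.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="minimax-m2.7:cloud",
    messages=[{"role": "user", "content": "Summarize MoE routing in two sentences."}],
)
print(response.choices[0].message.content)
vLLM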
vLLM v0.19.0 adds HuggingFace v5 support, multimodal optimizations, and CPU KV‑cache offloading, and provides Day‑0 support for MiniMax‑M2.7.
# Basic deployment (4 × H200/H100/A100)
vllm serve MiniMaxAI/MiniMax-M2.7 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
--enable-auto-tool-choice \
--trust-remote-code
# 8‑GPU deployment (DP+EP mode)
vllm serve MiniMaxAI/MiniMax-M2.7 \
--data-parallel-size 8 \
--enable-expert-parallel \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--enable-auto-tool-choice
Docker one‑click start:
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:minimax27 MiniMaxAI/MiniMax-M2.7 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--enable-auto-tool-choice \
--trust-remote-code
Supported platforms:
NVIDIA: 4 × H200/H100/A100 in tensor‑parallel mode, or 8‑GPU DP+EP / TP+EP modes.
AMD: 2 × or 4 × MI300X/MI325X/MI350X/MI355X with AITER acceleration.
System requirements: ~220 GB of GPU memory for the model weights, plus ~240 GB of KV cache for a 1M‑token context.
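For capacity planning it helps to see where that second number comes from: KV‑cache memory grows linearly with context length. A back‑of‑the‑envelope sketch in Python (the layer count, KV‑head count, and head dimension below are placeholder values, not published MiniMax‑M2.7 figures):
# Rough KV-cache estimate: 2 (K and V) * layers * kv_heads * head_dim
# * bytes_per_element * tokens. All architecture numbers here are
# hypothetical placeholders; substitute the real config values.
num_layers = 60        # placeholder
num_kv_heads = 8       # placeholder (grouped-query attention)
head_dim = 128         # placeholder
bytes_per_elem = 2     # fp16/bf16 KV cache
context_tokens = 1_000_000

kv_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * context_tokens
print(f"KV cache: {kv_bytes / 1024**3:.1f} GiB for {context_tokens:,} tokens")
With these placeholder values the sketch lands near 229 GiB for a 1M‑token context, the same order of magnitude as the ~240 GB figure above.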
SGLang
SGLang provides Day‑0 support.
python -m sglang.launch_server \
--model-path MiniMaxAI/MiniMax-M2.7 \
--tp 4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax-append-think \
--trust-remote-code \
--mem-fraction-static 0.85
The minimax-append-think parser separates reasoning from the final output, enabling a “Thinking” mode.
Quick test:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M2.7",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
]
}'
Recommended inference parameters: temperature=1.0, top_p=0.95, top_k=40.
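The same request from Python with the recommended parameters applied. top_k is not part of the standard OpenAI request schema, so it is passed through extra_body; whether the server honors it depends on the backend:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.7",
    messages=[{"role": "user", "content": "Who won the world series in 2020?"}],
    temperature=1.0,           # recommended
    top_p=0.95,                # recommended
    extra_body={"top_k": 40},  # recommended; passed as a backend-specific field
)
print(response.choices[0].message.content)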
NVIDIA Support
Free trial endpoint: https://build.nvidia.com/minimaxai/minimax-m2.7
Inference optimizations contributed by NVIDIA and the open‑source community:
QK RMS Norm Kernel: merges compute and communication into a single kernel, reducing launch overhead and memory traffic.
FP8 MoE: integrates TensorRT‑LLM's FP8 MoE modular kernel, tuned for MoE models.
Throughput gains on NVIDIA Blackwell Ultra GPUs:
vLLM throughput increased 2.5× (within one month).
SGLang throughput increased 2.7× (within one month).
Additional Ecosystem Components
NemoClaw: NVIDIA's open‑source reference stack for continuous operation of OpenClaw.
Fine‑tuning: NeMo AutoModel enables post‑training with EP + PP schemes; NeMo RL provides GRPO reinforcement‑learning recipes for 8K and 16K sequence lengths.
# NeMo AutoModel fine‑tuning recipe
https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/minimax_m2/minimax_m2.7_hellaswag_pp.yaml
# Distributed training documentation
https://github.com/NVIDIA-NeMo/Automodel/discussions/1786
Transformers: the model can be loaded via HuggingFace Transformers (see the deployment guide in the repository; a loading sketch follows this list).
ModelScope: weights available at https://modelscope.cn/models/MiniMax/MiniMax-M2.7
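A minimal Transformers loading sketch, assuming the checkpoint ships custom modeling code (hence trust_remote_code=True) and enough GPU memory for device_map="auto" to shard the weights:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MiniMaxAI/MiniMax-M2.7"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # keep the checkpoint's native precision
    device_map="auto",       # shard across all visible GPUs
    trust_remote_code=True,  # repo provides custom modeling code
)

inputs = tokenizer("Hello, MiniMax!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))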
Tool Calling and Thinking Mode
M2.7 supports both tool calling and a Thinking mode.
Tool calling example (Python, OpenAI client):
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Declare a function the model may call
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city name"}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.7",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools
)

# Inspect any tool calls the model emitted
message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        print(f"Tool Call: {tool_call.function.name}")
        print(f"  Arguments: {tool_call.function.arguments}")
Thinking mode: the model encloses its reasoning in <think>...</think> tags, so streaming clients can separate the thought process from the final answer.