Deploy the Open‑Source MiniMax‑M2.7 Model Locally: Step‑by‑Step Guide
MiniMax‑M2.7 is a newly open‑sourced 230‑billion‑parameter MoE model with self‑evolution, professional software‑engineering, and agent capabilities. It can be deployed locally with Ollama, vLLM, SGLang, or Docker on 4–8 H200 GPUs. This guide covers hardware requirements, performance gains, and the model's tool‑calling and Thinking features.
Model Highlights
Self‑evolution: after more than 100 autonomous optimization rounds, performance improves by roughly 30%.
Software engineering: SWE‑Pro score of 56.22%, with code ability claimed comparable to GPT‑5.3‑Codex; production‑grade incident recovery in under 3 minutes.
Office abilities: GDPval‑AA ELO of 1495 (highest among open‑source models); high‑fidelity multi‑turn editing of Word/Excel/PPT documents.
Native Agent Teams: supports multi‑agent collaboration with stable roles and autonomous decision‑making.
Deployment Ecosystem
Base GPU memory requirement ~230 GB. Two H200 GPUs are marginal; official recommendation is at least four H200 GPUs.
Quantized versions are under urgent development; so far only the folder structure has been prepared. Based on prior Unsloth releases, compression to a few dozen GB is expected.
Ollama
Ollama 0.19 includes minimax-m2.7:cloud for free cloud inference, useful when the 230‑billion‑parameter MoE model exceeds local GPU memory.
# Use with OpenClaw
ollama launch openclaw --model minimax-m2.7:cloud
# Direct chat
ollama run minimax-m2.7:cloud
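Ollama also exposes an OpenAI‑compatible endpoint (default http://localhost:11434/v1), so the cloud model can be driven from any OpenAI client. A minimal sketch, assuming a default local Ollama install:
from openai import OpenAI

# Ollama serves an OpenAI-compatible API on port 11434 by default;
# the api_key value is ignored by Ollama but required by the client.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="minimax-m2.7:cloud",
    messages=[{"role": "user", "content": "Summarize MoE routing in two sentences."}],
)
print(response.choices[0].message.content)
vLLM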
vLLM v0.19.0 adds HuggingFace v5 support, multimodal optimizations, and CPU KV‑cache offloading, and provides Day‑0 support for MiniMax‑M2.7.
# Basic deployment (4 × H200/H100/A100)
vllm serve MiniMaxAI/MiniMax-M2.7 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
--enable-auto-tool-choice \
--trust-remote-code
# 8‑GPU deployment (DP+EP mode)
vllm serve MiniMaxAI/MiniMax-M2.7 \
--data-parallel-size 8 \
--enable-expert-parallel \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--enable-auto-tool-choice
Docker one‑click start:
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:minimax27 MiniMaxAI/MiniMax-M2.7 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--enable-auto-tool-choice \
--trust-remote-code
Supported platforms:
NVIDIA: 4 × H200/H100/A100 in tensor‑parallel mode, or 8‑GPU DP+EP / TP+EP modes.
AMD: 2 × or 4 × MI300X/MI325X/MI350X/MI355X with AITER acceleration.
System requirements: ~220 GB of GPU memory for the model weights, plus ~240 GB of KV cache for a 1M‑token context.
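For capacity planning it helps to see where that second number comes from: KV‑cache memory grows linearly with context length. A back‑of‑the‑envelope sketch in Python (the layer count, KV‑head count, and head dimension below are placeholder values, not published MiniMax‑M2.7 figures):
# Rough KV-cache estimate: 2 (K and V) * layers * kv_heads * head_dim
# * bytes_per_element * tokens. All architecture numbers here are
# hypothetical placeholders; substitute the real config values.
num_layers = 60        # placeholder
num_kv_heads = 8       # placeholder (grouped-query attention)
head_dim = 128         # placeholder
bytes_per_elem = 2     # fp16/bf16 KV cache
context_tokens = 1_000_000

kv_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * context_tokens
print(f"KV cache: {kv_bytes / 1024**3:.1f} GiB for {context_tokens:,} tokens")
With these placeholder values the sketch lands near 229 GiB for a 1M‑token context, the same order of magnitude as the ~240 GB figure above.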
SGLang
SGLang provides Day‑0 support.
python -m sglang.launch_server \
--model-path MiniMaxAI/MiniMax-M2.7 \
--tp 4 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax-append-think \
--trust-remote-code \
--mem-fraction-static 0.85
The minimax-append-think parser separates reasoning from the final output, enabling a “Thinking” mode.
Quick test:
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M2.7",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
]
}'
Recommended inference parameters: temperature=1.0, top_p=0.95, top_k=40.
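The same request from Python with the recommended parameters applied. top_k is not part of the standard OpenAI request schema, so it is passed through extra_body; whether the server honors it depends on the backend:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.7",
    messages=[{"role": "user", "content": "Who won the world series in 2020?"}],
    temperature=1.0,           # recommended
    top_p=0.95,                # recommended
    extra_body={"top_k": 40},  # recommended; passed as a backend-specific field
)
print(response.choices[0].message.content)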
NVIDIA Support
Free trial endpoint: https://build.nvidia.com/minimaxai/minimax-m2.7
Inference optimizations contributed by NVIDIA and the open‑source community:
QK RMS Norm Kernel: merges compute and communication into a single kernel, reducing launch overhead and memory traffic.
FP8 MoE: integrates TensorRT‑LLM's FP8 MoE modular kernel, tuned for MoE models.
Throughput gains on NVIDIA Blackwell Ultra GPUs:
vLLM throughput increased 2.5× (within one month).
SGLang throughput increased 2.7× (within one month).
Additional Ecosystem Components
NemoClaw: NVIDIA's open‑source reference stack for continuous operation of OpenClaw.
Fine‑tuning: NeMo AutoModel enables post‑training with EP + PP schemes; NeMo RL provides GRPO reinforcement‑learning recipes for 8K and 16K sequence lengths.
# NeMo AutoModel fine‑tuning recipe
https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/minimax_m2/minimax_m2.7_hellaswag_pp.yaml
# Distributed training documentation
https://github.com/NVIDIA-NeMo/Automodel/discussions/1786
Transformers: the model can be loaded via HuggingFace Transformers (see the deployment guide in the repository; a loading sketch follows this list).
ModelScope: weights available at https://modelscope.cn/models/MiniMax/MiniMax-M2.7
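A minimal Transformers loading sketch, assuming the checkpoint ships custom modeling code (hence trust_remote_code=True) and enough GPU memory for device_map="auto" to shard the weights:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MiniMaxAI/MiniMax-M2.7"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # keep the checkpoint's native precision
    device_map="auto",       # shard across all visible GPUs
    trust_remote_code=True,  # repo provides custom modeling code
)

inputs = tokenizer("Hello, MiniMax!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))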
Tool Calling and Thinking Mode
M2.7 supports both tool calling and a Thinking mode.
Tool calling example (Python, OpenAI client):
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Declare a function the model may call
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city name"}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.7",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools
)

# Inspect any tool calls the model emitted
message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        print(f"Tool Call: {tool_call.function.name}")
        print(f"  Arguments: {tool_call.function.arguments}")
Thinking mode: the model encloses its reasoning in <think>...</think> tags, so streaming clients can separate the thought process from the final answer.