MiniMax M2.7 Self‑Trains and Rivals GPT‑5 & Opus 4.6 on Eight Benchmarks

MiniMax M2.7, released just a month after M2.5, introduces a self‑evolution training loop and posts competitive scores against Claude Opus 4.6, GPT‑5.4, Sonnet 4.6 and Gemini 3.1 Pro across eight benchmarks, while showcasing autonomous skill building, multi‑agent collaboration, and real‑world productivity applications.


MiniMax officially launched M2.7, describing it as the first model that "deeply participates in its own evolution." The authors explain that the model assists itself during training, a concept they term "self‑training."

Benchmark performance: the model was evaluated on eight high‑profile benchmarks against Claude Opus 4.6, Sonnet 4.6, GPT‑5.4 and Gemini 3.1 Pro.

SWE Bench Pro: 56.2% (close to GPT‑5.4’s 57.7%, ahead of Gemini 3.1 Pro’s 54.2%).

Multi‑SWE Bench: 52.5%, slightly above Sonnet 4.6’s 51% and well ahead of GPT‑5.4’s 49%.

VIBE‑Pro (repo‑level code generation): 56.2%, essentially equal to Sonnet 4.6’s 56%.

MLE‑Bench Lite (machine‑learning autonomy): 66.6% medal rate, second only to Opus 4.6’s 71.2% and far above Sonnet 4.6 (55.6%) and M2.5 (51.5%).

GDPval‑AA: Elo score of 1495, the highest among open‑source models.

Toolathlon: 46.3% accuracy (a middling showing).

MM‑ClawBench: 62.7%, a large jump from M2.5’s 42.5% but still behind Sonnet 4.6’s 75.6%.

Artificial Analysis: a score of 57, on par with Opus 4.6.

The authors highlight the "self‑training" mechanism: during the reinforcement‑learning phase, the model actively participates in the training loop, autonomously building over 40 complex skills (each over 2,000 tokens) with a 97% compliance rate, maintaining a persistent memory system whose architecture is updated based on feedback, and completing more than 100 rounds of self‑optimisation.

This approach mirrors DeepSeek’s earlier "aha moment" but extends it by letting the model directly contribute to its own capability construction rather than merely producing insights at inference time. Official figures claim that M2.7 can handle 30%‑50% of tasks traditionally performed by human researchers.
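MiniMax has not published implementation details for this loop, so the sketch below is only a guess at its shape, reconstructed from the announcement's claims; every class, function and threshold in it is a hypothetical stand‑in, not MiniMax code.

```python
"""Hypothetical sketch of a "self-training" round, reconstructed from the
public claims only: the model drafts reusable skills during RL, a compliance
gate decides what enters the skill library, and a persistent memory is
revised from feedback. Nothing here is MiniMax's actual implementation."""

import random


class SkillLibrary:
    def __init__(self):
        self.skills = []


class PersistentMemory:
    """Stands in for the persistent memory M2.7 reportedly keeps across rounds."""
    def __init__(self):
        self.notes = []


def model_solve(task, skills, memory):
    # Placeholder for the model attempting a task with its current skills.
    return {"task": task, "reward": random.random()}


def distill_skill(trace):
    # Placeholder for the model writing a reusable skill (reported as
    # >2000 tokens each) out of its own solution trace.
    return f"skill distilled from {trace['task']}"


def passes_compliance(skill):
    # Placeholder gate; the announced 97% compliance rate would be measured here.
    return random.random() < 0.97


library, memory = SkillLibrary(), PersistentMemory()

# M2.7 is described as completing 100+ such self-optimisation rounds.
for round_id in range(100):
    for task_id in range(4):
        trace = model_solve(f"task-{round_id}-{task_id}", library.skills, memory)
        skill = distill_skill(trace)
        if passes_compliance(skill):
            library.skills.append(skill)
        memory.notes.append(trace["reward"])  # memory revised from feedback

print(f"{len(library.skills)} skills accepted into the library")
```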

In a reinforcement‑learning scenario, a researcher starts with an experimental idea and discusses it with an agent; the agent assists with the literature review, drafts tracking specifications, builds data pipelines, and launches the experiment. Throughout the run the agent monitors progress, analyses logs, triggers debugging and metric analysis, applies code fixes, opens pull requests, runs smoke tests, and makes subtle but critical configuration changes. Human researchers intervene only for key decisions, accelerating problem discovery and model delivery.
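Purely as an illustration of that monitor‑analyse‑fix‑verify cycle, here is a minimal runnable sketch; the Diagnosis type, the polling stub and the escalation rule are all invented for this example.

```python
"""Toy version of the agent's monitor-analyse-fix-verify cycle described
above. The Diagnosis type, the polling stub and the escalation rule are
invented for illustration; none of this is MiniMax's published tooling."""

import random
from dataclasses import dataclass


@dataclass
class Diagnosis:
    summary: str
    needs_human: bool  # only key decisions are escalated to the researcher


def poll_run():
    # Placeholder for reading logs and metrics from a live training run.
    return random.choice([
        None,                                    # run looks healthy
        Diagnosis("loss spike after step 12k", needs_human=False),
        Diagnosis("eval protocol looks wrong", needs_human=True),
    ])


def supervise(max_checks=10):
    for _ in range(max_checks):
        diagnosis = poll_run()
        if diagnosis is None:
            continue                             # keep monitoring
        if diagnosis.needs_human:
            print("escalate to researcher:", diagnosis.summary)
        else:
            # The agent drafts a fix, opens a PR, and smoke-tests it on its own.
            print("auto-fix -> PR -> smoke test:", diagnosis.summary)


supervise()
```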

Multi‑agent collaboration is a core feature of M2.7 via the native Agent Teams framework, which provides the following (a minimal sketch follows the list):

Role differentiation: distinct agents with clear responsibilities.

Dynamic tool search: agents can discover and invoke required tools autonomously.

Research Agent: a specialised agent framework for self‑iterative research.
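Since the Agent Teams API itself is not documented in the announcement, the following sketch only illustrates the first two mechanisms named above, role differentiation and dynamic tool search; every class and method name is invented for the example.

```python
"""Illustration of role differentiation plus dynamic tool search. The Agent
Teams API is not public, so every name below is a hypothetical stand-in."""


class ToolRegistry:
    def __init__(self, tools):
        self._tools = tools  # name -> callable

    def search(self, task):
        # Naive keyword match stands in for real run-time tool discovery.
        for name, tool in self._tools.items():
            if name in task:
                return tool
        return lambda t: f"no tool found for {t!r}"


class Agent:
    def __init__(self, role, registry):
        self.role = role          # role differentiation: one clear responsibility
        self.registry = registry

    def act(self, task):
        tool = self.registry.search(task)  # dynamic tool search at run time
        return f"[{self.role}] {tool(task)}"


registry = ToolRegistry({
    "literature": lambda t: "compiled a literature survey",
    "experiment": lambda t: "launched the experiment pipeline",
})

# A two-agent team with differentiated roles sharing one tool registry.
research = Agent("research", registry)
engineering = Agent("engineering", registry)
print(research.act("survey the literature on RL skills"))
print(engineering.act("run the experiment end to end"))
```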

Real‑world deployments demonstrate the model’s versatility:

Software engineering: production‑grade debugging that reduces mean time to recovery to under three minutes, with causal reasoning over monitoring metrics, trace analysis and database verification.

Office productivity: enhanced Word, Excel and PowerPoint editing, multi‑turn modifications, template‑based document generation, and the ability to read financial reports, build revenue models and generate presentation material.

Interactive entertainment: the open‑source OpenRoom framework (GitHub: https://github.com/MiniMax-AI/OpenRoom) offers GUI‑based agent interaction with real‑time visual feedback and role consistency.

Highlights:

The "self‑evolution" training paradigm could reshape efficiency limits if it continues to succeed.

MLE‑Bench Lite’s 66.6% medal rate offers a concrete, task‑grounded measure of the model’s autonomous machine‑learning ability.

Rapid iteration: M2.7 arrived only one month after M2.5.

Pricing remains low, at 1.2 USD per million tokens (prompt caching at 0.06 USD per million tokens); a worked cost example follows this list.
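As a back‑of‑the‑envelope check on those rates, here is a worked cost estimate; the session size and the 70% cache‑hit fraction are arbitrary assumptions for illustration, not published billing figures.

```python
# Cost at the quoted rates: $1.20 per million tokens, $0.06 per million
# prompt-cached tokens. Session size and cache-hit rate are assumptions.
RATE = 1.20 / 1_000_000         # USD per uncached token
CACHED_RATE = 0.06 / 1_000_000  # USD per prompt-cached token

tokens = 5_000_000              # hypothetical long agent session
cache_hit = 0.70                # assumed fraction served from the prompt cache

cost = tokens * (cache_hit * CACHED_RATE + (1 - cache_hit) * RATE)
print(f"${cost:.2f}")           # about $2.01 under these assumptions
```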

Shortcomings:

On comprehensive benchmarks such as MM‑ClawBench and GDPval‑AA, M2.7 still trails Opus 4.6 and Sonnet 4.6.

The "self‑training" claim lacks detailed technical disclosure and awaits broader third‑party verification.

MiniMax continues to operate quietly, yet the M2 series’ open‑source release (M2.5) followed by the self‑evolutionary M2.7 marks a distinct technical direction away from merely scaling parameters and data.

Both the MiniMax Agent platform (agent.minimax.io) and the MiniMax API (platform.minimax.io) are now live for developers.
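Earlier M2 models were served through an OpenAI‑compatible endpoint, so a first call probably resembles the snippet below; the base URL and the MiniMax-M2.7 model identifier are assumptions to verify against the platform documentation.

```python
# Hedged first call against the MiniMax API. Earlier M2 models used an
# OpenAI-compatible endpoint; the base URL and the "MiniMax-M2.7" model id
# below are assumptions to check against platform.minimax.io.

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["MINIMAX_API_KEY"],  # key issued on platform.minimax.io
    base_url="https://api.minimax.io/v1",   # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="MiniMax-M2.7",                   # assumed model identifier
    messages=[{"role": "user", "content": "Summarise your Agent Teams feature."}],
)
print(response.choices[0].message.content)
```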

As MiniMax put it on Twitter, "Go break it (we mean it)": the community is invited to validate the model's claims.
