MiniMax M3: First Open‑Source Model to Achieve the Frontier Trio – Our Three‑Task Evaluation
MiniMax M3 claims to be the first open-source LLM to deliver top-tier coding/agentic ability, a 1-million-token context window, and native multimodal understanding at the same time. Our benchmarks on coding suites, long-context efficiency, and multimodal tasks indicate that it exceeds expectations.
MiniMax M3 was announced as the first Chinese open-source model to combine the three "Frontier" capabilities: strong coding/agentic performance, a million-token context window, and native multimodal support. The authors set out three questions to verify these claims: does the coding and agentic ability hold up, is the 1M context genuinely useful, and does multimodal understanding help real tasks?
Frontier Trio Capabilities
The article defines the "Frontier" trio as:
Powerful coding/agentic ability (able to handle real software‑engineering tasks)
Million‑token context window
Native multimodal support (visual information fused from the pre‑training stage)
Only closed‑source models such as Claude Opus 4.7, Gemini 3.1 and GPT‑5.5 previously satisfied all three.
Coding and Agentic Evaluation
Official benchmark numbers place M3 in the top international tier: 59.0% on SWE-Bench Pro (ahead of GPT-5.5 and Gemini 3.1 Pro, close to Opus 4.7), 66.0% on Terminal Bench 2.1, and 28.8% on KernelBench Hard. It also ranks first on Claw-Eval, an end-to-end agent evaluation.
Beyond raw scores, the model can sustain long-running tasks: it iteratively refines code, verifies its own work, and keeps going rather than giving up. In an internal CUDA kernel-optimization test, M3 started from an incomplete Triton skeleton and, over ~24 hours, submitted 147 benchmark runs and invoked tools 1,959 times, raising hardware utilization from 7.6% to 71.3% (a 9.4× speed-up). The model was still improving past the 145th submission, whereas most competitors stopped after ~30 submissions, illustrating a "persistence of exploration" behavior.
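The figures quoted for the kernel-optimization run can be sanity-checked with a few lines of arithmetic. The sketch below assumes the speed-up is simply the ratio of the two utilization figures (that is, the hardware peak stays fixed); the variable names are illustrative and do not come from the source.

```python
# Figures quoted in the text; variable names are illustrative only.
util_before = 7.6      # hardware utilization before optimization, in percent
util_after = 71.3      # hardware utilization after optimization, in percent
benchmark_runs = 147   # benchmark submissions over the ~24-hour run
tool_calls = 1959      # total tool invocations over the same run

# Assumption: with a fixed hardware peak, speed-up equals the utilization ratio.
speedup = util_after / util_before
calls_per_run = tool_calls / benchmark_runs

print(f"speed-up: {speedup:.1f}x")
print(f"tool calls per benchmark run: {calls_per_run:.1f}")
```

Under that assumption the ratio rounds to the 9.4× given in the text, and the run averages roughly 13 tool calls per benchmark submission.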
Training incorporated an "interactive user simulator" that forces the model to handle multi‑turn clarification, feedback‑driven plan adjustment, and cross‑task switching, making its agent ability more realistic than benchmark‑only tuning.
1‑Million‑Token Context Window
A million-token window corresponds to roughly fifteen long novels, or tens of thousands of lines of code plus full project documentation. The authors argue that such a window is foundational infrastructure for long-range agents, video understanding, and complex multi-turn collaboration; without it, higher-level capabilities cannot function reliably.
Native Multimodal Capability
M3’s multimodal training interleaves text, image‑caption pairs, and video data from the first step, producing a tightly aligned semantic space. This enables seamless handling of tasks that require simultaneous understanding of formulas, code comments, and experimental figures.
The pre‑training corpus was expanded to 100 T tokens, and experiments showed that large‑scale interleaved data dramatically improves visual comprehension, shifting the model from "describe‑the‑image" to "understand‑visual‑information‑in‑context".
On the OmniDocBench multimodal benchmark, M3 outperformed Gemini 3.1 Pro, confirming the effectiveness of the native multimodal route.
Practical Tests Conducted by the Authors
Task 1 – Token-Plan Comparison Web Tool: The model was asked to research token pricing from major LLM providers, organize the data, and build a web-based comparison tool. M3 retrieved the pages, structured the data, and delivered a functional tool, adding extra features such as grouping, currency conversion, and theme switching without prompting.
Task 2 – Multimodal UI Generation: Given a cat image, the model was instructed to create a music player UI that reacts to cursor movement. M3 generated the full player, fetched album covers and lyrics, and even added a clickable-lyrics feature, demonstrating multimodal reasoning and creative UI synthesis.
Task 3 – Video-to-Article Conversion: A 40-minute, 270 MB video of Andrej Karpathy's "Software in the AI Era" talk was supplied with a prompt to produce a ~5,000-word media report with appropriate sections and screenshots. After 16 minutes of processing, M3 output a markdown article and image folder that required virtually no editing, showing high-quality long-form generation from multimodal input.
Technical Deep‑Dive: MiniMax Sparse Attention (MSA)
To support the 1M context, M3 introduces MiniMax Sparse Attention, a blockwise sparse attention mechanism. Standard full attention scales quadratically, making a 1M context ~1000× more expensive than a 32K context. MSA partitions KV caches into finer blocks and selects the most relevant blocks, reducing per‑token computation to 1/20 of the previous generation.
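The two ideas in this paragraph, quadratic cost and block selection, can be illustrated with a small pure-Python sketch. It is not MiniMax's implementation: the block-scoring rule (dot product with each block's mean key), the function names, and the absence of causal masking are all assumptions made for illustration, since the source does not describe how MSA scores blocks.

```python
import math

def quadratic_cost_ratio(long_ctx, short_ctx):
    """Relative cost of full attention, assuming cost grows with length squared."""
    return (long_ctx / short_ctx) ** 2

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Softmax attention of a single query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def select_blocks(query, keys, block_size, top_k):
    """Pick the top_k KV blocks, scored by the query's dot product with each
    block's mean key (an assumed scoring rule, not taken from the source).
    Assumes len(keys) is a multiple of block_size."""
    n_blocks = len(keys) // block_size
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / block_size for col in zip(*block)]
        scores.append(dot(query, mean_key))
    ranked = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])

def block_sparse_attend(query, keys, values, block_size, top_k):
    """Attend only to the tokens inside the selected KV blocks."""
    chosen = select_blocks(query, keys, block_size, top_k)
    idx = [i for b in chosen for i in range(b * block_size, (b + 1) * block_size)]
    return attend(query, [keys[i] for i in idx], [values[i] for i in idx])

# 1M vs 32K tokens, taking both as powers of two (1,048,576 and 32,768).
print(quadratic_cost_ratio(1_048_576, 32_768))
```

With power-of-two lengths the ratio is 32² = 1024; with decimal lengths (1,000,000 and 32,000) it is about 977, so "~1000×" holds either way. In the sketch, per-query work scales with `top_k * block_size` rather than with the full sequence length, which is the source of the savings; when `top_k` covers every block, the result is identical to full attention.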
At the kernel level, M3 uses a KV‑outer‑gather‑Q strategy where each block is read once with contiguous memory access, yielding >4× speed‑up over Flash‑Sparse‑Attention and FlashMoBA. Reported gains include:
Prefill stage >9× faster
Decoding stage >15× faster
Overall per‑token cost 1/20 of the prior model
Training with MSA showed no loss spikes and matched full‑attention performance on most tasks, while supporting native 32K+ pre‑training contexts.
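The loop ordering behind a "KV-outer, gather-Q" strategy can be sketched in plain Python. This is an illustration of the general idea only, not the actual kernel: the outer loop walks KV blocks so each block is sliced once, the queries that selected that block are gathered, and a running (online) softmax keeps each query's partial result. The online-softmax accumulation is a standard technique from fused attention kernels and is an assumption here; the function names and the `selected` input format are hypothetical.

```python
import math

def kv_outer_sparse_attention(queries, keys, values, selected, block_size):
    """selected[i] lists the KV block ids chosen for query i (assumed given,
    with at least one block per query)."""
    dim_v = len(values[0])
    scale = 1.0 / math.sqrt(len(queries[0]))

    # Invert the selection: for each KV block, which queries need it?
    queries_for_block = {}
    for qi, blocks in enumerate(selected):
        for b in blocks:
            queries_for_block.setdefault(b, []).append(qi)

    # Online-softmax state per query: running max, denominator, weighted sum.
    run_max = [-math.inf] * len(queries)
    denom = [0.0] * len(queries)
    acc = [[0.0] * dim_v for _ in queries]

    for b in sorted(queries_for_block):               # outer loop over KV blocks
        start = b * block_size
        k_block = keys[start:start + block_size]      # contiguous slice, read once
        v_block = values[start:start + block_size]
        for qi in queries_for_block[b]:               # gathered queries
            q = queries[qi]
            scores = [scale * sum(x * y for x, y in zip(q, k)) for k in k_block]
            new_max = max(run_max[qi], max(scores))
            rescale = math.exp(run_max[qi] - new_max)  # 0.0 on the first block
            denom[qi] *= rescale
            acc[qi] = [a * rescale for a in acc[qi]]
            for s, v in zip(scores, v_block):
                w = math.exp(s - new_max)
                denom[qi] += w
                for j in range(dim_v):
                    acc[qi][j] += w * v[j]
            run_max[qi] = new_max
    return [[a / d for a in row] for row, d in zip(acc, denom)]

def query_outer_reference(queries, keys, values, selected, block_size):
    """Straightforward version: loop over queries, re-reading blocks as needed."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q, blocks in zip(queries, selected):
        idx = [i for b in blocks for i in range(b * block_size, (b + 1) * block_size)]
        scores = [scale * sum(x * y for x, y in zip(q, keys[i])) for i in idx]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([sum(w * values[i][j] for w, i in zip(weights, idx)) / total
                        for j in range(len(values[0]))])
    return outputs
```

Both functions compute the same attention output; the difference is only in access pattern. In the query-outer version a block chosen by many queries is fetched many times, while the KV-outer version touches each block once and pays instead for keeping per-query running state.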
Product Ecosystem and Pricing
MiniMax also released two companion products. MiniMax Code is an agent workflow that decomposes large tasks into concurrent, dynamically adjustable subtasks using a Producer + Verifier loop. The Token Plan has three tiers: Plus (¥49/month for 600 M tokens), Max (¥119/month for 1.8 B tokens), and Ultra (¥469/month for 5.5 B tokens), which the article puts at roughly 15× the token volume of Claude's subscription at the same price.
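To compare the tiers directly, the quoted prices can be converted to a cost per million tokens. The sketch below uses only the figures given above and assumes each tier's full monthly allowance is consumed; the names `TIERS` and `cny_per_million_tokens` are illustrative.

```python
# Token Plan tiers as quoted in the text: (price in CNY per month, tokens per month).
TIERS = {
    "Plus": (49, 600_000_000),
    "Max": (119, 1_800_000_000),
    "Ultra": (469, 5_500_000_000),
}

def cny_per_million_tokens(price_cny, tokens):
    """Effective price per million tokens, assuming the full allowance is used."""
    return price_cny / (tokens / 1_000_000)

for name, (price, tokens) in TIERS.items():
    print(f"{name}: ¥{cny_per_million_tokens(price, tokens):.4f} per million tokens")
```

Under these figures the per-token price does not fall monotonically with tier size: Max works out cheapest at about ¥0.066 per million tokens, with Plus at about ¥0.082 and Ultra at about ¥0.085.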
The authors note that despite the strong performance‑to‑price ratio, the model’s publicity remains modest.
Conclusion
After extensive testing, the authors conclude that MiniMax M3 is the first open‑source model that narrows the gap with leading closed‑source systems, delivering the Frontier trio in a locally deployable package. This shifts developer selection logic and opens possibilities for fine‑tuning and vertical integration.
The model’s technical report and weights are slated for open release within ten days, and the authors will continue monitoring its development.
