MiniMax M3: How a 1M‑Token, Multimodal Agent Reproduces ICLR Research and Automates Kaggle Competitions

MiniMax M3 combines a 1-million-token context window, native multimodal training, and a new MiniMax Sparse Attention architecture that cuts per-token compute to one-twentieth of its predecessor's and speeds up decoding by more than 15×. Its interactive user-simulator training targets fully autonomous agents that can reproduce ICLR 2025 research and run Auto-Kaggle competitions at a fraction of the cost of Western models.

Baobao Algorithm Notes

Jun 2, 2026

I recently experimented with an Auto-Kaggle agent built on Claude's /goal command. Long-horizon planning remains a weakness of domestic (Chinese) models, while foreign models are prohibitively expensive. The newly released MiniMax M3 caught my attention because it looks like the "engineer-grade" agent I need.

MiniMax M3 is the first Chinese model that integrates the so‑called "Frontier three‑suite": engineering‑grade coding/agentic capabilities, a 1‑million‑token context window, and native multimodal support. These three components are designed together rather than being retrofitted.

The model's new architecture, MiniMax Sparse Attention (MSA), processes the attention matrix block by block. The matrix is partitioned into blocks; an outer loop iterates over the blocks, and an inner loop gathers the queries that hit each block. Each block is therefore read once with contiguous memory access, which removes the costly initial scan typical of traditional sparse attention. In practice, MSA is reported to run more than 4× faster than the open-source Flash-Sparse-Attention and Flash-Moba implementations.
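To make the block-wise traversal concrete, here is a minimal NumPy sketch of top-k block-sparse attention. It is not MiniMax's kernel: the block-selection rule (scoring each block by its mean key), the function name `block_sparse_attention`, and the parameters `block_size` and `top_k` are all assumptions made for illustration, and causal masking is omitted. What it does show is the traversal order described above: each key/value block is visited once, and only the queries that selected it are updated, using a running softmax so the result equals dense attention restricted to the selected blocks.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size, top_k):
    """Toy block-sparse attention: each query attends to its top_k key blocks.

    Assumes len(k) is a multiple of block_size and top_k >= 1.
    """
    n_q, d = q.shape
    n_blocks = k.shape[0] // block_size
    scale = 1.0 / np.sqrt(d)

    k_blocks = k.reshape(n_blocks, block_size, d)
    v_blocks = v.reshape(n_blocks, block_size, -1)

    # Assumed selection heuristic: score each block by its mean key vector.
    block_scores = q @ k_blocks.mean(axis=1).T             # (n_q, n_blocks)
    selected = np.argsort(-block_scores, axis=1)[:, :top_k]

    # Running softmax state per query, so blocks can be merged one at a time.
    running_max = np.full(n_q, -np.inf)
    running_sum = np.zeros(n_q)
    out = np.zeros((n_q, v_blocks.shape[2]))

    for b in range(n_blocks):                              # outer loop: each block read once
        hits = np.where((selected == b).any(axis=1))[0]    # queries that chose block b
        if hits.size == 0:
            continue
        s = (q[hits] @ k_blocks[b].T) * scale              # (n_hits, block_size)
        new_max = np.maximum(running_max[hits], s.max(axis=1))
        correction = np.exp(running_max[hits] - new_max)   # rescale earlier partial sums
        p = np.exp(s - new_max[:, None])
        running_sum[hits] = running_sum[hits] * correction + p.sum(axis=1)
        out[hits] = out[hits] * correction[:, None] + p @ v_blocks[b]
        running_max[hits] = new_max

    return out / running_sum[:, None], selected
```

Setting `top_k` to the number of blocks recovers ordinary full attention, which makes a convenient correctness check for the sketch.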

Performance numbers are striking: at a 1M-token context, M3's per-token compute is only 1/20 of its predecessor's; prefill is more than 9× faster and decoding more than 15× faster. In multiple ablation studies, MSA's accuracy matches full attention on most tasks, which suggests a scaling dimension beyond simply adding parameters.

MiniMax also showcased M3 on "Learning Dynamics of LLM Finetuning", which won an ICLR 2025 Outstanding Paper Award. Tasked with reproducing the paper independently, M3 ran autonomously for nearly 12 hours, generating 18 commits and 23 experiment charts. It reproduced the core SFT prediction-probability trends, observed the reported squeezing effect, and successfully validated the Extend mitigation method.
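For readers unfamiliar with the squeezing effect mentioned above, a toy illustration on a single softmax may help. This is my own minimal sketch, not the paper's experiments and not what M3 ran: the logits, learning rate, and function names are invented for the example. It applies gradient descent to the log-probability of an already unlikely label and checks where the freed probability mass goes.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def push_down(logits, label, lr=1.0, steps=20):
    """Lower log p[label] by gradient descent applied directly to the logits."""
    z = np.array(logits, dtype=float)
    for _ in range(steps):
        p = softmax(z)
        grad = -p                  # d log p[label] / d z_i = 1[i == label] - p_i
        grad[label] += 1.0
        z -= lr * grad             # descent step: make `label` less likely
    return softmax(z)

start = np.array([2.0, 1.0, 0.0, -3.0])   # hypothetical logits; label 3 is unlikely
before = softmax(start)
after = push_down(start, label=3)

print("before:", before.round(4))
print("after: ", after.round(4))
```

In this toy setting every other logit rises in proportion to its current probability, so the most likely label gains the most and its lead over the runner-up widens. That is the qualitative pattern the term "squeezing" refers to.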

Training incorporates a novel interactive user-simulator framework. At any dialogue turn, the simulator samples several plausible user responses, expands the conversation through repeated sampling, and scores the entire interaction to evaluate multi-turn quality. This differs from conventional RL setups that reward only the immediate next response. The resulting multi-turn-aware reward encourages the agent to act proactively and to align with real-world usage scenarios.
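The branching-and-scoring idea can be sketched in a few lines of Python. Everything here is hypothetical: `rollout_value`, the toy user, the toy agent, and the scoring rule are stand-ins I made up to show the control flow, not MiniMax's training code. The point is only that the score is computed on whole simulated conversations and averaged over the sampled user branches.

```python
import random
from statistics import mean

def rollout_value(history, agent, user_sim, score, n_branches=3, depth=2, rng=None):
    """Average whole-conversation score over sampled user continuations.

    history: list of (role, message) tuples; depth: simulated user turns left.
    """
    rng = rng or random.Random(0)
    if depth == 0:
        return score(history)                  # score the entire interaction
    values = []
    for _ in range(n_branches):                # sample several plausible user replies
        branch = history + [("user", user_sim(history, rng))]
        branch = branch + [("agent", agent(branch))]
        values.append(
            rollout_value(branch, agent, user_sim, score, n_branches, depth - 1, rng)
        )
    return mean(values)

# Toy stand-ins (assumptions for illustration only).
def toy_user(history, rng):
    return rng.choice(["thanks", "that is wrong", "please continue"])

def toy_agent(history):
    return "ok"

def toy_score(history):
    user_msgs = [msg for role, msg in history if role == "user"]
    complaints = sum(msg == "that is wrong" for msg in user_msgs)
    return 1.0 - complaints / max(1, len(user_msgs))

value = rollout_value([("agent", "How can I help?")], toy_agent, toy_user, toy_score)
print(round(value, 3))
```

With `n_branches=3` and `depth=2` the sketch scores nine complete conversations. A real system would replace the toy functions with model calls and a learned or rule-based judge.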

Multimodal capability is native rather than an afterthought. M3 was trained from step 0 on interleaved text-image, caption, and video data, with the multimodal pre-training corpus scaled to the 100 TB level. This unified pipeline lets the model understand figures, formulas, and code within papers, and handle long-horizon multimodal tasks without a semantic gap between text and vision encoders.

To demonstrate practical use, I set up an Auto-Kaggle workflow with M3. After installing the Kaggle CLI (pip install kaggle) and adding an API token, I gave the model a natural-language instruction to compete in a bike-demand-prediction challenge. The model started with sklearn, switched to XGBoost and LightGBM, performed feature engineering, and produced competitive results, all without human intervention. The token cost was roughly 1/15 of a Claude subscription.
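As a rough idea of the kind of feature engineering and scoring involved, here is a small sketch. It is not the code M3 generated. I am assuming the task resembles Kaggle's Bike Sharing Demand competition, which provides hourly timestamps and is scored by RMSLE; the helper names and the specific features are my own choices.

```python
import numpy as np
from datetime import datetime

def time_features(timestamp):
    """Expand a 'YYYY-MM-DD HH:MM:SS' string into calendar and cyclic features."""
    t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    angle = 2 * np.pi * t.hour / 24            # encode hour on a circle so 23h is near 0h
    return {
        "hour_sin": float(np.sin(angle)),
        "hour_cos": float(np.cos(angle)),
        "weekday": t.weekday(),                # Monday == 0
        "is_weekend": int(t.weekday() >= 5),
        "month": t.month,
    }

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; negative predictions are clipped to 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

print(time_features("2011-01-01 06:00:00"))
print(rmsle([10, 20, 30], [12, 18, 33]))
```

Features like these can be fed to any of the regressors the agent tried. Because RMSLE compares logarithms, training on `log1p(count)` and inverting with `expm1` is a common choice for this kind of target.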

Overall, MiniMax M3 showcases a scalable sparse‑attention design, massive context handling, and native multimodality, coupled with a user‑simulator training loop that yields truly autonomous agents capable of complex, long‑term tasks such as reproducing research and running Kaggle competitions.

Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
