How MiroMind‑M1 Sets New Benchmarks in Open‑Source Math Reasoning
This article presents MiroMind‑M1, an open‑source family of math‑reasoning language models built on a 719K‑example high‑quality SFT dataset and a new reinforcement‑learning algorithm, CAMPO. Evaluations on AIME24, AIME25, and MATH‑500 show leading results among open‑source models of comparable size, with reduced token usage.
Background
Mathematical reasoning requires multi‑step logical and abstract thinking, making it a stringent benchmark for evaluating reasoning language models (RLMs). Recent advances in large language models have shifted focus from pure text generation to sophisticated cross‑domain reasoning.
MiroMind‑M1 Overview
MiroMind‑M1 is an open‑source family of math‑reasoning models released by MiroMind (Jizhi Evolution). The entire training pipeline—including data, code, model checkpoints, and evaluation scripts—is publicly available. The code repository is hosted at https://github.com/MiroMindAsia/MiroMind-M1 and model checkpoints are on Hugging Face.
SFT Data Construction
The supervised‑fine‑tuning (SFT) stage uses a curated dataset of 719K high‑quality math‑reasoning problems with verified chain‑of‑thought (CoT) annotations. The problems are drawn from OpenR1, OpenThoughts, Light‑R1, and Synthetic‑1. Extensive deduplication and removal of benchmark‑contaminated items leave a clean training foundation.
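The article does not say how deduplication and contamination removal were carried out. The following is a minimal sketch of one common approach, assuming exact‑match deduplication after text normalization and a word‑level n‑gram overlap check against benchmark problems. The function names and the 8‑gram threshold are illustrative choices, not taken from the MiroMind‑M1 pipeline.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of the normalized text (empty if fewer than n words)."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(problems: list) -> list:
    """Keep the first occurrence of each problem, compared after normalization."""
    seen, kept = set(), []
    for p in problems:
        key = hashlib.sha256(normalize(p).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept


def decontaminate(problems: list, benchmark: list, n: int = 8) -> list:
    """Drop any problem sharing at least one word n-gram with a benchmark item.

    Assumption: n=8 is an illustrative threshold, not the value used by the authors.
    """
    bench_grams = set()
    for b in benchmark:
        bench_grams |= ngrams(b, n)
    return [p for p in problems if not (ngrams(p, n) & bench_grams)]
```

Problems shorter than n words produce no n‑grams and therefore always pass this overlap check, so a real pipeline would likely pair it with an exact‑match comparison against the benchmark set.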
Dataset link: https://huggingface.co/datasets/miromind-ai/MiroMind-M1-SFT-719K
Experimental results (Table 2) show that MiroMind‑M1‑SFT‑7B achieves:
AIME24: 60.5
AIME25: 45.0
MATH‑500: 94.6
These scores surpass other open‑source SFT models of comparable size and even exceed the DeepSeek‑R1 distilled 7B model.
CAMPO Reinforcement Learning
The reinforcement‑learning (RL) stage introduces the Context‑Aware Multi‑Stage Policy Optimization (CAMPO) algorithm. CAMPO gradually expands the permissible context length across training stages while applying an adaptive redundancy penalty that discourages early token repetition.
Two complementary mechanisms:
Stage‑wise length progression: the model learns long‑chain reasoning under controlled compute by increasing the maximum context length in successive phases.
Redundancy penalty: during reward calculation, repetitions that appear early incur a heavy penalty, whereas repetitions near the end receive a milder penalty, encouraging concise yet complete answers.
The overall training objective balances these factors to produce efficient, high‑quality reasoning outputs.
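The article gives no formula for either mechanism, so the following is only a sketch of how they might fit together. It assumes the penalty scales linearly with how early the first repeated n‑gram appears, and that each stage simply raises a cap on response length. The function names, the 4‑gram window, the 0.5 weight, and the stage caps are all hypothetical, not CAMPO's actual definitions.

```python
def first_repetition_index(tokens, n=4):
    """Index where the first repeated n-gram starts, or None if none repeats."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return i
        seen.add(gram)
    return None


def redundancy_penalty(tokens, n=4, max_penalty=1.0):
    """Penalty in [0, max_penalty]; larger when the first repetition is earlier.

    Assumption: linear decay with position, chosen only for illustration.
    """
    idx = first_repetition_index(tokens, n)
    if idx is None:
        return 0.0
    return max_penalty * (1.0 - idx / len(tokens))


def shaped_reward(is_correct, tokens, weight=0.5):
    """Correctness reward minus the weighted redundancy penalty (weight is hypothetical)."""
    return (1.0 if is_correct else 0.0) - weight * redundancy_penalty(tokens)


# Hypothetical stage schedule: the maximum response length grows stage by stage.
STAGE_MAX_TOKENS = [16_384, 32_768, 49_152]


def max_tokens_for_stage(stage):
    """Length cap for a 0-indexed stage; later stages reuse the final cap."""
    return STAGE_MAX_TOKENS[min(stage, len(STAGE_MAX_TOKENS) - 1)]
```

Under this sketch, a correct response that starts looping a quarter of the way through loses more reward than one that only repeats itself in its last few tokens, which matches the early‑heavy, late‑mild behaviour described above.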
RL Experimental Results
For the 32B model, CAMPO‑enhanced RL yields:
AIME24 improvement: +6.7% over same‑size baselines
AIME25 improvement: +13.5% over same‑size baselines
Token consumption reduced by ~20% while maintaining accuracy
The 32B model still trails the latest Skywork‑OR1‑32B‑Preview on AIME25, likely due to training exclusively on pure math data.
The 7B RL model, built on the SFT checkpoint, achieves the best performance among Qwen‑2.5‑based open‑source models of the same scale.
Ablation studies confirm that the redundancy penalty shortens answer length without harming accuracy and stabilizes RL training curves.
Resources
Code repository: https://github.com/MiroMindAsia/MiroMind-M1
Model checkpoints:
SFT 7B: https://huggingface.co/miromind-ai/MiroMind-M1-SFT-7B
RL 7B: https://huggingface.co/miromind-ai/MiroMind-M1-RL-7B
RL 32B: https://huggingface.co/miromind-ai/MiroMind-M1-RL-32B
Datasets:
SFT data: https://huggingface.co/datasets/miromind-ai/MiroMind-M1-SFT-719K
RL data: https://huggingface.co/datasets/miromind-ai/MiroMind-M1-RL-62K
Paper (arXiv): https://arxiv.org/abs/2507.14683
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
