How MiroMind‑M1 Sets New Benchmarks in Open‑Source Math Reasoning

The article presents MiroMind‑M1, an open‑source math‑reasoning language model that combines a high‑quality SFT dataset of 719K examples, a novel reinforcement‑learning algorithm called CAMPO, and extensive evaluations on AIME24, AIME25, and MATH‑500, demonstrating state‑of‑the‑art performance while reducing token usage.

Data Party THU

Background

Mathematical reasoning requires multi‑step logical and abstract thinking, making it a stringent benchmark for evaluating reasoning language models (RLMs). Recent advances in large language models have shifted focus from pure text generation to sophisticated cross‑domain reasoning.

MiroMind‑M1 Overview

MiroMind‑M1 is an open‑source family of math‑reasoning models released by MiroMind (Jizhi Evolution). The entire training pipeline—including data, code, model checkpoints, and evaluation scripts—is publicly available. The code repository is hosted at https://github.com/MiroMindAsia/MiroMind-M1 and model checkpoints are on Hugging Face.

[Figure: MiroMind‑M1 overview]

SFT Data Construction

The supervised‑fine‑tuning (SFT) stage uses a curated dataset of 719K high‑quality math‑reasoning problems with verified chain‑of‑thought (CoT) annotations. Sources include OpenR1, OpenThoughts, Light‑R1, and Synthetic‑1. The pooled data is extensively deduplicated and decontaminated against the evaluation benchmarks, so the final dataset provides a clean training foundation.
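The deduplication and contamination removal mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the 8‑word n‑gram window, and the normalization rule are all assumptions chosen for illustration. It drops exact duplicates after normalization and any training problem that shares a word n‑gram with a benchmark question.

```python
import re
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams of the normalized text."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def clean_sft_pool(problems: Iterable[str], benchmark: Iterable[str], n: int = 8) -> list:
    """Hypothetical cleaner: drop normalized duplicates and any problem
    sharing at least one word n-gram with a benchmark question."""
    banned = set()
    for question in benchmark:
        banned |= ngrams(question, n)

    seen = set()
    kept = []
    for problem in problems:
        key = normalize(problem)
        if key in seen:
            continue  # duplicate of an earlier problem
        seen.add(key)
        if ngrams(problem, n) & banned:
            continue  # overlaps with an evaluation question
        kept.append(problem)
    return kept
```

Problems shorter than the n‑gram window produce no n‑grams and are therefore never flagged as contaminated, so a real pipeline would likely need a separate check for short items.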

Dataset link:

https://huggingface.co/datasets/miromind-ai/MiroMind-M1-SFT-719K

Experimental results (Table 2) show that MiroMind‑M1‑SFT‑7B achieves:

AIME24: 60.5

AIME25: 45.0

MATH‑500: 94.6

These scores surpass other open‑source SFT models of comparable size and even exceed the DeepSeek‑R1 distilled 7B model.

[Figure: SFT performance]

CAMPO Reinforcement Learning

The reinforcement‑learning (RL) stage introduces the Context‑Aware Multi‑Stage Policy Optimization (CAMPO) algorithm. CAMPO gradually expands the permissible context length across training stages while applying an adaptive redundancy penalty that penalizes repetition more heavily the earlier it appears in a response.

Two complementary mechanisms:

Stage‑wise length progression: the model learns long‑chain reasoning under controlled compute by increasing the maximum context length in successive phases.

Redundancy penalty: during reward calculation, repetitions that appear early incur a heavy penalty, whereas repetitions near the end receive a milder penalty, encouraging concise yet complete answers.
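One way to picture the position‑dependent redundancy penalty is the sketch below. It is an illustration under stated assumptions, not CAMPO's actual repetition detector: the n‑gram repeat test, the linear decay, the `max_penalty` and `weight` parameters, and all function names are hypothetical.

```python
from typing import Optional, Sequence


def first_repeat_position(tokens: Sequence[str], n: int = 4) -> Optional[int]:
    """Index at which an already-seen n-gram first starts again, or None."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return i
        seen.add(gram)
    return None


def repetition_penalty(tokens: Sequence[str], n: int = 4, max_penalty: float = 1.0) -> float:
    """Penalty in [0, max_penalty]: large if repetition starts early, small if late."""
    pos = first_repeat_position(tokens, n)
    if pos is None:
        return 0.0
    # Assumed linear decay with the relative position of the first repeat.
    return max_penalty * (1.0 - pos / len(tokens))


def shaped_reward(is_correct: bool, tokens: Sequence[str], weight: float = 0.5) -> float:
    """Correctness reward (0 or 1) minus a weighted repetition penalty."""
    return float(is_correct) - weight * repetition_penalty(tokens)
```

With this shape, a response that starts looping a fifth of the way in loses far more reward than one that repeats itself only in its closing lines, which matches the early‑heavy, late‑mild behavior described above.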

The overall training objective balances these factors to produce efficient, high‑quality reasoning outputs.
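The stage‑wise length progression can be sketched as a simple schedule lookup. The stage boundaries, token budgets, and the rule that an over‑budget response earns zero reward are placeholders assumed for illustration; the article does not state the actual values or how truncated responses are scored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    max_response_tokens: int  # generation budget during this stage
    num_steps: int            # number of RL updates spent in this stage


# Hypothetical schedule: these numbers are placeholders, not the paper's settings.
SCHEDULE = [Stage(4096, 100), Stage(8192, 100), Stage(16384, 100)]


def budget_at_step(step: int, schedule=SCHEDULE) -> int:
    """Return the response-length budget in force at a global training step."""
    boundary = 0
    for stage in schedule:
        boundary += stage.num_steps
        if step < boundary:
            return stage.max_response_tokens
    return schedule[-1].max_response_tokens  # stay at the final budget


def stage_reward(is_correct: bool, response_tokens: int, step: int) -> float:
    """Assumed rule: a response longer than the current budget would be cut
    off before its final answer, so it scores 0; otherwise score correctness."""
    if response_tokens > budget_at_step(step):
        return 0.0
    return float(is_correct)
```

For example, a correct 6,000‑token solution scores 0 during the first placeholder stage (budget 4,096) and 1 once training reaches the second stage (budget 8,192), which is how a staged budget keeps early training cheap while still letting long reasoning chains pay off later.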

[Figure: CAMPO algorithm diagram]

RL Experimental Results

For the 32B model, CAMPO‑enhanced RL yields:

AIME24 improvement: +6.7% over same‑size baselines

AIME25 improvement: +13.5% over same‑size baselines

Token consumption reduced by ~20% while maintaining accuracy
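For readers who want to reproduce this kind of comparison on their own runs, the two quantities reported above (benchmark accuracy and relative token savings) can be computed as in the sketch below. The helper names are hypothetical, and the numbers in the usage note are toy values, not results from the paper.

```python
from statistics import mean


def accuracy(correct_flags):
    """Percentage of sampled answers judged correct."""
    return 100.0 * mean(1.0 if flag else 0.0 for flag in correct_flags)


def relative_token_reduction(baseline_lengths, new_lengths):
    """Fractional drop in mean response length relative to the baseline."""
    base = mean(baseline_lengths)
    return (base - mean(new_lengths)) / base
```

With toy inputs, `accuracy([True, False, True, True])` gives 75.0, and `relative_token_reduction([1000, 1000], [800, 800])` gives 0.2, which corresponds to a 20% reduction in average response length.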

The 32B model still trails the latest Skywork‑OR1‑32B‑Preview on AIME25, likely because MiroMind‑M1 was trained exclusively on math data.

The 7B RL model, built on the SFT checkpoint, achieves the best performance among Qwen‑2.5‑based open‑source models of the same scale.

Ablation studies confirm that the redundancy penalty shortens answer length without harming accuracy and stabilizes RL training curves.

[Figure: RL performance]

Resources

Code repository:

https://github.com/MiroMindAsia/MiroMind-M1

Model checkpoints:

SFT 7B: https://huggingface.co/miromind-ai/MiroMind-M1-SFT-7B

RL 7B: https://huggingface.co/miromind-ai/MiroMind-M1-RL-7B

RL 32B: https://huggingface.co/miromind-ai/MiroMind-M1-RL-32B

Datasets:

SFT data:

https://huggingface.co/datasets/miromind-ai/MiroMind-M1-SFT-719K

RL data:

https://huggingface.co/datasets/miromind-ai/MiroMind-M1-RL-62K

Paper (arXiv):

https://arxiv.org/abs/2507.14683