MathForge: Leveraging Hard Problems in RL to Boost Large‑Model Mathematical Reasoning (ICLR 2026)

MathForge tackles the long‑standing question of which math problems deserve focus in reinforcement‑learning‑based training, introducing a difficulty‑aware optimizer (DGPO) and multi‑aspect question reformulation (MQR) that together prioritize harder‑but‑learnable questions, yielding consistent performance gains across model sizes and modalities.

Data Party THU

Problem

In reinforcement learning with verifiable rewards for large‑model mathematical reasoning, the open question is which questions should receive more training emphasis. Easy questions give limited benefit; unsolvable questions provide weak learning signals; the most valuable are harder‑but‑learnable questions.

Limitations of Existing Methods

Group Relative Policy Optimization (GRPO) normalizes each advantage by the standard deviation of the rewards within a group, which concentrates update strength on medium‑difficulty questions and suppresses both very easy and very hard ones. Existing data‑augmentation methods either generate entirely new problems, which risks incorrect answers for high‑difficulty competition problems, or merely paraphrase existing questions without increasing their intrinsic difficulty.
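The bias toward medium difficulty can be checked numerically. The sketch below assumes binary 0/1 rewards and a group of 8 rollouts per question. The function names are illustrative and are not taken from the MathForge code base. It sums the absolute standard‑deviation‑normalized advantages as a rough proxy for how much update strength one question receives.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages normalized by the standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def total_update_magnitude(num_correct, group_size=8):
    """Sum of |advantage| over one group with `num_correct` correct rollouts."""
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    return float(np.abs(grpo_advantages(rewards)).sum())

if __name__ == "__main__":
    for k in range(9):
        print(f"correct={k}/8  total |A| = {total_update_magnitude(k):.2f}")
```

Under these assumptions the total equals 2G·sqrt(p(1−p)) for pass rate p and group size G, so it peaks at p = 0.5 and shrinks as a question becomes very easy or very hard.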

MathForge Framework

MathForge combines two complementary components: Difficulty‑aware Group Policy Optimization (DGPO) on the algorithm side and Multi‑Aspect Question Reformulation (MQR) on the data side.

DGPO – Balancing and Re‑weighting Updates

DGPO consists of Difficulty‑balanced Group Advantage Estimation (DGAE) and Difficulty‑aware Question‑level Weighting (DQW).

DGAE replaces the standard‑deviation normalization in GRPO with Mean Absolute Deviation (MAD). The paper proves (Theorem 2) that, under binary correctness rewards, DGAE yields a constant total update magnitude for every question, eliminating the bias toward medium difficulty.
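A minimal sketch of MAD normalization, assuming binary 0/1 rewards and a group of 8 rollouts. The function names are hypothetical and this is not the authors' implementation. With pass rate p, the MAD of a binary group is 2p(1−p), so each correct response receives 1/(2p) and each incorrect one −1/(2(1−p)). The absolute advantages then sum to the group size G whenever the group contains both outcomes.

```python
import numpy as np

def dgae_advantages(rewards, eps=1e-8):
    """Group advantages normalized by the mean absolute deviation (MAD)."""
    r = np.asarray(rewards, dtype=float)
    centered = r - r.mean()
    mad = np.abs(centered).mean()
    return centered / (mad + eps)

def total_update_magnitude(num_correct, group_size=8):
    """Sum of |advantage| over one group with `num_correct` correct rollouts."""
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    return float(np.abs(dgae_advantages(rewards)).sum())

if __name__ == "__main__":
    for k in range(9):
        print(f"correct={k}/8  total |A| = {total_update_magnitude(k):.2f}")
```

Groups where every rollout is correct or every rollout is wrong still produce zero advantage, since no within-group contrast exists to learn from.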

DQW estimates difficulty from the current average accuracy of each question and assigns higher weights to harder yet still learnable items. The weighting formula is given in the paper.
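To make the idea concrete, here is one possible shape for such a weighting. This summary does not reproduce the paper's formula, so the softmax form, the `temperature` parameter, and the masking of questions with 0% or 100% accuracy are all assumptions made for illustration.

```python
import numpy as np

def question_weights(mean_accuracy, temperature=1.0):
    """Hypothetical difficulty-aware weights: lower accuracy -> larger weight.

    This is an illustrative stand-in, not the DQW formula from the paper.
    """
    acc = np.asarray(mean_accuracy, dtype=float)
    # Assumption: questions solved never or always give no group-relative
    # signal, so they receive zero weight.
    learnable = (acc > 0.0) & (acc < 1.0)
    weights = np.zeros_like(acc)
    if not learnable.any():
        return weights
    logits = -acc[learnable] / temperature
    logits -= logits.max()  # numerical stability
    soft = np.exp(logits) / np.exp(logits).sum()
    # Rescale so the learnable questions have mean weight 1.
    weights[learnable] = soft * learnable.sum()
    return weights

if __name__ == "__main__":
    print(question_weights([0.0, 0.125, 0.5, 0.875, 1.0], temperature=0.5))
```

In this sketch a smaller `temperature` tilts the batch more strongly toward low‑accuracy questions, while a large one approaches uniform weighting.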

MQR – Hardening Questions While Preserving Answers

MQR applies three orthogonal transformations to each original question, guaranteeing that the gold answer remains unchanged.

Background: Add seemingly relevant but distracting context, forcing the model to isolate the true mathematical conditions.

Term: Introduce new abstract terminology for core concepts, preventing reliance on surface forms.

Sub‑Problem: Split a key numeric condition into a prerequisite sub‑problem, extending the reasoning chain.

All transformations are systematic, ensuring the reformulated question is harder while the answer stays identical.
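The three strategies could be organized as rewrite requests along the following lines. This sketch assumes the rewriting is delegated to some text‑rewriting model. The instruction wording, the `Reformulation` class, and `build_reformulation_requests` are hypothetical and do not come from the MathForge repository. The point it illustrates is that the gold answer is carried over untouched while only the question text changes.

```python
from dataclasses import dataclass

# Hypothetical instruction templates; the wording is illustrative only.
STRATEGIES = {
    "background": (
        "Rewrite the question by embedding it in a longer scenario with "
        "plausible but irrelevant details. Do not change the answer."
    ),
    "term": (
        "Rewrite the question by defining a new abstract term for its core "
        "concept and using that term throughout. Do not change the answer."
    ),
    "sub_problem": (
        "Replace one key numeric condition with a small sub-problem whose "
        "solution is that number. Do not change the answer."
    ),
}

@dataclass(frozen=True)
class Reformulation:
    strategy: str
    prompt: str
    gold_answer: str  # copied unchanged from the original item

def build_reformulation_requests(question, gold_answer):
    """Return one rewrite request per strategy, all sharing the gold answer."""
    return [
        Reformulation(
            strategy=name,
            prompt=f"{instruction}\n\nOriginal question:\n{question}",
            gold_answer=gold_answer,
        )
        for name, instruction in STRATEGIES.items()
    ]

if __name__ == "__main__":
    for req in build_reformulation_requests("What is 2 + 3?", "5"):
        print(req.strategy, "->", req.gold_answer)
```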

Experimental Evaluation

Across multiple benchmarks the authors report:

DGPO alone outperforms the strong GRPO baseline.

MQR alone also surpasses GRPO.

Combining DGPO and MQR (full MathForge) yields the best performance, improving average scores by more than 4.5 points over GRPO.

Gains are consistent across model scales, from small models up to 7B parameters, and across different backbone architectures.

Ablation studies confirm that both DGAE and DQW are necessary and complementary.

DGPO can be plugged into various existing RL optimizers, providing additional improvements.

The difficulty‑aware principle also benefits multimodal math‑reasoning tasks, with gains exceeding 2 points.

Controlled experiments equalizing total training steps show that MQR’s benefit is not merely from increased data volume; reformulated data consistently outperforms the original.

Training‑dynamics visualizations show that DGPO leads to more accurate and concise model outputs, while MQR results in lower training accuracy but higher test performance, illustrating the “train harder, test better” effect.

Conclusion

MathForge identifies “hard‑but‑learnable” questions as the high‑value training signal. DGPO balances update magnitudes across difficulty and re‑weights toward harder learnable items; MQR reliably generates such questions without altering the gold answer. The combined approach converts harder training into stronger reasoning.

Paper: https://arxiv.org/abs/2601.20614

Code: https://github.com/AMAP-ML/MathForge

Source: 机器之心