MathForge: Leveraging Hard Problems in RL to Boost Large‑Model Mathematical Reasoning (ICLR 2026)
MathForge tackles the long‑standing question of which math problems deserve focus in reinforcement‑learning‑based training. It introduces a difficulty‑aware optimizer (DGPO) and multi‑aspect question reformulation (MQR) that together prioritize harder‑but‑learnable questions, yielding consistent performance gains across model sizes and modalities.
Problem
In reinforcement learning with verifiable rewards for large‑model mathematical reasoning, the open question is which questions should receive more training emphasis. Easy questions give limited benefit; unsolvable questions provide weak learning signals; the most valuable are harder‑but‑learnable questions.
Limitations of Existing Methods
Group Relative Policy Optimization (GRPO) normalizes each advantage by the standard deviation of the group's rewards, which concentrates update strength on medium‑difficulty questions and suppresses both very easy and very hard ones. Existing data‑augmentation methods either generate entirely new problems, which risks incorrect answers for high‑difficulty competition problems, or merely paraphrase existing questions without increasing their intrinsic difficulty.
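The bias toward medium difficulty can be made concrete with a short derivation. The sketch below assumes G sampled responses per question, binary rewards, and the population standard deviation. Reading "update strength" as the sum of absolute advantages is an interpretation made here, not a definition quoted from the paper.

```latex
% Assumptions (not from the source text): G responses per question,
% binary rewards r_i in {0,1}, empirical accuracy p with 0 < p < 1,
% population standard deviation sigma.
\begin{align*}
p &= \frac{1}{G}\sum_{i=1}^{G} r_i, \qquad \sigma = \sqrt{p(1-p)}, \\
A_i &= \frac{r_i - p}{\sigma} =
\begin{cases}
(1-p)/\sigma & \text{if } r_i = 1, \\
-p/\sigma    & \text{if } r_i = 0,
\end{cases} \\
\sum_{i=1}^{G} |A_i| &= \frac{Gp(1-p) + G(1-p)p}{\sigma} = 2G\sqrt{p(1-p)}.
\end{align*}
```

Under these assumptions the total peaks at p = 1/2, where it equals G, and shrinks toward zero as p approaches 0 or 1, so very hard and very easy questions both receive less total update than medium ones.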
MathForge Framework
MathForge combines two complementary components: Difficulty‑aware Group Policy Optimization (DGPO) on the algorithm side and Multi‑Aspect Question Reformulation (MQR) on the data side.
DGPO – Balancing and Re‑weighting Updates
DGPO consists of Difficulty‑balanced Group Advantage Estimation (DGAE) and Difficulty‑aware Question‑level Weighting (DQW).
DGAE replaces the standard‑deviation normalization in GRPO with normalization by the Mean Absolute Deviation (MAD). The paper proves (Theorem 2) that, with binary correctness rewards, DGAE yields a constant total update magnitude for every question, eliminating the bias toward medium difficulty.
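The effect of swapping the normalizer can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names, the `eps` stabilizer, and the reading of "total update magnitude" as the sum of absolute advantages are all assumptions made for illustration.

```python
from math import sqrt


def _center(rewards):
    """Subtract the group mean from each reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def std_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: centered rewards divided by the population std."""
    centered = _center(rewards)
    std = sqrt(sum(c * c for c in centered) / len(centered))
    return [c / (std + eps) for c in centered]


def mad_normalized_advantages(rewards, eps=1e-8):
    """MAD-style advantages: centered rewards divided by the mean absolute deviation."""
    centered = _center(rewards)
    mad = sum(abs(c) for c in centered) / len(centered)
    return [c / (mad + eps) for c in centered]


def total_magnitude(advantages):
    """Sum of absolute advantages (assumed proxy for total update magnitude)."""
    return sum(abs(a) for a in advantages)


# Compare the two normalizers on a hard, a medium, and an easy question.
G = 8  # hypothetical group size
for n_correct in (1, 4, 7):
    rewards = [1.0] * n_correct + [0.0] * (G - n_correct)
    std_total = total_magnitude(std_normalized_advantages(rewards))
    mad_total = total_magnitude(mad_normalized_advantages(rewards))
    print(f"correct={n_correct}/{G}  std-normalized={std_total:.3f}  MAD-normalized={mad_total:.3f}")
```

With MAD normalization the sum of absolute advantages equals the group size by construction whenever the rewards are not all identical, whereas the standard‑deviation version is largest when half the responses are correct.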
DQW estimates difficulty from the current average accuracy of each question and assigns higher weights to harder yet still learnable items. The weighting formula is given in the paper.
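To illustrate the idea of accuracy‑based weighting, here is a hypothetical scheme. It is not the paper's formula, which this summary does not reproduce. The softmax form, the `temperature` parameter, the zero weight for questions with accuracy 0 or 1, and the rescaling to an average weight of 1 are all assumptions chosen for the sketch.

```python
from math import exp


def difficulty_weights(accuracies, temperature=1.0):
    """Hypothetical question-level weights (illustrative, not the paper's DQW formula).

    Questions with mixed outcomes (0 < accuracy < 1) get a softmax weight over
    (1 - accuracy) / temperature, so lower accuracy means a larger weight.
    Questions with accuracy exactly 0 or 1 get weight 0 in this sketch.
    Weights are rescaled so the learnable questions average to 1.
    """
    learnable = [i for i, a in enumerate(accuracies) if 0.0 < a < 1.0]
    weights = [0.0] * len(accuracies)
    if not learnable:
        return weights

    scores = [(1.0 - accuracies[i]) / temperature for i in learnable]
    top = max(scores)  # subtract the max for numerical stability
    exps = [exp(s - top) for s in scores]
    total = sum(exps)
    for i, e in zip(learnable, exps):
        weights[i] = len(learnable) * e / total
    return weights


# Four questions: unsolved, hard, easy, fully solved.
print(difficulty_weights([0.0, 0.25, 0.75, 1.0]))
```

A lower `temperature` sharpens the preference for hard questions, while a very high one approaches uniform weighting over the learnable set.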
MQR – Hardening Questions While Preserving Answers
MQR applies three orthogonal transformations to each original question, guaranteeing that the gold answer remains unchanged.
Background: Add seemingly relevant but distracting context, forcing the model to isolate the true mathematical conditions.
Term: Introduce new abstract terminology for core concepts, preventing reliance on surface forms.
Sub‑Problem: Split a key numeric condition into a prerequisite sub‑problem, extending the reasoning chain.
All transformations are systematic, ensuring the reformulated question is harder while the answer stays identical.
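The three aspects can be pictured as rewrite requests that all carry the original gold answer forward. The sketch below assumes the rewriting is delegated to some language model, which this summary does not specify. The instruction wording, the `ReformulationRequest` class, and `build_requests` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical instruction text for each aspect; the wording is an assumption.
ASPECT_INSTRUCTIONS = {
    "background": (
        "Rewrite the question with added background context that looks relevant "
        "but does not change the mathematical conditions or the final answer."
    ),
    "term": (
        "Rewrite the question by introducing a newly defined abstract term for a "
        "core concept, without changing the final answer."
    ),
    "sub_problem": (
        "Rewrite the question so that one key numeric condition must first be "
        "derived from a prerequisite sub-problem, without changing the final answer."
    ),
}


@dataclass(frozen=True)
class ReformulationRequest:
    aspect: str       # which of the three transformations to apply
    prompt: str       # text that would be sent to the rewriting model
    gold_answer: str  # copied unchanged from the original question


def build_requests(question, gold_answer):
    """Create one rewrite request per aspect, each keeping the original gold answer."""
    return [
        ReformulationRequest(
            aspect=aspect,
            prompt=f"{instruction}\n\nOriginal question:\n{question}",
            gold_answer=gold_answer,
        )
        for aspect, instruction in ASPECT_INSTRUCTIONS.items()
    ]


for request in build_requests("What is the sum of the first 10 positive integers?", "55"):
    print(request.aspect, "->", request.gold_answer)
```

Because the gold answer is copied rather than regenerated, the verifier used for the original question can score responses to each reformulated version, provided the rewrite really preserves the answer.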
Experimental Evaluation
Across multiple benchmarks the authors report:
DGPO alone outperforms the strong GRPO baseline.
MQR alone also surpasses GRPO.
Combining DGPO and MQR (full MathForge) yields the best performance, improving average scores by more than 4.5 points over GRPO.
Gains are consistent across model scales, from small models up to 7B parameters, and across different backbone architectures.
Ablation studies confirm that both DGAE and DQW are necessary and complementary.
DGPO can be plugged into various existing RL optimizers, providing additional improvements.
The difficulty‑aware principle also benefits multimodal math‑reasoning tasks, with gains exceeding 2 points.
Controlled experiments equalizing total training steps show that MQR’s benefit is not merely from increased data volume; reformulated data consistently outperforms the original.
Training‑dynamics visualizations show that DGPO leads to more accurate and concise model outputs, while MQR results in lower training accuracy but higher test performance, illustrating the “train harder, test better” effect.
Conclusion
MathForge identifies “hard‑but‑learnable” questions as the high‑value training signal. DGPO balances update magnitudes across difficulty and re‑weights toward harder learnable items; MQR reliably generates such questions without altering the gold answer. The combined approach converts harder training into stronger reasoning.
Paper: https://arxiv.org/abs/2601.20614
Code: https://github.com/AMAP-ML/MathForge
Source: 机器之心
