How MathForge Uses Hard Problems to Boost Large‑Model Mathematical Reasoning via Reinforcement Learning
MathForge tackles the overlooked issue of training large language models on mathematically challenging yet learnable problems by introducing Difficulty‑Aware Group Policy Optimization (DGPO) and Multi‑Aspect Question Reformulation (MQR), achieving consistent gains across model sizes and modalities.
1. Why existing methods ignore hard problems
Reinforcement Learning with Verifiable Rewards (RLVR) has become a key approach for improving mathematical reasoning in large models because it directly checks answer correctness without extra reward models. However, current methods neglect hard questions for two reasons.
Algorithmic level: The widely used Group Relative Policy Optimization (GRPO) compares multiple answers to the same question and updates based on relative advantage. Theory shows GRPO’s update strength is biased toward medium‑difficulty questions, while updates for overly easy or overly hard questions are suppressed. Consequently, the most valuable “hard‑but‑learnable” problems receive insufficient training signal (a numerical sketch of this bias appears after the next paragraph).
Data level: Existing data‑augmentation techniques either generate entirely new questions—often failing to guarantee answer quality for high‑difficulty competition problems—or simply rephrase original questions, which does not increase intrinsic difficulty.
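To make the algorithmic bias concrete, here is a minimal numerical sketch (not the paper’s code) of GRPO‑style group‑relative advantages under binary correctness rewards. The group size and the use of the summed absolute advantage as a proxy for a question’s total update strength are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): GRPO-style group-relative advantages
# for binary correctness rewards, with the sum of absolute advantages used as
# a rough proxy for a question's total update strength.
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantage: (reward - group mean) / group std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

group_size = 8
for n_correct in [1, 2, 4, 6, 7]:
    # One question whose sampled answers are correct n_correct times out of 8.
    rewards = np.array([1.0] * n_correct + [0.0] * (group_size - n_correct))
    total_update = np.abs(grpo_advantages(rewards)).sum()
    print(f"accuracy={n_correct / group_size:.3f}  total |advantage| = {total_update:.2f}")
# The proxy peaks at accuracy 0.5 and shrinks toward 0 or 1, so very hard
# (and very easy) questions receive comparatively weak updates.
```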
2. MathForge: Dual‑sided improvement
To address both shortcomings, the authors propose MathForge, a framework consisting of two core components:
DGPO (Difficulty‑Aware Group Policy Optimization): balances update strength across difficulties and then re‑weights harder, learnable questions.
MQR (Multi‑Aspect Question Reformulation): systematically makes questions harder while keeping the gold answer unchanged.
2.1 DGPO – Let “hard‑but‑learnable” questions be truly learned
DGPO follows a two‑step process: balance first, then re‑weight.
Step 1 – DGAE (Difficulty‑balanced Group Advantage Estimation): The authors identify that GRPO’s advantage normalization leads to unequal update magnitudes across difficulties. They replace the standard‑deviation normalization with the Mean Absolute Deviation (MAD), measuring a question’s total update as the sum of the absolute advantages of all its sampled answers. Theorem 1 proves that, under binary correctness rewards, GRPO’s total update for a question is proportional to G·√(p(1−p)), where G is the sampling count and p the question’s accuracy. This magnitude peaks when accuracy is 0.5 and declines toward 0 or 1, meaning medium‑difficulty questions dominate updates.
By using MAD, DGAE equalizes the total update across questions regardless of difficulty (Theorem 2), removing the bias toward medium difficulty.
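The following sketch repeats the earlier proxy with GRPO’s standard deviation swapped for the Mean Absolute Deviation. It is an illustrative reconstruction, not the authors’ DGAE implementation, but it shows the equalization effect described above: under binary rewards the per‑question total becomes (approximately) constant across accuracy levels.

```python
# Illustrative reconstruction (not the authors' DGAE code): replace GRPO's
# standard deviation with the Mean Absolute Deviation when normalizing the
# group-relative advantage.
import numpy as np

def mad_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantage normalized by the Mean Absolute Deviation."""
    centered = rewards - rewards.mean()
    mad = np.abs(centered).mean()
    return centered / (mad + eps)

group_size = 8
for n_correct in [1, 2, 4, 6, 7]:
    rewards = np.array([1.0] * n_correct + [0.0] * (group_size - n_correct))
    total_update = np.abs(mad_advantages(rewards)).sum()
    print(f"accuracy={n_correct / group_size:.3f}  total |advantage| = {total_update:.2f}")
# Unlike the std-normalized totals above, every question now contributes the
# same total update magnitude (the group size, 8), regardless of its difficulty.
```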
2.2 DGPO – Difficulty‑aware Question‑level Weighting (DQW)
After balancing, DQW estimates each question’s difficulty from its current average correctness rate and assigns higher weights to harder questions that still provide a learning signal. The original article presents the exact weighting formula in a diagram that is not reproduced here.
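Since the precise formula is unavailable here, the sketch below shows only one plausible shape such a weight could take: up‑weighting low‑accuracy questions while zeroing out groups that carry no gradient signal. The function `question_weight` and its `alpha` knob are hypothetical stand‑ins, not the paper’s formula.

```python
# Hypothetical stand-in for a difficulty-aware question-level weight; the
# paper's actual DQW formula (given in a diagram) may differ.
def question_weight(accuracy: float, alpha: float = 1.0) -> float:
    """Weight for one question based on its current average correctness rate.

    accuracy: fraction of sampled answers that are correct (0..1).
    alpha:    sharpness of the preference for harder questions (assumed knob).
    """
    if accuracy <= 0.0 or accuracy >= 1.0:
        return 0.0  # all-wrong or all-correct groups give no relative-advantage signal
    return (1.0 - accuracy) ** alpha  # harder (lower accuracy) -> larger weight

for acc in [0.1, 0.25, 0.5, 0.75, 0.9]:
    print(f"accuracy={acc:.2f}  weight={question_weight(acc):.3f}")
```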
2.3 MQR – Making questions harder while keeping answers
MQR addresses “what to learn”. It rewrites each question along three axes, preserving the original gold answer:
Background: Add seemingly relevant but distracting information, forcing the model to locate the true mathematical condition.
Term: Replace core concepts with new abstract terminology, requiring genuine understanding of definitions.
Sub‑Problem: Transform a key numeric condition into a sub‑question that must be solved before tackling the main problem, extending the reasoning chain.
All rewrites maintain the original answer, ensuring reliable training signals without inflating dataset size.
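As a rough illustration of how the three rewrite axes could be driven in practice, the sketch below builds one prompt per axis and keeps only rewrites that still verify against the original gold answer. The prompt wording and the `rewrite_with_llm` / `verify_answer` helpers are hypothetical placeholders; only the three axes and the answer‑preservation requirement come from the description above.

```python
# Hypothetical MQR-style rewriting pipeline; prompts and helper functions are
# placeholders, not the authors' implementation.
from typing import Callable

MQR_TEMPLATES = {
    "background": (
        "Rewrite the problem by adding plausible but irrelevant background details. "
        "Do not change any mathematical condition or the final answer.\n\nProblem: {q}"
    ),
    "term": (
        "Rewrite the problem by replacing core concepts with newly defined abstract terms, "
        "defining each new term in the statement. Keep the final answer unchanged.\n\nProblem: {q}"
    ),
    "sub_problem": (
        "Rewrite the problem so that one key numeric condition must first be obtained by "
        "solving a sub-question. Keep the final answer unchanged.\n\nProblem: {q}"
    ),
}

def reformulate(question: str, gold_answer: str,
                rewrite_with_llm: Callable[[str], str],
                verify_answer: Callable[[str, str], bool]) -> list[str]:
    """Return harder variants of `question` that still verifiably yield `gold_answer`."""
    variants = []
    for template in MQR_TEMPLATES.values():
        candidate = rewrite_with_llm(template.format(q=question))
        # Keep a rewrite only if the original gold answer still verifies against it,
        # so the RLVR reward signal stays reliable.
        if verify_answer(candidate, gold_answer):
            variants.append(candidate)
    return variants
```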
3. Experimental results
Experiments demonstrate that harder training data yields stronger, more stable, and more generalizable reasoning.
Table 1 shows that both DGPO alone and MQR alone surpass the strong GRPO baseline, and the combined MathForge achieves the best performance, improving average scores by over 4.5 points. This advantage holds across multiple baselines.
Table 2 indicates consistent gains (≈3–4.5 points) for models ranging from small to 7B parameters, confirming the method’s scalability.
Algorithmic ablations (Table 3) verify that both DGAE and DQW are necessary and complementary. Table 4 shows DGPO can be plugged into various existing RL methods for additional improvements.
Table 5 extends DGPO to multimodal math reasoning, again yielding >2‑point gains, suggesting the difficulty‑aware training principle is broadly applicable.
Data‑centric analyses (Tables 6‑7) control for total training volume and reveal that MQR‑rewritten data outperform original data, with each of the three rewrite strategies contributing positively; the Sub‑Problem rewrite most significantly raises difficulty.
Training dynamics visualized in Figure 1 show DGPO produces shorter, more accurate outputs, indicating more efficient reasoning paths. Figure 2 illustrates the “train harder, test better” effect: MQR lowers training accuracy but improves final test performance.
4. Conclusion
MathForge answers the crucial question of which problems are worth learning in reinforcement‑learning‑based math reasoning. The answer is not the easiest nor the impossible, but the harder‑but‑learnable problems. DGPO ensures the model truly focuses on these, while MQR reliably generates such problems. Together they turn “harder training” into “stronger reasoning”. The core insight aligns with the paper’s title: Harder Is Better.