Machine Heart
May 6, 2026 · Artificial Intelligence
Can Adaptive Guidance Unlock Small Model Reasoning? Introducing G²RPO‑A
The paper identifies reward sparsity as the core obstacle for small language models in reinforcement‑learning‑based reasoning. It proposes G²RPO‑A, which injects high‑quality thinking trajectories and dynamically adjusts the guidance length, and reports large accuracy gains on math and code benchmarks: Qwen3‑1.7B improves from 50.96% to 67.21% on MATH500 and from 46.08% to 75.93% on HumanEval.
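To make the reward-sparsity point concrete, here is a minimal sketch. It assumes a GRPO-style group-normalized advantage, which the name G²RPO‑A suggests but the summary does not spell out. The `adjust_guidance_length` rule, its threshold, and its step size are hypothetical illustrations of "dynamically adjusts the guidance length", not the paper's actual schedule.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style assumption)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def adjust_guidance_length(current_len, recent_mean_reward,
                           target=0.5, step=32, max_len=512):
    """Hypothetical rule: lengthen the injected hint when rewards are scarce,
    shorten it once the model succeeds often enough on its own."""
    if recent_mean_reward < target:
        return min(max_len, current_len + step)
    return max(0, current_len - step)


# A small model that fails every rollout gets identical rewards,
# so every advantage is zero and the policy receives no learning signal.
sparse = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Once some rollouts succeed (for example with guidance), the signal reappears.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Under these assumptions, an all-fail group yields zero advantages for every sample, while a group with mixed outcomes yields positive advantages for the successes and negative ones for the failures. That is the gap guidance is meant to close.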
Code Generation · G²RPO‑A · adaptive guidance
