Can Adaptive Guidance Unlock Small Model Reasoning? Introducing G²RPO‑A
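The paper identifies reward sparsity as the core obstacle for small language models in reinforcement-learning-based reasoning: a small model rarely produces a correct answer on its own, so most rollouts earn no reward and provide little training signal. To address this, it proposes G²RPO-A, which injects high-quality thinking trajectories into the model's rollouts as guidance and dynamically adjusts the length of that guidance during training. The method yields large accuracy gains on math and code benchmarks. For example, Qwen3-1.7B improves from 50.96% to 67.21% on MATH500 and from 46.08% to 75.93% on HumanEval.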
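To make the reward-sparsity problem concrete, here is a minimal sketch of the group-relative advantage used in GRPO-style training. It assumes G²RPO-A follows the usual GRPO normalization (each rollout's reward minus the group mean, divided by the group standard deviation). The function name `group_relative_advantages` and the `eps` parameter are illustrative, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumption: standard GRPO-style normalization; eps only guards
    against division by zero when all rewards are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Sparse-reward case: every rollout fails, so every advantage is zero
# and the group contributes no gradient signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# If at least one rollout succeeds, the group becomes informative:
# the successful rollout gets a positive advantage, the others negative.
one_success = group_relative_advantages([0.0, 0.0, 1.0, 0.0])

print(all_fail)
print(one_success)
```

Under this assumption, guidance helps by raising the chance that at least one rollout in a group succeeds, so that the group yields a nonzero advantage.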
