Can Adaptive Guidance Unlock Small Model Reasoning? Introducing G²RPO‑A
The paper identifies reward sparsity as the core obstacle for small language models in reinforcement‑learning‑based reasoning. It proposes G²RPO‑A, which injects high‑quality thinking trajectories into roll‑outs and dynamically adjusts the guidance length. It reports large accuracy gains on math and code benchmarks: Qwen3‑1.7B improves from 50.96 % to 67.21 % on MATH500 and from 46.08 % to 75.93 % on HumanEval.
Reward sparsity in small language models (SLMs)
Reinforcement‑learning methods such as GRPO achieve large gains on models with more than 7B parameters but improve 1.7B‑parameter SLMs only marginally. Analysis shows that most roll‑outs of SLMs receive zero reward because the models cannot generate high‑quality reasoning chains. The result is an extremely sparse reward distribution, as the paper's reward heat‑map for Qwen3‑1.7B on code tasks illustrates.
“We trained GRPO on Qwen3‑1.7B and the high‑reward candidates were always too few, making it hard for the model to learn effective reasoning strategies…”
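To see why zero‑reward roll‑outs stall learning, consider how a GRPO‑style method scores each roll‑out relative to its group. The sketch below is illustrative only: the function name, the use of the population standard deviation, and the `eps` constant are assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of roll-outs (GRPO-style).

    Assumption: population std plus a small eps for numerical safety;
    real implementations may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


sparse_group = [0.0] * 8              # every roll-out fails
mixed_group = [0.0] * 6 + [1.0] * 2   # two of eight roll-outs succeed

print(group_relative_advantages(sparse_group))  # all zeros: no learning signal
print(group_relative_advantages(mixed_group))   # successes positive, failures negative
```

When every roll‑out in a group fails, all advantages are zero and the policy gradient for that prompt vanishes. This is the situation that injected guidance is meant to avoid.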
G²RPO‑A architecture
G²RPO‑A (Guided Group Relative Policy Optimization with Adaptive Guidance) injects high‑quality thinking trajectories into a subset of roll‑outs and dynamically adjusts the guidance length based on recent reward trends.
Guidance mechanism: During generation, partial high‑quality reasoning paths are inserted, steering the SLM toward better answer candidates.
Adaptive adjustment: The guidance length is increased when recent rewards fall below the historical average and decreased when rewards exceed the average, implementing an “intelligent gear‑shift”.
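The guidance mechanism described above can be pictured as prefixing some roll‑outs in a group with the first few tokens of a reference reasoning trajectory. The following is a minimal sketch under stated assumptions: `build_rollout_prefixes`, `guided_fraction`, and the token‑list representation are hypothetical names chosen for illustration and do not come from the paper or its repository.

```python
def build_rollout_prefixes(prompt, guidance_tokens, guidance_len,
                           group_size, guided_fraction):
    """Return one generation prefix (a token list) per roll-out in a group.

    Hypothetical helper: the first round(group_size * guided_fraction)
    roll-outs start from the prompt plus the first `guidance_len` tokens
    of a reference trajectory; the remaining roll-outs start from the
    bare prompt.
    """
    n_guided = round(group_size * guided_fraction)
    hint = guidance_tokens[:max(0, guidance_len)]
    prefixes = []
    for i in range(group_size):
        if i < n_guided:
            prefixes.append(list(prompt) + list(hint))  # guided roll-out
        else:
            prefixes.append(list(prompt))               # unguided roll-out
    return prefixes


prompt = ["Solve", "x", "+", "2", "=", "5", "."]
reference = ["Subtract", "2", "from", "both", "sides", "."]

for prefix in build_rollout_prefixes(prompt, reference, guidance_len=3,
                                     group_size=4, guided_fraction=0.5):
    print(" ".join(prefix))
```

Keeping some roll‑outs unguided is one way to retain a mix of outcomes within a group; how the actual method splits guided and unguided samples should be checked against the paper.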
Why fixed‑length (naïve) guidance fails
Experiments on the Math‑220K subset show that a constant guidance length yields only a brief early boost. Although raw reward rises, the standard deviation of the advantage signal collapses, preventing effective optimization. Fixed‑length guidance makes high‑reward samples easier to obtain but does not preserve a discriminative advantage signal.
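One plausible way to picture this collapse, assuming binary 0/1 rewards (an assumption made here for illustration, not a claim from the paper): if each roll‑out succeeds with probability p, the reward standard deviation is sqrt(p(1 − p)), which peaks at p = 0.5 and shrinks toward zero as guidance pushes nearly every roll‑out to succeed.

```python
import math


def binary_reward_std(success_rate):
    """Std of a 0/1 reward with the given success probability (Bernoulli)."""
    return math.sqrt(success_rate * (1.0 - success_rate))


# As the success rate approaches 0 or 1, roll-outs in a group look alike
# and the spread that separates good from bad samples disappears.
for p in (0.05, 0.5, 0.95, 1.0):
    print(f"success rate {p:.2f} -> reward std {binary_reward_std(p):.3f}")
```

Under this simplified view, both too little guidance (almost no successes) and too much (almost all successes) leave little spread to learn from, which motivates adjusting the guidance length during training.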
Adaptive update rule
For training step k, let T be the maximum window size, m = min(T, k) the effective window size, ℓ_k the current guidance length, r_k the current reward, and Δ the step size. The rule updates ℓ_k as follows:
$$
\ell_{k+1} =
\begin{cases}
\ell_k - \Delta, & \text{if } r_k > \dfrac{1}{m}\sum_{i=k-m+1}^{k} r_i,\\[4pt]
\ell_k + \Delta, & \text{otherwise.}
\end{cases}
$$
Thus guidance shortens when recent rewards rise and lengthens when they fall, creating a self‑regulating schedule.
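The update rule translates almost directly into code. The sketch below follows the rule as stated in this article; the function name, the argument names, and the clamping of the result to `[0, max_length]` are assumptions added for illustration.

```python
def update_guidance_length(rewards, length, delta, window, max_length):
    """Apply one step of the windowed guidance-length rule.

    rewards    : list of rewards r_1..r_k observed so far (rewards[-1] is r_k)
    length     : current guidance length l_k
    delta      : step size
    window     : maximum window size T
    max_length : upper bound on the guidance length (assumption, not in the rule)
    """
    if not rewards:
        raise ValueError("need at least one reward")
    k = len(rewards)
    m = min(window, k)                    # effective window size
    recent_mean = sum(rewards[-m:]) / m   # mean of r_{k-m+1}..r_k
    if rewards[-1] > recent_mean:
        new_length = length - delta       # rewards rising: shorten guidance
    else:
        new_length = length + delta       # rewards flat or falling: lengthen
    # Clamping is an assumption so the length stays in a valid range.
    return max(0, min(max_length, new_length))


print(update_guidance_length([0.2, 0.2, 0.6], length=100, delta=10,
                             window=3, max_length=512))  # 90
print(update_guidance_length([0.6, 0.6, 0.2], length=100, delta=10,
                             window=3, max_length=512))  # 110
```

Because the window as written includes r_k itself, the very first step (m = 1) compares r_1 with itself and therefore always lengthens the guidance.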
Main experimental results
Mathematics benchmarks
Qwen3‑1.7B‑Base on MATH500: 50.96 % → 67.21 % (Δ +16.25 %).
Qwen3‑1.7B‑Base on GPQA: 27.45 % → 32.35 % (Δ +4.90 %).
Qwen3‑8B‑Base on MATH500: 71.32 % → 82.08 % (Δ +10.76 %).
AIME24/25 with Qwen3‑1.7B: G²RPO‑A 63.33 % / 53.33 % vs. GRPO 56.67 % / 50.00 %.
Code benchmarks
Qwen3‑0.6B HumanEval: 32.32 % → 44.96 % (Δ +12.64 %).
Qwen3‑0.6B LiveCodeBench: 17.07 % → 23.14 % (Δ +6.07 %).
Qwen3‑1.7B HumanEval: 46.08 % → 75.93 % (Δ +29.85 %).
Aggregated Code‑Avg for Qwen3‑1.7B: 63.95 % (G²RPO‑A) > GRPO 60.40 % > Clip‑Higher 60.19 %.
Key observations
Naïve fixed guidance raises raw reward but collapses the advantage variance, limiting training efficiency.
Guidance ratio and proportion are more critical for code tasks; small models depend on guidance more heavily than larger models.
Limitations
Evaluations focus on mathematics and code; cross‑modal tasks remain untested. The guidance‑ratio hyper‑parameter α is manually tuned, so fully automatic adaptation is an open problem.
Paper: https://arxiv.org/abs/2508.13023
Code repository: https://github.com/T-Lab-CUHKSZ/G2RPO-A
