How Risk‑Sensitive Reinforcement Learning Improves LLM Pass@K Performance
This article analyzes why standard reinforcement learning can degrade Pass@K metrics when fine‑tuning large language models, introduces a risk‑sensitive RL objective that reshapes the advantage estimator, and shows through bandit and mathematical‑reasoning experiments that the resulting RS‑GRPO method improves diversity and Pass@K scores on most of the evaluated LLMs.
Core Insight
In traditional reinforcement learning (RL), agents start from a random policy, so there is little for RL to degrade relative to that starting point. Large language models (LLMs), by contrast, arrive at RL fine‑tuning with a highly peaked policy distribution from pre‑training. Applying standard RL from this starting point can compress the distribution further, improving the single‑sample metric (Pass@1) while reducing diversity and lowering multi‑sample metrics such as Pass@K.
The authors illustrate the effect with a 100‑armed bandit simulation, showing that standard RL narrows the policy and harms Pass@K.
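A rough illustration of that effect (a minimal toy of my own, not the authors' bandit.py): a 100‑armed bandit whose initial policy is peaked on a good‑but‑suboptimal arm, trained by exact gradient ascent on the mean reward.

```python
import numpy as np

# Toy sketch (assumptions: exact gradients, hand-picked rewards), not the
# paper's experiment. Arm 0 is optimal (reward 1.0); the initial policy is
# peaked on arm 1, a "good enough" arm (reward 0.8); all other arms pay 0.
N_ARMS = 100
rewards = np.zeros(N_ARMS)
rewards[0] = 1.0                 # optimal arm
rewards[1] = 0.8                 # initially-favored, suboptimal arm

logits = np.zeros(N_ARMS)
logits[1] = 3.0                  # peaked start, mimicking a pre-trained LLM policy

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def pass_at_k(p_best, k):
    # Probability that at least one of k i.i.d. samples is the optimal arm.
    return 1.0 - (1.0 - p_best) ** k

def report(tag):
    p = softmax(logits)
    print(f"{tag}: E[reward]={p @ rewards:.3f}  "
          f"P(optimal arm)={p[0]:.4f}  Pass@32={pass_at_k(p[0], 32):.3f}")

report("before RL")
LR, STEPS = 0.1, 1500
for _ in range(STEPS):
    p = softmax(logits)
    # Exact policy gradient of E[reward] for a softmax policy: p_i * (R_i - E[R]).
    logits += LR * p * (rewards - p @ rewards)
report("after RL ")
# Typical outcome: expected reward climbs toward 0.8 as the favored arm absorbs
# probability mass, while the optimal arm's probability -- and with it Pass@32 --
# shrinks: the mean-reward objective sharpens an already-peaked policy.
```

The squeeze on the optimal arm comes purely from the softmax normalization: every update that reinforces the already‑favored arm renormalizes probability away from the rarely‑sampled optimum.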
Problem with the Standard RL Objective
Standard RL maximizes the expected (mean) reward. From a peaked starting point, this objective concentrates probability mass on outputs that are already likely and reasonably rewarded, further sharpening the distribution and degrading Pass@K performance.
Method: Risk‑Sensitive RL (RS‑GRPO)
The proposed solution replaces the mean reward with a smooth maximum: the logarithm of the expected exponentiated reward (a log‑sum‑exp over the group's rewards, scaled by a risk parameter β). Only the advantage estimator in the policy‑gradient update changes, so the method can be dropped into existing GRPO implementations with minimal code changes.
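The exact estimator is in the repository linked just below; as a hedged sketch of the idea (my reading of the description above, not necessarily the paper's precise formula, with GRPO's std‑normalization omitted), the change amounts to swapping the group‑mean baseline for a log‑sum‑exp smooth maximum:

```python
import numpy as np

def grpo_advantage(rewards: np.ndarray) -> np.ndarray:
    """Standard GRPO-style advantage: reward minus the group mean
    (std-normalization omitted for clarity)."""
    return rewards - rewards.mean()

def risk_sensitive_advantage(rewards: np.ndarray, beta: float) -> np.ndarray:
    """Sketch of a risk-sensitive advantage: the baseline is the smooth maximum
    (1/beta) * log(mean(exp(beta * r))) instead of the group mean.
    As beta -> 0 this recovers the mean baseline; as beta -> inf it approaches
    max(r), so only the best completions in the group keep a non-negative
    advantage."""
    shifted = beta * (rewards - rewards.max())            # keep exp() stable
    smooth_max = rewards.max() + np.log(np.exp(shifted).mean()) / beta
    return rewards - smooth_max

group = np.array([0.0, 0.0, 0.4, 1.0])   # rewards for one prompt's G rollouts
print("GRPO    :", grpo_advantage(group))
print("beta=2.0:", risk_sensitive_advantage(group, beta=2.0))
print("beta=8.0:", risk_sensitive_advantage(group, beta=8.0))
# Larger beta pulls the baseline toward the best rollout, so mediocre
# completions are pushed down harder and the rare high-reward completion
# dominates the update -- which is what preserves exploration.
```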
Reference code: https://github.com/Jackory/RS-GRPO/blob/main/codes/bandit.py
Theoretical Perspective
Standard policy‑gradient updates can reduce the selection probability of the optimal action.
With a sufficiently large risk parameter β, the risk‑sensitive update guarantees an increased probability of the optimal action.
Beyond a certain β threshold, convergence slows, revealing a trade‑off between exploration and speed.
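In standard risk‑sensitive RL notation (the paper's exact formulation may differ), the objective behind these statements is the exponential‑utility value, which interpolates between the mean reward and the best‑case reward:

```latex
J_\beta(\pi) \;=\; \frac{1}{\beta}\,\log \mathbb{E}_{y\sim\pi}\!\left[e^{\beta R(y)}\right],
\qquad
\lim_{\beta \to 0} J_\beta(\pi) \;=\; \mathbb{E}_{y\sim\pi}\!\left[R(y)\right],
\qquad
\lim_{\beta \to \infty} J_\beta(\pi) \;=\; \sup_{y:\,\pi(y)>0} R(y).
```

Small β thus reproduces standard mean‑reward RL, while large β only cares whether some sampled output scores highly, which is the quantity Pass@K rewards. The cost is that the exponential weighting concentrates each update on the few best samples, which explains the slower convergence noted above.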
Experimental Perspective
A 100‑armed bandit experiment confirms that larger β values raise the cumulative solve rate (exploration) while slowing reward convergence, acting as a regularizer.
Further experiments on mathematical‑reasoning benchmarks evaluate the impact of β on Pass@1 and Pass@32 across several models (QwenMath1.5B, QwenMath7B, Qwen7B, Qwen3‑4B‑Base, Llama3.1‑8B‑Instruct). Findings:
Higher β improves the cumulative solve rate, indicating stronger exploration.
Training‑reward convergence slows with larger β, matching the theoretical analysis.
Pass@1 peaks around β=2, while Pass@32 consistently benefits from β≥4.
Pass@K Evaluation on Multiple LLMs
RS‑GRPO was compared against the baseline and standard GRPO on the models above. RS‑GRPO achieved superior Pass@K curves on all models except Qwen7B and Llama3.1‑8B‑Instruct, where the weaker base performance limited the benefit. The authors attribute this to a larger distance between the initial policy distribution and the global optimum; larger β values or longer training may be required.
A tabular summary shows RS‑GRPO improves Pass@1 by 0–3 % and Pass@32 by 2–5 % over GRPO across models.
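For context, Pass@K numbers like these are conventionally computed with the unbiased estimator of Chen et al. (2021); I am assuming that convention here: draw n ≥ K generations per problem, count the c correct ones, and estimate the probability that a random subset of K generations contains at least one correct answer.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn without replacement from n generations (c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 32 generations per problem, 4 of them correct.
print(pass_at_k(n=32, c=4, k=1))   # 0.125  (equals average accuracy)
print(pass_at_k(n=32, c=4, k=8))   # ~0.70
```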
Related Work Comparison
Several recent papers also modify the advantage estimator to optimize Pass@K directly. Unlike those, RS‑GRPO works with continuous rewards and avoids the zero‑advantage issue that arises when a group's sample accuracy exceeds 1‑K/N: at that point every size‑K subset of the N samples contains a correct answer, so the Pass@K‑based advantage collapses to zero and learning stalls.
[1] Optimizing Language Models for Inference Time Objectives using Reinforcement Learning
[2] Walder and Karkhanis. Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
[3] Chen et al. Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models
[4] Mahdavi et al. Beyond Accuracy: A Policy Gradient Reweighting Approach for Pass@K Maximization in LLMs
Key Takeaways
Reshaping the advantage to optimize a smoothed maximum reward counteracts the policy sharpening that mean‑reward RL induces on top of an already‑peaked initial LLM policy, thereby improving exploration and overall Pass@K performance. The approach is orthogonal to entropy‑regularization methods and can be combined with them.
References
https://arxiv.org/abs/2504.13837