Two‑Stage History‑Resampling Policy Optimization (SRPO) for Large‑Scale LLM Reinforcement Learning
The article introduces SRPO, a two-stage, history-resampling reinforcement-learning framework. SRPO systematically addresses common failure modes of GRPO training, achieves state-of-the-art performance on both math and code benchmarks with far fewer training steps, and reveals emergent self-reflection behaviors in large language models.