Tag: history resampling

Kuaishou Tech
Apr 24, 2025 · Artificial Intelligence

Two‑Stage History‑Resampling Policy Optimization (SRPO) for Large‑Scale LLM Reinforcement Learning

The article introduces SRPO, a two‑stage history‑resampling reinforcement‑learning framework that systematically tackles common GRPO training issues and achieves state‑of‑the‑art performance on both math and code benchmarks with far fewer training steps, while also revealing emergent self‑reflection behaviors in large language models.

LLM optimization · SRPO · cross-domain training