Baobao Algorithm Notes
Mar 19, 2025 · Artificial Intelligence
Why Does GRPO Loss Start at Zero and Grow During OpenR1 Training?
This article explains why the GRPO loss in OpenR1 and trl starts at exactly zero and then grows during training. It covers the underlying KL-divergence formulation, the single-update-per-batch mechanism, and how gradients still flow even though the scalar loss is zero, with code examples drawn from the trl implementation.
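The core of the "zero loss, nonzero gradient" effect can be shown in a few lines. The following is a minimal PyTorch sketch, not the actual trl code: on the first (single-step) update, the importance ratio `exp(logp - logp.detach())` evaluates to exactly 1 but still carries a gradient, and group-normalized advantages are zero-mean, so the scalar loss is exactly 0 while per-sample gradients survive.

```python
import torch

# Hypothetical log-probs of two sampled completions (illustrative values).
logps = torch.tensor([-1.2, -0.7], requires_grad=True)
# Group-normalized advantages are zero-mean by construction.
advantages = torch.tensor([1.0, -1.0])

# On the first update, new policy == old policy, so the ratio is exactly 1
# in value -- but the detach trick keeps the gradient path alive.
ratio = torch.exp(logps - logps.detach())

# Scalar policy loss: exactly 0.0 because the advantages cancel.
loss = -(ratio * advantages).mean()
loss.backward()

print(loss.item())   # 0.0
print(logps.grad)    # nonzero: each sample still pushes the policy
```

The gradient with respect to each log-prob is `-advantage / 2`, so the optimizer moves the policy even though the reported loss is zero; as the policy drifts from the reference, the KL term becomes positive and the logged loss grows.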
Tags: GRPO · Loss Initialization · OpenR1
