Why Does GRPO Loss Start at Zero and Grow During OpenR1 Training?
The article explains why the GRPO loss in OpenR1 and trl starts at zero and then rises, detailing the underlying KL‑divergence formulation, the single‑step update mechanism, and how gradients are preserved despite a zero scalar loss, with code examples from the trl implementation.
Background
In the single‑step GRPO implementation used by OpenR1 and the trl library, the reported scalar loss reduces to a scaled average KL‑divergence between the current policy model and a frozen reference model. Training follows a one‑exploration‑step‑per‑iteration schedule, and the reward is defined at the sequence level (a single scalar per generated completion), so the resulting advantage is broadcast to every token of that completion.
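Written out, the per‑completion objective looks roughly as follows (a sketch consistent with the trl code quoted later, where sg[·] denotes stop‑gradient, Â is the sequence‑level advantage broadcast to every token, and β is the KL coefficient):

$$
\mathcal{L} \;=\; -\,\frac{1}{|o|}\sum_{t=1}^{|o|}
\Big[
\exp\!\big(\log\pi_\theta(o_t \mid q, o_{<t}) - \mathrm{sg}\!\left[\log\pi_\theta(o_t \mid q, o_{<t})\right]\big)\,\hat{A}
\;-\;
\beta\,\mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]_{t}
\Big]
$$

The batch loss is the mean of this quantity over all generated completions. Because the ratio term always evaluates to 1 in value in the single‑step setting, the logged number is governed by the KL term.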
Why the loss is initialized to zero
At the very beginning of training the policy and reference models have identical parameters (the reference is simply a frozen copy of the initial policy), so they produce identical probability distributions for every token. The KL‑divergence between identical distributions is zero, so the loss, which is β × (average KL‑divergence), evaluates to zero.
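As a quick sanity check (toy tensors, not trl code), the per‑token KL estimator used in this setup, exp(ref − logp) − (ref − logp) − 1, is identically zero when the two sets of log‑probabilities coincide:

import torch

per_token_logps = torch.tensor([-0.5, -1.2, -2.0])   # policy log-probs for sampled tokens
ref_per_token_logps = per_token_logps.clone()         # reference identical to policy at step 0

diff = ref_per_token_logps - per_token_logps
per_token_kl = torch.exp(diff) - diff - 1
print(per_token_kl)          # tensor([0., 0., 0.])
print(per_token_kl.mean())   # tensor(0.)  -> beta * 0 = 0 loss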
Why the loss increases during training
As training proceeds the policy model is updated and gradually diverges from the reference model. This divergence raises the KL‑divergence term, causing the scalar loss value to grow. The increase of the scalar loss does not contradict gradient descent: the loss is used only to compute gradients, and the gradients remain non‑zero even when the loss value is zero. Parameter updates are driven by gradient × learning‑rate, not by the loss magnitude itself.
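Continuing the toy example above (illustrative numbers only): once the updated policy's log‑probabilities drift away from the frozen reference, every entry of the estimator turns positive, so the logged value β × mean(KL) climbs:

import torch

beta = 0.04                                             # illustrative KL coefficient
ref_per_token_logps = torch.tensor([-0.5, -1.2, -2.0])  # frozen reference
per_token_logps = torch.tensor([-0.3, -1.5, -1.8])      # policy after some updates

diff = ref_per_token_logps - per_token_logps
per_token_kl = torch.exp(diff) - diff - 1
print(per_token_kl)                 # every entry > 0
print(beta * per_token_kl.mean())   # the part of the scalar loss that grows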
Gradient preservation in trl
The implementation keeps gradients alive by subtracting a detached copy of the per‑token log‑probabilities before exponentiation. The key line is:

per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)

and the full loss computation is:
# per_token_kl is computed a few lines earlier in trl with the k3 estimator:
# per_token_kl = torch.exp(ref_per_token_logps - per_token_logps) - (ref_per_token_logps - per_token_logps) - 1

# x - x.detach() allows for preserving gradients from x
advantages = inputs['advantages']
per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)
per_token_loss = -(per_token_loss - self.beta * per_token_kl)
loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()

Even when the scalar loss evaluates to zero, the gradient of per_token_loss with respect to the model parameters is non‑zero, enabling learning.
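A minimal, self‑contained sketch of this behaviour (toy numbers, not trl code): with group‑normalised advantages that sum to zero, the scalar loss is exactly 0, yet every log‑probability still receives a non‑zero gradient.

import torch

# Two sampled completions with zero-mean advantages, as after group normalisation.
per_token_logps = torch.tensor([-1.2, -0.7], requires_grad=True)
advantages = torch.tensor([1.0, -1.0])

ratio = torch.exp(per_token_logps - per_token_logps.detach())  # value: exactly 1.0 per entry
loss = -(ratio * advantages).mean()
print(loss.item())            # 0.0 -- the reported scalar loss is zero

loss.backward()
print(per_token_logps.grad)   # tensor([-0.5000,  0.5000]) -- gradients are not zero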
Reference
Further discussion and the original derivation are available in the OpenR1 GitHub issue: https://github.com/huggingface/open-r1/issues/239