Why Does GRPO Loss Start at Zero and Grow During OpenR1 Training?
The article explains why the GRPO loss in OpenR1 and trl starts at zero and then rises, detailing the underlying KL‑divergence formulation, the single‑step update mechanism, and how gradients are preserved despite a zero scalar loss, with code examples from the trl implementation.
Background
In the single‑step GRPO implementation used by OpenR1 and the trl library, the reported scalar loss reduces to a scaled average KL‑divergence between the current policy model and a frozen reference model. Training follows a one‑exploration‑step‑per‑iteration schedule, and the reward is defined at the sequence level (a single scalar per generated completion), so the resulting advantage is broadcast to every token of that completion.
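Written out, the per‑completion objective looks roughly as follows (a sketch consistent with the trl code quoted later, where sg[·] denotes stop‑gradient, Â is the sequence‑level advantage broadcast to every token, and β is the KL coefficient):

$$
\mathcal{L} \;=\; -\,\frac{1}{|o|}\sum_{t=1}^{|o|}
\Big[
\exp\!\big(\log\pi_\theta(o_t \mid q, o_{<t}) - \mathrm{sg}\!\left[\log\pi_\theta(o_t \mid q, o_{<t})\right]\big)\,\hat{A}
\;-\;
\beta\,\mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]_{t}
\Big]
$$

The batch loss is the mean of this quantity over all generated completions. Because the ratio term always evaluates to 1 in value in the single‑step setting, the logged number is governed by the KL term.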
Why the loss is initialized to zero
At the very beginning of training the policy and reference models have identical parameters (the reference is simply a frozen copy of the initial policy), so they produce identical probability distributions for every token. The KL‑divergence between identical distributions is zero, so the loss, which is β × (average KL‑divergence), evaluates to zero.
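As a quick sanity check (toy tensors, not trl code), the per‑token KL estimator used in this setup, exp(ref − logp) − (ref − logp) − 1, is identically zero when the two sets of log‑probabilities coincide:

import torch

per_token_logps = torch.tensor([-0.5, -1.2, -2.0])   # policy log-probs for sampled tokens
ref_per_token_logps = per_token_logps.clone()         # reference identical to policy at step 0

diff = ref_per_token_logps - per_token_logps
per_token_kl = torch.exp(diff) - diff - 1
print(per_token_kl)          # tensor([0., 0., 0.])
print(per_token_kl.mean())   # tensor(0.)  -> beta * 0 = 0 loss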
Why the loss increases during training
As training proceeds the policy model is updated and gradually diverges from the reference model. This divergence raises the KL‑divergence term, causing the scalar loss value to grow. The increase of the scalar loss does not contradict gradient descent: the loss is used only to compute gradients, and the gradients remain non‑zero even when the loss value is zero. Parameter updates are driven by gradient × learning‑rate, not by the loss magnitude itself.
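Continuing the toy example above (illustrative numbers only): once the updated policy's log‑probabilities drift away from the frozen reference, every entry of the estimator turns positive, so the logged value β × mean(KL) climbs:

import torch

beta = 0.04                                             # illustrative KL coefficient
ref_per_token_logps = torch.tensor([-0.5, -1.2, -2.0])  # frozen reference
per_token_logps = torch.tensor([-0.3, -1.5, -1.8])      # policy after some updates

diff = ref_per_token_logps - per_token_logps
per_token_kl = torch.exp(diff) - diff - 1
print(per_token_kl)                 # every entry > 0
print(beta * per_token_kl.mean())   # the part of the scalar loss that grows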
Gradient preservation in trl
The implementation keeps gradients alive by subtracting a detached copy of the per‑token log‑probabilities before exponentiation. The key line is:

per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)

and the full loss computation is:
# per_token_kl is computed a few lines earlier in trl with the k3 estimator:
# per_token_kl = torch.exp(ref_per_token_logps - per_token_logps) - (ref_per_token_logps - per_token_logps) - 1

# x - x.detach() allows for preserving gradients from x
advantages = inputs['advantages']
per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)
per_token_loss = -(per_token_loss - self.beta * per_token_kl)
loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()

Even when the scalar loss evaluates to zero, the gradient of per_token_loss with respect to the model parameters is non‑zero, enabling learning.
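A minimal, self‑contained sketch of this behaviour (toy numbers, not trl code): with group‑normalised advantages that sum to zero, the scalar loss is exactly 0, yet every log‑probability still receives a non‑zero gradient.

import torch

# Two sampled completions with zero-mean advantages, as after group normalisation.
per_token_logps = torch.tensor([-1.2, -0.7], requires_grad=True)
advantages = torch.tensor([1.0, -1.0])

ratio = torch.exp(per_token_logps - per_token_logps.detach())  # value: exactly 1.0 per entry
loss = -(ratio * advantages).mean()
print(loss.item())            # 0.0 -- the reported scalar loss is zero

loss.backward()
print(per_token_logps.grad)   # tensor([-0.5000,  0.5000]) -- gradients are not zero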
Reference
Further discussion and the original derivation are available in the OpenR1 GitHub issue: https://github.com/huggingface/open-r1/issues/239