How GVPO Improves LLM Fine‑Tuning: Stable, Sample‑Rich Policy Optimization
The article introduces GVPO (Group Variance Policy Optimization), a post-training method for large language models whose unique optimum coincides with the solution of KL-constrained reward maximization, which supports diverse sampling distributions, and which avoids the instability and sample inefficiency of GRPO and traditional policy-gradient approaches.
TL;DR
We propose GVPO, whose unique optimum is exactly the solution of KL-constrained reward maximization, and which supports diverse sampling distributions, avoiding both the sample inefficiency of strict on-policy training and the instability of importance-sampling corrections.
Motivation
Inspired by DPO, we aim to apply KL‑constrained reward maximization in the GRPO setting where each prompt is sampled multiple times.
The analytical solution requires computing Z(x), an expectation over all possible y, which is intractable. We observe that if the sum of gradient coefficients for all samples of a prompt equals zero, Z(x) cancels out.
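Concretely, the cancellation follows from the standard closed-form solution to KL-constrained reward maximization (the same one DPO starts from); the symbols π_ref, β, r(x,y), and the group coefficients w_i below are our notation for illustration, not reproduced from the paper.

```latex
% Closed-form optimum of KL-constrained reward maximization:
\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(r(x,y)/\beta\big),
\qquad
Z(x) = \textstyle\sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(r(x,y)/\beta\big).

% Rearranged, the reward implied by a policy carries the intractable log Z(x):
r(x,y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x).

% If the gradient coefficients w_1, \dots, w_k over the k responses sampled for
% the same prompt sum to zero, the Z(x) term drops out of the update:
\sum_{i=1}^{k} w_i \, \beta \log Z(x) = \beta \log Z(x) \sum_{i=1}^{k} w_i = 0.
```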
GVPO Formulation
Based on this insight we define Group Variance Policy Optimization (GVPO). We prove that GVPO has an intuitive physical interpretation: its loss reduces to a mean-squared error (MSE) between a prediction and a target. The prediction is the implicit reward implied by the policy relative to the reference (β·log(π_θ/π_ref)), centered within each group of responses to the same prompt; the target is the actual reward, centered the same way.
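As a minimal sketch of this MSE reading, the per-prompt loss might look like the PyTorch snippet below; the function name gvpo_loss, the reference policy term, and the coefficient beta are our own labels (following the DPO-style implicit-reward convention), not code from the paper.

```python
import torch

def gvpo_loss(logp_theta, logp_ref, rewards, beta=0.1):
    """Sketch of a GVPO-style loss for one prompt with k sampled responses.

    logp_theta: (k,) sequence log-probs of each response under the current policy
    logp_ref:   (k,) sequence log-probs under the reference policy
    rewards:    (k,) scalar reward for each response
    beta:       strength of the KL constraint
    """
    # Implicit reward implied by the policy: beta * log(pi_theta / pi_ref).
    implicit = beta * (logp_theta - logp_ref)
    # Center both implicit and actual rewards within the group; the zero-sum
    # coefficients make the intractable log Z(x) term cancel.
    implicit_centered = implicit - implicit.mean()
    rewards_centered = rewards - rewards.mean()
    # MSE between the centered prediction and the centered target.
    return 0.5 * ((implicit_centered - rewards_centered) ** 2).mean()
```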
Theoretical Guarantees
We show that the unique optimal policy of GVPO coincides with the optimal solution of KL‑constrained reward maximization.
Theorem 1
Minimizing the GVPO objective yields a unique policy that exactly solves the KL‑constrained reward‑maximization problem.
Theorem 2
Minimizing the same objective yields a unique policy for any distribution satisfying the zero‑sum gradient condition; softmax‑decoded policies meet this condition, so GVPO supports a broad class of sampling distributions.
Theorem 3
Using the n‑step GVPO algorithm maximizes a trust‑region‑constrained objective, guaranteeing stable updates and allowing the final policy to remain aligned with the initial one after n steps.
Comparison with DPO
Both GVPO and DPO exploit the analytical solution of KL‑constrained reward maximization. DPO relies on a BT model to cancel Z(x), while GVPO leverages the zero‑sum gradient property, making it applicable to multi‑response scenarios.
This shared structure yields two practical benefits: it ensures stable optimization without excessive policy drift, and it simplifies a joint policy-and-reward optimization into a reward-only problem.
Moreover, DPO may have multiple optimal solutions, whereas GVPO’s optimal solution is provably unique (Theorem 1).
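For comparison, DPO's pairwise loss cancels log Z(x) because the two implicit rewards enter with coefficients +1 and -1; GVPO generalizes this zero-sum cancellation from a preference pair to a group of k responses. The formula below is the standard DPO objective, written in our notation.

```latex
% Standard DPO loss over a preferred/rejected pair (y_w, y_l):
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\Big(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big).

% The +1/-1 coefficients on the two implicit rewards sum to zero, so the
% \beta \log Z(x) term inside each implicit reward cancels in the difference.
```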
Comparison with GRPO and Standard Policy‑Gradient Methods
We rewrite the GRPO loss and identify three components in GVPO’s loss: (1) advantage maximization, (2) a KL‑based trust‑region constraint, and (3) entropy regularization that balances exploration and exploitation.
GVPO avoids on‑policy sampling inefficiencies and the variance introduced by importance‑sampling corrections used in PPO/GRPO.
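To make the contrast concrete, the snippet below shows the clipped importance-ratio surrogate that PPO/GRPO optimize (toy tensors; the clipping threshold eps = 0.2 is the conventional PPO default, not a value from the paper) and notes why a GVPO-style loss has no such ratio.

```python
import torch

# PPO/GRPO-style clipped surrogate: requires the importance ratio against the
# sampling policy, which adds variance once samples drift off-policy.
logp_theta = torch.randn(4, requires_grad=True)  # current-policy log-probs (toy)
logp_old = torch.randn(4)                        # sampling-policy log-probs (toy)
advantages = torch.randn(4)                      # group-normalized advantages (toy)
eps = 0.2
ratio = (logp_theta - logp_old).exp()
# Objective to be maximized (the training loss is its negative).
ppo_objective = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()

# A GVPO-style loss never forms this ratio: it only compares group-centered
# implicit rewards with group-centered actual rewards (see the gvpo_loss sketch
# above), so responses may come from distributions other than the current policy.
```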
Algorithm Overview
At each step, the GVPO update constrains the new policy against the policy from the previous step; chaining n such steps keeps the final policy aligned with the initial one (Theorem 3).
Implementation in the verl framework requires only a few code modifications; a rough illustrative sketch of the loss computation follows.
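The sketch below is a framework-agnostic illustration of where such a modification lands, written under our own assumptions rather than against verl's actual API; the function name compute_gvpo_loss, the tensor shapes, and response_mask are hypothetical.

```python
import torch

def compute_gvpo_loss(logp_theta, logp_ref, response_mask, rewards, beta=0.1):
    """Generic sketch of a grouped GVPO-style loss (not verl's actual code).

    logp_theta:    (B, k, T) per-token log-probs under the trainable policy
    logp_ref:      (B, k, T) per-token log-probs under the reference policy
    response_mask: (B, k, T) 1 for response tokens, 0 for prompt/padding tokens
    rewards:       (B, k)    scalar reward per sampled response
    """
    # Sequence-level log-probs: sum over response tokens only.
    seq_theta = (logp_theta * response_mask).sum(-1)   # (B, k)
    seq_ref = (logp_ref * response_mask).sum(-1)       # (B, k)

    # Implicit reward implied by the policy, centered within each group of k
    # responses so that the intractable log Z(x) term cancels.
    implicit = beta * (seq_theta - seq_ref)
    implicit_centered = implicit - implicit.mean(dim=1, keepdim=True)
    rewards_centered = rewards - rewards.mean(dim=1, keepdim=True)

    # MSE between centered implicit rewards and centered actual rewards.
    return 0.5 * ((implicit_centered - rewards_centered) ** 2).mean()
```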
Paper: GVPO: Group Variance Policy Optimization for Large Language Model Post-Training
Link: https://arxiv.org/abs/2504.19599
Conclusion
GVPO introduces a simple MSE‑based loss that provides a unique optimal solution, stable training, and the ability to sample from richer distributions without on‑policy constraints. The method integrates easily into existing GRPO pipelines.