How GVPO Improves LLM Fine‑Tuning: Stable, Sample‑Rich Policy Optimization
The article introduces GVPO (Group Variance Policy Optimization), a post-training method for large language models whose unique optimum coincides with the solution of KL-constrained reward maximization, which supports diverse sampling distributions, and which avoids the instability and sample inefficiency of GRPO and traditional policy-gradient approaches.
TL;DR
We propose GVPO, whose unique optimum is exactly the solution of KL-constrained reward maximization, and which supports diverse sampling distributions, avoiding both the sample inefficiency of strict on-policy training and the instability of importance-sampling corrections.
Motivation
Inspired by DPO, we aim to apply KL‑constrained reward maximization in the GRPO setting where each prompt is sampled multiple times.
The analytical solution requires computing Z(x), an expectation over all possible y, which is intractable. We observe that if the sum of gradient coefficients for all samples of a prompt equals zero, Z(x) cancels out.
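Concretely, the cancellation follows from the standard closed-form solution to KL-constrained reward maximization (the same one DPO starts from); the symbols π_ref, β, r(x,y), and the group coefficients w_i below are our notation for illustration, not reproduced from the paper.

```latex
% Closed-form optimum of KL-constrained reward maximization:
\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(r(x,y)/\beta\big),
\qquad
Z(x) = \textstyle\sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(r(x,y)/\beta\big).

% Rearranged, the reward implied by a policy carries the intractable log Z(x):
r(x,y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x).

% If the gradient coefficients w_1, \dots, w_k over the k responses sampled for
% the same prompt sum to zero, the Z(x) term drops out of the update:
\sum_{i=1}^{k} w_i \, \beta \log Z(x) = \beta \log Z(x) \sum_{i=1}^{k} w_i = 0.
```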
GVPO Formulation
Based on this insight we define Group Variance Policy Optimization (GVPO). We prove that GVPO has an intuitive physical interpretation: its loss reduces to a mean-squared error (MSE) between a prediction and a target. The prediction is the implicit reward implied by the policy relative to the reference (β·log(π_θ/π_ref)), centered within each group of responses to the same prompt; the target is the actual reward, centered the same way.
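As a minimal sketch of this MSE reading, the per-prompt loss might look like the PyTorch snippet below; the function name gvpo_loss, the reference policy term, and the coefficient beta are our own labels (following the DPO-style implicit-reward convention), not code from the paper.

```python
import torch

def gvpo_loss(logp_theta, logp_ref, rewards, beta=0.1):
    """Sketch of a GVPO-style loss for one prompt with k sampled responses.

    logp_theta: (k,) sequence log-probs of each response under the current policy
    logp_ref:   (k,) sequence log-probs under the reference policy
    rewards:    (k,) scalar reward for each response
    beta:       strength of the KL constraint
    """
    # Implicit reward implied by the policy: beta * log(pi_theta / pi_ref).
    implicit = beta * (logp_theta - logp_ref)
    # Center both implicit and actual rewards within the group; the zero-sum
    # coefficients make the intractable log Z(x) term cancel.
    implicit_centered = implicit - implicit.mean()
    rewards_centered = rewards - rewards.mean()
    # MSE between the centered prediction and the centered target.
    return 0.5 * ((implicit_centered - rewards_centered) ** 2).mean()
```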
Theoretical Guarantees
We show that the unique optimal policy of GVPO coincides with the optimal solution of KL‑constrained reward maximization.
Theorem 1
Minimizing the GVPO objective yields a unique policy that exactly solves the KL‑constrained reward‑maximization problem.
Theorem 2
Minimizing the same objective yields a unique policy for any distribution satisfying the zero‑sum gradient condition; softmax‑decoded policies meet this condition, so GVPO supports a broad class of sampling distributions.
Theorem 3
Using the n‑step GVPO algorithm maximizes a trust‑region‑constrained objective, guaranteeing stable updates and allowing the final policy to remain aligned with the initial one after n steps.
Comparison with DPO
Both GVPO and DPO exploit the analytical solution of KL‑constrained reward maximization. DPO relies on a BT model to cancel Z(x), while GVPO leverages the zero‑sum gradient property, making it applicable to multi‑response scenarios.
This shared structure yields two practical benefits: it ensures stable optimization without excessive policy drift, and it simplifies a joint policy-and-reward optimization into a reward-only problem.
Moreover, DPO may have multiple optimal solutions, whereas GVPO’s optimal solution is provably unique (Theorem 1).
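For comparison, DPO's pairwise loss cancels log Z(x) because the two implicit rewards enter with coefficients +1 and -1; GVPO generalizes this zero-sum cancellation from a preference pair to a group of k responses. The formula below is the standard DPO objective, written in our notation.

```latex
% Standard DPO loss over a preferred/rejected pair (y_w, y_l):
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\Big(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big).

% The +1/-1 coefficients on the two implicit rewards sum to zero, so the
% \beta \log Z(x) term inside each implicit reward cancels in the difference.
```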
Comparison with GRPO and Standard Policy‑Gradient Methods
We rewrite the GRPO loss and identify three components in GVPO’s loss: (1) advantage maximization, (2) a KL‑based trust‑region constraint, and (3) entropy regularization that balances exploration and exploitation.
GVPO avoids on‑policy sampling inefficiencies and the variance introduced by importance‑sampling corrections used in PPO/GRPO.
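To make the contrast concrete, the snippet below shows the clipped importance-ratio surrogate that PPO/GRPO optimize (toy tensors; the clipping threshold eps = 0.2 is the conventional PPO default, not a value from the paper) and notes why a GVPO-style loss has no such ratio.

```python
import torch

# PPO/GRPO-style clipped surrogate: requires the importance ratio against the
# sampling policy, which adds variance once samples drift off-policy.
logp_theta = torch.randn(4, requires_grad=True)  # current-policy log-probs (toy)
logp_old = torch.randn(4)                        # sampling-policy log-probs (toy)
advantages = torch.randn(4)                      # group-normalized advantages (toy)
eps = 0.2
ratio = (logp_theta - logp_old).exp()
# Objective to be maximized (the training loss is its negative).
ppo_objective = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()

# A GVPO-style loss never forms this ratio: it only compares group-centered
# implicit rewards with group-centered actual rewards (see the gvpo_loss sketch
# above), so responses may come from distributions other than the current policy.
```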
Algorithm Overview
At each step, the GVPO update constrains the new policy against the policy from the previous step; chaining n such steps keeps the final policy aligned with the initial one (Theorem 3).
Implementation in the verl framework requires only a few code modifications; a rough illustrative sketch of the loss computation follows.
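The sketch below is a framework-agnostic illustration of where such a modification lands, written under our own assumptions rather than against verl's actual API; the function name compute_gvpo_loss, the tensor shapes, and response_mask are hypothetical.

```python
import torch

def compute_gvpo_loss(logp_theta, logp_ref, response_mask, rewards, beta=0.1):
    """Generic sketch of a grouped GVPO-style loss (not verl's actual code).

    logp_theta:    (B, k, T) per-token log-probs under the trainable policy
    logp_ref:      (B, k, T) per-token log-probs under the reference policy
    response_mask: (B, k, T) 1 for response tokens, 0 for prompt/padding tokens
    rewards:       (B, k)    scalar reward per sampled response
    """
    # Sequence-level log-probs: sum over response tokens only.
    seq_theta = (logp_theta * response_mask).sum(-1)   # (B, k)
    seq_ref = (logp_ref * response_mask).sum(-1)       # (B, k)

    # Implicit reward implied by the policy, centered within each group of k
    # responses so that the intractable log Z(x) term cancels.
    implicit = beta * (seq_theta - seq_ref)
    implicit_centered = implicit - implicit.mean(dim=1, keepdim=True)
    rewards_centered = rewards - rewards.mean(dim=1, keepdim=True)

    # MSE between centered implicit rewards and centered actual rewards.
    return 0.5 * ((implicit_centered - rewards_centered) ** 2).mean()
```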
Paper: GVPO: Group Variance Policy Optimization for Large Language Model Post-Training
Link: https://arxiv.org/abs/2504.19599
Conclusion
GVPO introduces a simple MSE‑based loss that provides a unique optimal solution, stable training, and the ability to sample from richer distributions without on‑policy constraints. The method integrates easily into existing GRPO pipelines.