Group Policy Gradient: Direct Objective Optimization for Faster Reinforcement Learning
The article introduces Group Policy Gradient (GPG), a reinforcement‑learning framework that eliminates surrogate loss functions and critic models, directly optimizes the original objective, reduces bias and variance, and achieves state‑of‑the‑art performance on both single‑modal and multimodal tasks.
Background
Large language models (LLMs) are often fine‑tuned with reinforcement learning (RL) using algorithms such as PPO and GRPO. These methods rely on surrogate loss functions, a critic (value) network, and KL regularization, all of which increase training cost and introduce bias.
Motivation
The authors target two open problems: (1) eliminate surrogate objectives and directly optimize the original RL return; (2) simplify the training pipeline by removing the critic and KL constraints while preserving performance.
Group Policy Gradient (GPG)
GPG introduces a policy‑gradient method that operates on groups of trajectories. For each group g of size N, the mean reward is \bar{r}_g = (1/N) \sum_{i=1}^{N} r_i, and the advantage for sample i is defined as A_i = r_i - \bar{r}_g. The gradient estimate for policy parameters \theta becomes

\nabla_\theta J \approx (1/N) \sum_i A_i \nabla_\theta \log \pi_\theta(a_i \mid s_i) \cdot c_g,

where c_g is a dynamic correction factor that down‑weights updates dominated by groups containing only correct or only incorrect samples:

c_g = (N - N_{invalid}) / N,

where N_{invalid} counts groups with homogeneous outcomes. This factor automatically reduces the contribution of biased groups, mitigating gradient‑estimation bias.
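A minimal PyTorch sketch of this estimator under one reading of the formulas above. The summary is ambiguous about whether N_{invalid} counts samples or groups; this sketch takes the group reading. The batch-of-groups layout and names such as `gpg_loss`, `log_probs`, and `rewards` are illustrative, not from the paper or its code release.

```python
import torch

def gpg_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """GPG-style loss for a batch of M groups, each with N sampled responses.

    log_probs: (M, N) log pi_theta(a_i | s_i) per sampled response
               (carries gradients through the policy).
    rewards:   (M, N) scalar reward r_i per response (no gradient needed).
    """
    m, _ = rewards.shape
    # Group-relative advantage: A_i = r_i - mean reward of its own group.
    advantages = rewards - rewards.mean(dim=1, keepdim=True)
    # A group is homogeneous ("invalid") when every sample has the same
    # outcome (all-correct or all-incorrect); its advantages are all zero.
    invalid = (rewards == rewards[:, :1]).all(dim=1)
    # Dynamic correction factor, read as (groups - invalid groups) / groups.
    c = (m - invalid.sum()) / m
    # REINFORCE-style loss: minimizing it ascends the expected return.
    return -(advantages.detach() * log_probs).mean() * c
```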
Key components
Direct objective optimization: the loss is the negative expected return; no surrogate objective or KL term is used (see the sketch after this list).
Critic‑free architecture: group‑level baselines replace the traditional value network, simplifying the training pipeline.
Accurate Gradient Estimation (AGE): the correction factor c_g addresses the "all‑correct / all‑incorrect" bias that would otherwise produce zero gradients for homogeneous groups.
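To make the first two components concrete, here is a hedged side‑by‑side sketch contrasting a standard PPO‑style clipped surrogate with the plain policy‑gradient loss that GPG optimizes. The PPO form below is the textbook clipped‑ratio objective, not code from the GPG repository, and the function names are illustrative.

```python
import torch

def ppo_surrogate_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # PPO optimizes a clipped surrogate of the return, built on the
    # probability ratio against a frozen behavior policy.
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

def gpg_direct_loss(log_probs, advantages):
    # GPG's loss is just the negative expected (group-relative) return:
    # no ratio, no clipping, no KL penalty, no critic network.
    return -(advantages * log_probs).mean()
```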
Bias analysis of prior methods
Existing approaches (PPO, ReMax, GRPO) suffer from two main biases:
Advantage‑function bias: baselines that embed reward information distort the true advantage.
Gradient‑estimation bias from homogeneous groups: when a group's samples are all successes or all failures, standard policy‑gradient estimates become zero, leading to under‑training (illustrated below).
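A tiny worked example of this failure mode, using a hypothetical group of four all‑correct responses:

```python
rewards = [1.0, 1.0, 1.0, 1.0]              # all-correct group
mean_r = sum(rewards) / len(rewards)        # 1.0
advantages = [r - mean_r for r in rewards]  # [0.0, 0.0, 0.0, 0.0]
# Every advantage is zero, so A_i * grad log pi_theta vanishes for the
# whole group and these samples yield no learning signal (under-training).
```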
Experiments
Single‑modal benchmarks
GPG was evaluated on standard language‑model fine‑tuning benchmarks (e.g., Alpaca, GSM8K). Compared with PPO, ReMax, and GRPO, GPG achieved higher final accuracy (e.g., 47.8% vs. 43.9% for the baseline) and converged in fewer epochs.
Multi‑modal benchmarks
On tasks that combine text and vision—mathematical reasoning with diagrams, visual question answering, and cross‑modal inference—GPG consistently outperformed the same baselines, establishing new state‑of‑the‑art results.
Results
Across all settings GPG improves accuracy by 3–4 percentage points, reduces training variance, and lowers computational overhead because no critic network is required.
Conclusion
GPG provides a simple, critic‑free reinforcement‑learning fine‑tuning pipeline that directly optimizes the original RL objective and automatically corrects gradient bias. The method scales to both single‑modal and multi‑modal LLMs and is released as open source.
Paper: https://arxiv.org/pdf/2504.02546
Code: https://github.com/AMAP-ML/GPG