Group Policy Gradient: Direct Objective Optimization for Faster Reinforcement Learning
The article introduces Group Policy Gradient (GPG), a reinforcement‑learning framework that eliminates surrogate loss functions and critic models, directly optimizes the original objective, reduces bias and variance, and achieves state‑of‑the‑art performance on both single‑modal and multimodal tasks.
Background
Large language models (LLMs) are often fine‑tuned with reinforcement learning (RL) using algorithms such as PPO and GRPO. These methods rely on surrogate loss functions, a critic (value) network, and KL regularization, all of which increase training cost and introduce bias.
Motivation
The authors target two open problems: (1) eliminate surrogate objectives and directly optimize the original RL return; (2) simplify the training pipeline by removing the critic and KL constraints while preserving performance.
Group Policy Gradient (GPG)
GPG introduces a policy‑gradient method that operates on groups of trajectories. For each group g of size N, the mean reward is \bar{r}_g = (1/N) \sum_{i=1}^{N} r_i, and the advantage for sample i is defined as A_i = r_i - \bar{r}_g. The gradient estimate for policy parameters \theta becomes

\nabla_\theta J \approx (1/N) \sum_i A_i \nabla_\theta \log \pi_\theta(a_i \mid s_i) \cdot c_g,

where c_g is a dynamic correction factor that down‑weights updates dominated by groups containing only correct or only incorrect samples:

c_g = (N - N_{invalid}) / N,

where N_{invalid} counts groups with homogeneous outcomes. This factor automatically reduces the contribution of biased groups, mitigating gradient‑estimation bias.
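A minimal PyTorch sketch of this estimator under one reading of the formulas above. The summary is ambiguous about whether N_{invalid} counts samples or groups; this sketch takes the group reading. The batch-of-groups layout and names such as `gpg_loss`, `log_probs`, and `rewards` are illustrative, not from the paper or its code release.

```python
import torch

def gpg_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """GPG-style loss for a batch of M groups, each with N sampled responses.

    log_probs: (M, N) log pi_theta(a_i | s_i) per sampled response
               (carries gradients through the policy).
    rewards:   (M, N) scalar reward r_i per response (no gradient needed).
    """
    m, _ = rewards.shape
    # Group-relative advantage: A_i = r_i - mean reward of its own group.
    advantages = rewards - rewards.mean(dim=1, keepdim=True)
    # A group is homogeneous ("invalid") when every sample has the same
    # outcome (all-correct or all-incorrect); its advantages are all zero.
    invalid = (rewards == rewards[:, :1]).all(dim=1)
    # Dynamic correction factor, read as (groups - invalid groups) / groups.
    c = (m - invalid.sum()) / m
    # REINFORCE-style loss: minimizing it ascends the expected return.
    return -(advantages.detach() * log_probs).mean() * c
```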
Key components
Direct objective optimization: the loss is the negative expected return; no surrogate objective or KL term is used (see the sketch after this list).
Critic‑free architecture: group‑level baselines replace the traditional value network, simplifying the training pipeline.
Accurate Gradient Estimation (AGE): the correction factor c_g addresses the "all‑correct / all‑incorrect" bias that would otherwise produce zero gradients for homogeneous groups.
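To make the first two components concrete, here is a hedged side‑by‑side sketch contrasting a standard PPO‑style clipped surrogate with the plain policy‑gradient loss that GPG optimizes. The PPO form below is the textbook clipped‑ratio objective, not code from the GPG repository, and the function names are illustrative.

```python
import torch

def ppo_surrogate_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # PPO optimizes a clipped surrogate of the return, built on the
    # probability ratio against a frozen behavior policy.
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

def gpg_direct_loss(log_probs, advantages):
    # GPG's loss is just the negative expected (group-relative) return:
    # no ratio, no clipping, no KL penalty, no critic network.
    return -(advantages * log_probs).mean()
```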
Bias analysis of prior methods
Existing approaches (PPO, ReMax, GRPO) suffer from two main biases:
Advantage‑function bias: baselines that embed reward information distort the true advantage.
Gradient‑estimation bias from homogeneous groups: when a group's samples are all successes or all failures, standard policy‑gradient estimates become zero, leading to under‑training (illustrated below).
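A tiny worked example of this failure mode, using a hypothetical group of four all‑correct responses:

```python
rewards = [1.0, 1.0, 1.0, 1.0]              # all-correct group
mean_r = sum(rewards) / len(rewards)        # 1.0
advantages = [r - mean_r for r in rewards]  # [0.0, 0.0, 0.0, 0.0]
# Every advantage is zero, so A_i * grad log pi_theta vanishes for the
# whole group and these samples yield no learning signal (under-training).
```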
Experiments
Single‑modal benchmarks
GPG was evaluated on standard language‑model fine‑tuning benchmarks (e.g., Alpaca, GSM8K). Compared with PPO, ReMax, and GRPO, GPG achieved higher final accuracy (e.g., 47.8% vs. 43.9% for the baseline) and converged in fewer epochs.
Multi‑modal benchmarks
On tasks that combine text and vision—mathematical reasoning with diagrams, visual question answering, and cross‑modal inference—GPG consistently outperformed the same baselines, establishing new state‑of‑the‑art results.
Results
Across all settings GPG improves accuracy by 3–4 percentage points, reduces training variance, and lowers computational overhead because no critic network is required.
Conclusion
GPG provides a simple, critic‑free reinforcement‑learning fine‑tuning pipeline that directly optimizes the original RL objective and automatically corrects gradient bias. The method scales to both single‑modal and multi‑modal LLMs and is released as open source.
Paper: https://arxiv.org/pdf/2504.02546
Code: https://github.com/AMAP-ML/GPG