Why High RM Scores Don't Guarantee Better LLMs: 7 RLHF Tricks for Stable PPO Training
The article examines why rising RM scores in large‑model training don't ensure superior LLM performance and presents seven practical RLHF tricks—ranging from KL‑penalty to global gradient clipping—that improve PPO stability and reduce resource overhead.
Background
Higher Reward Model (RM) scores do not automatically imply a better final language model. The difficulty lies in the hidden pitfalls of Reinforcement Learning from Human Feedback (RLHF) training.
Training cost and instability
Running Proximal Policy Optimization (PPO) on a 7B LLaMA model requires roughly twice the memory of standard supervised fine‑tuning because separate policy, critic, reference, and reward models must be kept in GPU memory. In practice this forces the use of 80 GB A100 GPUs; scaling to larger models quickly becomes financially prohibitive.
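The "roughly twice" figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts model states only and rests on assumptions that do not come from the article: mixed-precision Adam at about 16 bytes per trainable parameter (fp16 weights and gradients plus fp32 master weights and two Adam moments), 2 bytes per parameter for frozen fp16 models, and all four models at 7B parameters. Activations and generation buffers are ignored.

```python
# Back-of-the-envelope model-state memory for SFT vs. PPO-based RLHF.
# Assumptions (not from the article): mixed-precision Adam, all models 7B,
# activations and generation buffers ignored.

PARAMS = 7e9            # parameters per model
BYTES_TRAINABLE = 16    # fp16 weights + fp16 grads + fp32 master weights, m, v
BYTES_FROZEN = 2        # fp16 weights only (inference)
GB = 1e9


def model_state_gb(n_trainable: int, n_frozen: int) -> float:
    """Memory in GB for the given number of trainable and frozen models."""
    bytes_per_param = n_trainable * BYTES_TRAINABLE + n_frozen * BYTES_FROZEN
    return PARAMS * bytes_per_param / GB


sft_gb = model_state_gb(n_trainable=1, n_frozen=0)  # policy only
ppo_gb = model_state_gb(n_trainable=2, n_frozen=2)  # policy + critic, reference + RM

print(f"SFT: {sft_gb:.0f} GB, PPO: {ppo_gb:.0f} GB, ratio: {ppo_gb / sft_gb:.2f}x")
```

Under these assumptions the model states come to about 112 GB for SFT and 252 GB for PPO, a ratio of 2.25, which is consistent with the rough estimate above.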
RLHF training is also unstable: models may diverge into endless repetition, produce nonsensical continuations, or collapse to a single end‑of‑sentence token. These failure modes are analogous to the extreme behaviors observed in poorly tuned reinforcement‑learning agents.
Why PPO is hard
Effective PPO training demands high‑quality supervised‑fine‑tuning (SFT) data, carefully curated reward prompts, and precise hyper‑parameter settings. Missing any of these ingredients often leads to training collapse.
Fudan technical report – 7 RLHF stability tricks
The report (code available at https://github.com/OpenLMLab/MOSS-RLHF) proposes seven concrete techniques that improve PPO stability.
Token‑level KL‑divergence penalty – penalizes large deviations from a reference model to keep updates stable.
kl_penalty = (-self.kl_penalty_weight * (logprobs - ref_logprobs)).cpu()
Reward normalization and clipping – caps extreme reward values and normalizes them to a stable range.
self.use_reward_clip: bool = opt.use_reward_clip
self.use_reward_norm: bool = opt.use_reward_norm
self.use_advantage_norm: bool = opt.use_advantage_norm
self.use_advantage_clip: bool = opt.use_advantage_clip
self.use_critic_loss_clip: bool = opt.use_critic_loss_clip
self.use_policy_loss_clip: bool = opt.use_policy_loss_clip
Value‑function loss clipping – limits the magnitude of value‑function updates, similar to gradient clipping.
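Value-function loss clipping is commonly implemented by keeping the new value prediction within a fixed range of the old one and taking the larger of the clipped and unclipped squared errors. The following is a minimal sketch of that common formulation in plain Python; the function name, the `clip_range` default, and the 0.5 factor are illustrative assumptions, and the report's own implementation may differ.

```python
def clipped_value_loss(values, old_values, returns, clip_range=0.2):
    """Mean clipped value loss over a batch of scalar value predictions.

    Hypothetical helper for illustration, not taken from the report's code.
    """
    losses = []
    for v, v_old, ret in zip(values, old_values, returns):
        # Keep the new prediction within clip_range of the old prediction.
        v_clipped = v_old + max(-clip_range, min(clip_range, v - v_old))
        loss_unclipped = (v - ret) ** 2
        loss_clipped = (v_clipped - ret) ** 2
        # Pessimistic choice: use the larger of the two squared errors.
        losses.append(max(loss_unclipped, loss_clipped))
    return 0.5 * sum(losses) / len(losses)


# The prediction jumped from 0.0 to 1.0; the clipped branch dominates the loss.
print(clipped_value_loss([1.0], [0.0], [1.0], clip_range=0.2))
```

Because the larger error is kept, a critic that moves far from its previous estimate in one step gets no credit for the part of the move beyond the clip range.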
Critic model initialization – instead of initializing the critic with the reward model, pre‑train a dedicated critic model for more reliable value estimates.
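Generalized Advantage Estimation (GAE) – estimates advantages as an exponentially weighted sum of temporal‑difference errors, with the parameter λ trading off bias against variance; the report’s appendix C.3 contains detailed GAE hyper‑parameter experiments.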
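For readers unfamiliar with GAE, here is a minimal sketch of the standard backward recursion for a single trajectory. The function name and the default values of `gamma` and `lam` are illustrative assumptions, not settings taken from the report.

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Return (advantages, returns) for one trajectory.

    rewards and values hold one entry per step; the value after the
    final step is assumed to be 0 (the episode has ended).
    """
    advantages = [0.0] * len(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of the current and future TD errors.
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


advantages, returns = compute_gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=0.5)
print(advantages)
```

With `lam=0` each advantage reduces to the one-step TD error (low variance, more bias); with `lam=1` it becomes the full Monte Carlo return minus the value baseline (less bias, more variance).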
Clipped surrogate objective – clips the probability ratio between the new and old policy so that a single update cannot move the policy too far, improving efficiency over vanilla policy‑gradient.
if self.use_entropy_loss:
    loss1 = pg_loss + self.vf_loss_weight * vf_loss + self.entropy_loss_weight * entro_loss
else:
    loss1 = pg_loss + self.vf_loss_weight * vf_loss
loss2 = self.ppo_pretrain_loss_weight * pretrain_loss
loss = loss1 + loss2
Global gradient clipping – uniformly caps gradient norms across the entire network, reinforcing the stability benefits of the other clipping techniques.
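Global gradient clipping rescales all gradients by one shared factor whenever their combined L2 norm exceeds a threshold. The dependency-free sketch below illustrates the idea on plain lists; the function name and the small epsilon are assumptions. In PyTorch the same operation is available as `torch.nn.utils.clip_grad_norm_`, though the article does not say which call the report's code uses.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so that their joint L2 norm is at most max_norm.

    grads is a list of flat lists, one per parameter tensor.
    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # One shared scale factor keeps the gradient direction unchanged.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


clipped, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm, clipped)
```

Because every tensor shares the same factor, the update direction is preserved and only its length is capped, unlike per-element clipping.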
In addition, the authors incorporate an InstructGPT‑style pre‑training loss (llm_pretrain_loss) during RLHF to further guide the model; in the snippet above it enters the total loss as loss2, weighted by ppo_pretrain_loss_weight.
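For reference, the clipped surrogate objective that presumably produces the policy term (`pg_loss` in the snippet above) can be sketched as follows. This is the standard PPO formulation written in plain Python; the function name and the `clip_range` default are illustrative assumptions rather than the report's code.

```python
import math


def clipped_policy_loss(logprobs, old_logprobs, advantages, clip_range=0.2):
    """Mean PPO clipped surrogate loss over a batch of tokens."""
    losses = []
    for lp, lp_old, adv in zip(logprobs, old_logprobs, advantages):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(lp - lp_old)
        clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
        # PPO maximizes min(ratio * A, clipped_ratio * A); negate for a loss.
        losses.append(-min(ratio * adv, clipped_ratio * adv))
    return sum(losses) / len(losses)


# A token whose probability rose by 50% with a positive advantage:
# the objective is capped at (1 + clip_range) * advantage.
print(clipped_policy_loss([math.log(1.5)], [0.0], [1.0]))
```

The minimum makes the clip one-sided: gains from moving further in a favorable direction are capped, while the penalty for a ratio that makes a bad action more likely is left unclipped.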
These seven tricks focus on stabilizing PPO training through careful clipping, proper initialization, and loss redesign. Practitioners are encouraged to explore the open‑source implementation for hands‑on experimentation.
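As a starting point for such experiments, the reward-side tricks (reward normalization, reward clipping, and the token-level KL penalty) can be combined in a few lines. The sketch below is a simplified illustration under stated assumptions: rewards are normalized with statistics from the current batch, the clipped score is added at the last token of each response, and the function name and default weights are hypothetical. The KL term mirrors the `kl_penalty` expression quoted earlier.

```python
import statistics


def shape_rewards(rm_scores, logprobs, ref_logprobs, kl_weight=0.05, clip_value=5.0):
    """Build per-token rewards from RM scores and a token-level KL penalty.

    rm_scores: one scalar per response.
    logprobs, ref_logprobs: per-token log-probabilities for each response
    under the policy and the reference model.
    """
    mean = statistics.fmean(rm_scores)
    std = statistics.pstdev(rm_scores) or 1.0  # guard against zero variance
    shaped = []
    for score, lp, ref_lp in zip(rm_scores, logprobs, ref_logprobs):
        # Normalize, then clip the scalar reward to [-clip_value, clip_value].
        norm = max(-clip_value, min(clip_value, (score - mean) / std))
        # Token-level penalty for drifting away from the reference model.
        token_rewards = [-kl_weight * (a - b) for a, b in zip(lp, ref_lp)]
        # The sequence-level reward is credited to the final token.
        token_rewards[-1] += norm
        shaped.append(token_rewards)
    return shaped


shaped = shape_rewards(
    rm_scores=[1.0, 3.0],
    logprobs=[[-1.0, -2.0], [-1.0, -1.0]],
    ref_logprobs=[[-1.0, -1.0], [-1.0, -1.0]],
    kl_weight=0.1,
)
print(shaped)
```

A production setup would more likely track running reward statistics across batches; batch statistics are used here only to keep the example short.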
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
