Why High RM Scores Don't Guarantee Better LLMs: 7 RLHF Tricks for Stable PPO Training

The article examines why rising RM scores in large‑model training don't ensure superior LLM performance and presents seven practical RLHF tricks—ranging from KL‑penalty to global gradient clipping—that improve PPO stability and reduce resource overhead.

Baobao Algorithm Notes

Background

Higher Reward Model (RM) scores do not automatically imply a better final language model. The difficulty lies in the hidden pitfalls of Reinforcement Learning from Human Feedback (RLHF) training.

Training cost and instability

Running Proximal Policy Optimization (PPO) on a 7B LLaMA model requires roughly twice the memory of standard supervised fine‑tuning because separate policy, critic, reference, and reward models must be kept in GPU memory. In practice this forces the use of 80 GB A100 GPUs; scaling to larger models quickly becomes financially prohibitive.

RLHF training is also unstable: models may diverge into endless repetition, produce nonsensical continuations, or collapse to a single end‑of‑sentence token. These failure modes are analogous to the extreme behaviors observed in poorly tuned reinforcement‑learning agents.

Why PPO is hard

Effective PPO training demands high‑quality supervised‑fine‑tuning (SFT) data, carefully curated reward prompts, and precise hyper‑parameter settings. Missing any of these ingredients often leads to training collapse.

Fudan technical report – 7 RLHF stability tricks

The report (code available at https://github.com/OpenLMLab/MOSS-RLHF) proposes seven concrete techniques that improve PPO stability.

Token‑level KL‑divergence penalty – penalizes large deviations from a reference model to keep updates stable.

kl_penalty = (-self.kl_penalty_weight * (logprobs - ref_logprobs)).cpu()
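A minimal sketch of how a per-token penalty of this form is commonly folded into the reward signal, assuming the scalar RM score is added only on the final generated token. The helper name shape_rewards, the plain-list inputs, and the weight value are hypothetical and not taken from the MOSS-RLHF code.

```python
def shape_rewards(logprobs, ref_logprobs, rm_score, kl_penalty_weight=0.1):
    """Per-token rewards: KL penalty on every token, RM score on the last one.

    Hypothetical helper for illustration; inputs are plain lists of floats.
    """
    if not logprobs or len(logprobs) != len(ref_logprobs):
        raise ValueError("logprobs and ref_logprobs must be non-empty and equal length")
    # Same form as the snippet above: -weight * (logprob - ref_logprob).
    rewards = [-kl_penalty_weight * (lp - ref)
               for lp, ref in zip(logprobs, ref_logprobs)]
    # Assumption: the sequence-level RM score is credited to the final token.
    rewards[-1] += rm_score
    return rewards


rewards = shape_rewards([-1.0, -0.5, -2.0], [-1.2, -0.5, -1.0], rm_score=1.5)
```

Tokens where the policy assigns higher log-probability than the reference model receive a negative reward, which pulls the policy back toward the reference.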

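Reward normalization and clipping – normalizes reward values to a stable range and caps extreme values. The implementation exposes this, together with the advantage and loss clipping options, as configuration flags: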

self.use_reward_clip: bool = opt.use_reward_clip
self.use_reward_norm: bool = opt.use_reward_norm
self.use_advantage_norm: bool = opt.use_advantage_norm
self.use_advantage_clip: bool = opt.use_advantage_clip
self.use_critic_loss_clip: bool = opt.use_critic_loss_clip
self.use_policy_loss_clip: bool = opt.use_policy_loss_clip
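The flags above only switch the behavior on or off. As a rough illustration of what reward normalization followed by clipping can look like, here is a sketch that assumes running statistics over all rewards seen so far. The class RunningMeanStd, the function normalize_and_clip, and the default clip_value are assumptions for illustration, not the repository's code.

```python
import math


class RunningMeanStd:
    """Tracks the mean and variance of scalar rewards (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population standard deviation; 0.0 before any sample is seen.
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


def normalize_and_clip(rewards, stats, clip_value=5.0, eps=1e-8):
    """Update running stats, standardize the batch, then clip to +/- clip_value."""
    for r in rewards:
        stats.update(r)
    normed = [(r - stats.mean) / (stats.std + eps) for r in rewards]
    return [max(-clip_value, min(clip_value, r)) for r in normed]


stats = RunningMeanStd()
clipped = normalize_and_clip([1.0, 2.0, 3.0, 100.0], stats, clip_value=1.0)
```

In this example the outlier reward of 100.0 ends up at the clip boundary, so a single extreme RM score cannot dominate the update.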

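Value-function loss clipping – keeps each new value prediction within a fixed range of the prediction made at rollout time, which bounds the size of critic updates in the same spirit as PPO's policy clipping.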
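One common form of value-function loss clipping, used in many PPO implementations, can be sketched as below. The function name clipped_value_loss, the list-based inputs, and the default clip_range are assumptions for illustration; the report's own implementation may differ in detail.

```python
def clipped_value_loss(values, old_values, returns, clip_range=0.2):
    """Mean over tokens of 0.5 * max(unclipped, clipped) squared error."""
    losses = []
    for v, v_old, ret in zip(values, old_values, returns):
        # Keep the new prediction within clip_range of the rollout-time prediction.
        v_clipped = v_old + max(-clip_range, min(clip_range, v - v_old))
        # Taking the larger error removes any incentive to jump far from v_old.
        losses.append(max((v - ret) ** 2, (v_clipped - ret) ** 2))
    return 0.5 * sum(losses) / len(losses)


loss_far = clipped_value_loss([1.0], [0.0], [1.0])   # prediction jumped by 1.0
loss_near = clipped_value_loss([0.1], [0.0], [1.0])  # prediction moved by 0.1
```

In the first call the new prediction matches the return exactly, yet the loss is still positive because the prediction moved further from the old value than clip_range allows.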

Critic model initialization – instead of initializing the critic with the reward model, pre‑train a dedicated critic model for more reliable value estimates.

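Generalized Advantage Estimation (GAE) – estimates advantages as an exponentially weighted sum of temporal-difference residuals, with the parameter λ trading bias against variance. The report's Appendix C.3 contains detailed GAE hyper-parameter experiments.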
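For reference, the standard GAE recursion can be sketched in a few lines. The function name compute_gae, the default gamma and lam values, and the list-based inputs are assumptions for illustration; they are not the settings studied in the report.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Return (advantages, returns) for one trajectory.

    rewards[t] and values[t] are per-step floats; last_value is the value
    estimate after the final step (0.0 for a finished episode).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # Temporal-difference residual at step t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of this and all later residuals.
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


adv_mc, ret_mc = compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
adv_td, _ = compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=0.0)
```

With lam=1.0 the estimate reduces to the Monte Carlo return minus the value baseline; with lam=0.0 it reduces to the one-step TD residual.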

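Clipped surrogate objective – clips the probability ratio between the new and old policy so that no single update moves the policy too far, which makes training more stable and sample-efficient than a vanilla policy-gradient update. In the implementation, the resulting policy loss (pg_loss) is combined with the value-function loss, an optional entropy term, and the pre-training loss: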

if self.use_entropy_loss:
    loss1 = pg_loss + self.vf_loss_weight * vf_loss + self.entropy_loss_weight * entro_loss
else:
    loss1 = pg_loss + self.vf_loss_weight * vf_loss
loss2 = self.ppo_pretrain_loss_weight * pretrain_loss
loss = loss1 + loss2
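The snippet above shows how the loss terms are summed but not how the policy-gradient term itself is clipped. A minimal sketch of the standard PPO clipped surrogate follows; the function name clipped_surrogate_loss, the list-based inputs, and the default clip_eps are assumptions for illustration.

```python
import math


def clipped_surrogate_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean PPO policy loss (negative clipped surrogate objective)."""
    losses = []
    for lp, old_lp, adv in zip(logprobs, old_logprobs, advantages):
        # Probability ratio between the current policy and the rollout policy.
        ratio = math.exp(lp - old_lp)
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic bound: take the smaller objective, then negate for a loss.
        losses.append(-min(ratio * adv, clipped_ratio * adv))
    return sum(losses) / len(losses)


# The current policy is twice as likely as the rollout policy to emit this token.
old_lp = -0.6931471805599453  # log(0.5)
loss_pos = clipped_surrogate_loss([0.0], [old_lp], [1.0])
loss_neg = clipped_surrogate_loss([0.0], [old_lp], [-1.0])
```

With a positive advantage the ratio is capped at 1 + clip_eps, so there is no extra gain from moving further. With a negative advantage the unclipped term is kept, so the penalty for a harmful move is not reduced.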

Global gradient clipping – uniformly caps gradient norms across the entire network, reinforcing the stability benefits of the other clipping techniques.
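Global clipping rescales all gradients by one shared factor whenever their combined L2 norm exceeds a threshold. The sketch below shows the idea on plain Python lists; the function name clip_grad_global_norm and the dict-of-lists layout are assumptions for illustration. In PyTorch the same operation is usually done with torch.nn.utils.clip_grad_norm_.

```python
import math


def clip_grad_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    grads maps a parameter name to a list of floats. Returns the scaled
    gradients and the total norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads.values() for g in tensor))
    # One shared factor, so the gradient direction is preserved.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = {name: [g * scale for g in tensor] for name, tensor in grads.items()}
    return clipped, total_norm


big, big_norm = clip_grad_global_norm({"w": [3.0], "b": [4.0]}, max_norm=1.0)
small, small_norm = clip_grad_global_norm({"w": [0.3], "b": [0.4]}, max_norm=1.0)
```

Gradients whose combined norm is already below max_norm pass through unchanged.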

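In addition, the authors incorporate an InstructGPT-style pre-training loss (llm_pretrain_loss) during RLHF to help preserve the model's general language ability. It appears as the weighted pretrain_loss term (loss2) in the loss combination shown under the clipped surrogate objective.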

These seven tricks focus on stabilizing PPO training through careful clipping, proper initialization, and loss redesign. Practitioners are encouraged to explore the open‑source implementation for hands‑on experimentation.

Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
