Baobao Algorithm Notes
Jul 16, 2023 · Artificial Intelligence
Why High RM Scores Don't Guarantee Better LLMs: 7 RLHF Tricks for Stable PPO Training
The article examines why a rising reward model (RM) score during large-model training does not guarantee a better LLM, and presents seven practical RLHF tricks, ranging from a KL penalty to global gradient clipping, that improve PPO stability and reduce resource overhead.
LLM training · PPO · RLHF
7 min read
