Uncovering Hidden Assumptions in RLHF: Theory, DPO & PPO Pitfalls
This article analyzes the implicit assumptions behind the RLHF optimization objective, examines how those assumptions limit DPO and PPO, and proposes practical improvements, such as rejection sampling and online on-policy data selection, to narrow the gap between theory and practice.
