Uncovering Hidden Assumptions in RLHF: Theory, DPO & PPO Pitfalls
This article analytically explores the implicit assumptions behind the RLHF optimization objective, examines how they limit DPO and PPO methods, and proposes practical improvements such as rejection sampling and online on‑policy data selection to narrow the gap between theory and practice.
Hidden Assumptions of the RLHF Objective
The original RLHF objective assumes (1) a perfect, fully generalizable reward function r and (2) on‑policy sampling: the response y is drawn from the exact distribution induced by the current policy for a given prompt x.
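For concreteness, the objective can be written as follows; here π_ref is the frozen reference/SFT policy and β is the KL-penalty weight (standard notation borrowed from the DPO paper, not stated in the original article):

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\bigl[\, r(x, y) \,\bigr]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\bigl[\, \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \,\bigr]
```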
Exhaustive Enumeration
Conceptually, consider the set of all possible output distributions π(y|x) and evaluate the RLHF objective for each while keeping the reward function fixed. Algebraic manipulation yields a closed-form optimal policy, which involves a partition function Z(x) that normalizes the distribution.
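Carrying that manipulation through gives the familiar closed form (the same expression the DPO derivation starts from):

```latex
\pi^{*}(y \mid x)
= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Bigl(\tfrac{1}{\beta}\, r(x, y)\Bigr),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,
       \exp\!\Bigl(\tfrac{1}{\beta}\, r(x, y)\Bigr)
```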
Practical Issues
Enumerating all distributions is computationally infeasible.
Most sampled y would lie outside the high‑probability region of the true distribution, making the process inefficient.
Rejection Sampling
Restricting the proposal distribution so that it tightly covers the target region turns the process into rejection sampling: sample candidates from a proposal close to the optimal policy and keep only those consistent with the closed-form target.
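A minimal sketch of the idea, in the spirit of RSO's statistical rejection sampling: draw candidates from a proposal policy (e.g. the SFT model) and accept each with probability exp((r − r_max)/β), so the accepted set approximately follows the closed-form target π*. The `proposal_policy.generate` and `reward_model.score` calls are placeholder interfaces, not any particular library's API.

```python
import math
import random

def rejection_sample(prompt, proposal_policy, reward_model, beta=0.5, num_candidates=32):
    """Approximately sample from pi*(y|x) ∝ pi_ref(y|x) * exp(r(x,y)/beta).

    Candidates are drawn from the proposal (e.g. the SFT policy); each is accepted
    with probability exp((r - r_max) / beta), where r_max is the best reward among
    the candidates (an empirical stand-in for the true maximum).
    """
    candidates = [proposal_policy.generate(prompt) for _ in range(num_candidates)]
    rewards = [reward_model.score(prompt, y) for y in candidates]
    r_max = max(rewards)

    accepted = []
    for y, r in zip(candidates, rewards):
        accept_prob = math.exp((r - r_max) / beta)   # in (0, 1]; higher reward -> more likely kept
        if random.random() < accept_prob:
            accepted.append((y, r))
    return accepted
```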
Summary of Assumptions
Assumption 1: Data collection and distribution – the sampled y must come from the same distribution that the objective evaluates (on‑policy).
Assumption 2: Reward function generalization – the reward function is perfect and unbiased, which rarely holds in practice.
Problems with Direct Preference Optimization (DPO)
DPO derives its loss from the closed-form solution above and assumes an optimal reward function. In practice, DPO training data are off-policy (human-labelled, SFT-generated, or synthetic), so they may not cover the true distribution.
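Concretely, plugging the closed form into a Bradley–Terry preference model yields the standard DPO loss over preference pairs, where y_w is the chosen and y_l the rejected response:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\Bigl[
  \log \sigma\!\Bigl(
    \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    \;-\;
    \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \Bigr)
\Bigr]
```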
Data Distribution Shift
Unexpected shift: training data lie outside the true distribution. Mitigated by applying rejection sampling (the RSO variant) to select on‑policy samples.
Expected shift: even deliberately collected data may miss high‑probability regions, causing the model to ignore desired behavior.
Reward/Loss Limitations
During DPO training, the gradient magnitudes with respect to the raw probabilities satisfy
|∂loss/∂π(reject|x)| / |∂loss/∂π(chosen|x)| = π(chosen|x) / π(reject|x).
Whenever the chosen response has higher probability than the rejected one, the gradient pushing the reject probability down outweighs the gradient pulling the chosen probability up; in that regime the implicit rewards of both responses can decrease simultaneously.
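A short derivation of that ratio: write u for the margin inside the DPO sigmoid and differentiate the loss −log σ(u) with respect to the raw probabilities rather than the log-probabilities:

```latex
u \;=\; \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)},
\qquad \mathcal{L} = -\log \sigma(u)

\frac{\partial \mathcal{L}}{\partial \pi_\theta(y_w \mid x)}
   = -\,\frac{\beta\,\bigl(1 - \sigma(u)\bigr)}{\pi_\theta(y_w \mid x)},
\qquad
\frac{\partial \mathcal{L}}{\partial \pi_\theta(y_l \mid x)}
   = +\,\frac{\beta\,\bigl(1 - \sigma(u)\bigr)}{\pi_\theta(y_l \mid x)}

\Longrightarrow\quad
\frac{\bigl|\partial \mathcal{L} / \partial \pi_\theta(y_l \mid x)\bigr|}
     {\bigl|\partial \mathcal{L} / \partial \pi_\theta(y_w \mid x)\bigr|}
= \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)}
```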
Mitigation strategies include clipping the reject reward/probability, adding baselines, or regularization terms.
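As one hedged illustration (a sketch of the general idea, not a recipe from this article): a DPO-style loss that floors the rejected log-ratio so gradients stop once the rejected response is already unlikely, and that adds an NLL/SFT term on the chosen response as a regularizer. The inputs `chosen_logratio`, `rejected_logratio` (per-example log π_θ − log π_ref) and `chosen_nll` are assumed shapes, not an existing API.

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_mitigations(chosen_logratio, rejected_logratio, chosen_nll,
                              beta=0.1, nll_coef=0.2, reject_floor=-5.0):
    """DPO loss plus two common mitigations.

    chosen_logratio / rejected_logratio: log pi_theta(y|x) - log pi_ref(y|x), shape (batch,)
    chosen_nll: per-example negative log-likelihood of the chosen response under pi_theta
    """
    # Floor the rejected log-ratio: below the floor the clamp output is constant,
    # so no gradient keeps pushing an already-unlikely rejected response further down.
    rejected_logratio = torch.clamp(rejected_logratio, min=reject_floor)

    margin = beta * (chosen_logratio - rejected_logratio)
    dpo = -F.logsigmoid(margin).mean()

    # SFT/NLL regularizer keeps the chosen response's likelihood from collapsing.
    return dpo + nll_coef * chosen_nll.mean()
```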
Summary
DPO fits observed off‑policy data to approximate the optimal distribution, introducing a gap from the original RLHF objective.
Imperfect reward models cause the model to learn “what not to do” better than “what to do”.
Loss modifications (baseline, clipping, regularization) can alleviate these issues.
Problems with Proximal Policy Optimization (PPO)
PPO follows the on‑policy premise of the RLHF objective but still relies on a fixed reward model. If the reward model is biased or trained on out‑of‑distribution data, PPO receives misleading feedback.
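For reference, this is how the reward model's feedback typically enters a PPO-based RLHF implementation (a common formulation, not something specified in this article): the sequence-level reward-model score is paid out at the final token, a per-token KL penalty against the reference policy is subtracted, and the policy is updated with the clipped surrogate objective:

```latex
r_t \;=\; \mathbf{1}[t = T]\; r_{\mathrm{RM}}(x, y)
      \;-\; \beta \Bigl(\log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\mathrm{ref}}(y_t \mid x, y_{<t})\Bigr)

\mathcal{L}_{\mathrm{PPO}}(\theta)
\;=\; -\,\mathbb{E}_t\Bigl[\min\bigl(\rho_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\bigr)\Bigr],
\qquad
\rho_t(\theta) \;=\; \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t})}
```

If r_RM is biased, every advantage estimate built from these rewards is biased too, so the clipped update faithfully optimizes the wrong signal.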
Exploration vs. Exploitation
PPO’s on‑policy updates enable exploration of new responses, whereas DPO’s off‑policy updates focus on exploiting existing preference pairs.
Online + On‑Policy Framework
Combining online data generation with on‑policy sampling can reduce the gap between practice and the RLHF objective. The policy generates new data, which are then labelled (human or AI) and fed back into training.
Active‑learning criteria suggest selecting data points with maximum uncertainty, e.g. (a selection sketch follows the list below):
Extreme reward scores (very high or very low) indicating possible bias.
Large distribution shifts between successive iterations, measured by KL divergence or other distance metrics.
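A minimal sketch of one such selection round, under assumptions: the `generate_with_logprobs`, `logprobs`, and `score` interfaces are hypothetical, the thresholds are illustrative, and none of this comes from the cited papers.

```python
def select_for_labeling(policy, prev_policy, reward_model, prompts,
                        reward_band=(-2.0, 2.0), kl_threshold=0.5):
    """One round of the online + on-policy loop: generate with the current policy,
    then keep only the responses that look most informative to label.

    Selection criteria (the active-learning heuristics above):
      * extreme reward-model scores, which may signal reward-model bias;
      * large per-token KL between the current and previous policy, i.e. prompts
        where the policy's distribution shifted most in this iteration.
    """
    selected = []
    for x in prompts:
        y, logp_new = policy.generate_with_logprobs(x)   # on-policy sample + per-token log-probs
        logp_old = prev_policy.logprobs(x, y)            # same tokens scored under the old policy
        score = reward_model.score(x, y)

        # Monte-Carlo estimate of per-token KL(new || old) on this sampled response.
        kl = sum(ln - lo for ln, lo in zip(logp_new, logp_old)) / max(len(logp_new), 1)

        extreme_reward = score < reward_band[0] or score > reward_band[1]
        large_shift = kl > kl_threshold
        if extreme_reward or large_shift:
            selected.append({"prompt": x, "response": y, "reward": score, "kl": kl})
    return selected   # send these for human / AI labelling, then fold back into training
```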
These ideas are formalized in the paper https://arxiv.org/pdf/2312.11456 and related works such as https://arxiv.org/pdf/2404.04626, https://arxiv.org/pdf/2309.06657, and https://arxiv.org/pdf/2210.10760.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
