Uncovering Hidden Assumptions in RLHF: Theory, DPO & PPO Pitfalls
This article analytically explores the implicit assumptions behind the RLHF optimization objective, examines how they limit DPO and PPO methods, and proposes practical improvements such as rejection sampling and online on‑policy data selection to narrow the gap between theory and practice.
Hidden Assumptions of the RLHF Objective
The original RLHF objective assumes (1) a perfect, fully generalizable reward function r and (2) on‑policy sampling: the response y is drawn from the exact distribution induced by the current policy for a given prompt x.
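For concreteness, the objective can be written as follows; here π_ref is the frozen reference/SFT policy and β is the KL-penalty weight (standard notation borrowed from the DPO paper, not stated in the original article):

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\bigl[\, r(x, y) \,\bigr]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\bigl[\, \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \,\bigr]
```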
Exhaustive Enumeration
Conceptually, consider the set of all possible output distributions π(y|x) and evaluate the RLHF objective for each while keeping the reward function fixed. Algebraic manipulation yields a closed-form optimal policy, which involves a partition function Z(x) that normalizes the distribution.
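Carrying that manipulation through gives the familiar closed form (the same expression the DPO derivation starts from):

```latex
\pi^{*}(y \mid x)
= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Bigl(\tfrac{1}{\beta}\, r(x, y)\Bigr),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,
       \exp\!\Bigl(\tfrac{1}{\beta}\, r(x, y)\Bigr)
```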
Practical Issues
Enumerating all distributions is computationally infeasible.
Most sampled y would lie outside the high‑probability region of the true distribution, making the process inefficient.
Rejection Sampling
Restricting the proposal distribution so that it tightly covers the target region turns the process into rejection sampling: sample candidates from a proposal close to the optimal policy and keep only those consistent with the closed-form target.
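A minimal sketch of the idea, in the spirit of RSO's statistical rejection sampling: draw candidates from a proposal policy (e.g. the SFT model) and accept each with probability exp((r − r_max)/β), so the accepted set approximately follows the closed-form target π*. The `proposal_policy.generate` and `reward_model.score` calls are placeholder interfaces, not any particular library's API.

```python
import math
import random

def rejection_sample(prompt, proposal_policy, reward_model, beta=0.5, num_candidates=32):
    """Approximately sample from pi*(y|x) ∝ pi_ref(y|x) * exp(r(x,y)/beta).

    Candidates are drawn from the proposal (e.g. the SFT policy); each is accepted
    with probability exp((r - r_max) / beta), where r_max is the best reward among
    the candidates (an empirical stand-in for the true maximum).
    """
    candidates = [proposal_policy.generate(prompt) for _ in range(num_candidates)]
    rewards = [reward_model.score(prompt, y) for y in candidates]
    r_max = max(rewards)

    accepted = []
    for y, r in zip(candidates, rewards):
        accept_prob = math.exp((r - r_max) / beta)   # in (0, 1]; higher reward -> more likely kept
        if random.random() < accept_prob:
            accepted.append((y, r))
    return accepted
```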
Summary of Assumptions
Assumption 1: Data collection and distribution – the sampled y must come from the same distribution that the objective evaluates (on‑policy).
Assumption 2: Reward function generalization – the reward function is perfect and unbiased, which rarely holds in practice.
Problems with Direct Preference Optimization (DPO)
DPO derives its loss from the closed-form solution above and assumes an optimal reward function. In practice, DPO training data are off-policy (human-labelled, SFT-generated, or synthetic), so they may not cover the true distribution.
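Concretely, plugging the closed form into a Bradley–Terry preference model yields the standard DPO loss over preference pairs, where y_w is the chosen and y_l the rejected response:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\Bigl[
  \log \sigma\!\Bigl(
    \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    \;-\;
    \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \Bigr)
\Bigr]
```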
Data Distribution Shift
Unexpected shift: training data lie outside the true distribution. Mitigated by applying rejection sampling (the RSO variant) to select on‑policy samples.
Expected shift: even deliberately collected data may miss high‑probability regions, causing the model to ignore desired behavior.
Reward/Loss Limitations
During DPO training, the gradient magnitudes with respect to the raw probabilities satisfy
|∂loss/∂π(reject|x)| / |∂loss/∂π(chosen|x)| = π(chosen|x) / π(reject|x).
Whenever the chosen response has higher probability than the rejected one, the gradient pushing the reject probability down outweighs the gradient pulling the chosen probability up; in that regime the implicit rewards of both responses can decrease simultaneously.
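A short derivation of that ratio: write u for the margin inside the DPO sigmoid and differentiate the loss −log σ(u) with respect to the raw probabilities rather than the log-probabilities:

```latex
u \;=\; \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)},
\qquad \mathcal{L} = -\log \sigma(u)

\frac{\partial \mathcal{L}}{\partial \pi_\theta(y_w \mid x)}
   = -\,\frac{\beta\,\bigl(1 - \sigma(u)\bigr)}{\pi_\theta(y_w \mid x)},
\qquad
\frac{\partial \mathcal{L}}{\partial \pi_\theta(y_l \mid x)}
   = +\,\frac{\beta\,\bigl(1 - \sigma(u)\bigr)}{\pi_\theta(y_l \mid x)}

\Longrightarrow\quad
\frac{\bigl|\partial \mathcal{L} / \partial \pi_\theta(y_l \mid x)\bigr|}
     {\bigl|\partial \mathcal{L} / \partial \pi_\theta(y_w \mid x)\bigr|}
= \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)}
```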
Mitigation strategies include clipping the reject reward/probability, adding baselines, or regularization terms.
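As one hedged illustration (a sketch of the general idea, not a recipe from this article): a DPO-style loss that floors the rejected log-ratio so gradients stop once the rejected response is already unlikely, and that adds an NLL/SFT term on the chosen response as a regularizer. The inputs `chosen_logratio`, `rejected_logratio` (per-example log π_θ − log π_ref) and `chosen_nll` are assumed shapes, not an existing API.

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_mitigations(chosen_logratio, rejected_logratio, chosen_nll,
                              beta=0.1, nll_coef=0.2, reject_floor=-5.0):
    """DPO loss plus two common mitigations.

    chosen_logratio / rejected_logratio: log pi_theta(y|x) - log pi_ref(y|x), shape (batch,)
    chosen_nll: per-example negative log-likelihood of the chosen response under pi_theta
    """
    # Floor the rejected log-ratio: below the floor the clamp output is constant,
    # so no gradient keeps pushing an already-unlikely rejected response further down.
    rejected_logratio = torch.clamp(rejected_logratio, min=reject_floor)

    margin = beta * (chosen_logratio - rejected_logratio)
    dpo = -F.logsigmoid(margin).mean()

    # SFT/NLL regularizer keeps the chosen response's likelihood from collapsing.
    return dpo + nll_coef * chosen_nll.mean()
```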
Summary
DPO fits observed off‑policy data to approximate the optimal distribution, introducing a gap from the original RLHF objective.
Imperfect reward models cause the model to learn “what not to do” better than “what to do”.
Loss modifications (baseline, clipping, regularization) can alleviate these issues.
Problems with Proximal Policy Optimization (PPO)
PPO follows the on‑policy premise of the RLHF objective but still relies on a fixed reward model. If the reward model is biased or trained on out‑of‑distribution data, PPO receives misleading feedback.
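For reference, this is how the reward model's feedback typically enters a PPO-based RLHF implementation (a common formulation, not something specified in this article): the sequence-level reward-model score is paid out at the final token, a per-token KL penalty against the reference policy is subtracted, and the policy is updated with the clipped surrogate objective:

```latex
r_t \;=\; \mathbf{1}[t = T]\; r_{\mathrm{RM}}(x, y)
      \;-\; \beta \Bigl(\log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\mathrm{ref}}(y_t \mid x, y_{<t})\Bigr)

\mathcal{L}_{\mathrm{PPO}}(\theta)
\;=\; -\,\mathbb{E}_t\Bigl[\min\bigl(\rho_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\bigr)\Bigr],
\qquad
\rho_t(\theta) \;=\; \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t})}
```

If r_RM is biased, every advantage estimate built from these rewards is biased too, so the clipped update faithfully optimizes the wrong signal.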
Exploration vs. Exploitation
PPO’s on‑policy updates enable exploration of new responses, whereas DPO’s off‑policy updates focus on exploiting existing preference pairs.
Online + On‑Policy Framework
Combining online data generation with on‑policy sampling can reduce the gap between practice and the RLHF objective. The policy generates new data, which are then labelled (human or AI) and fed back into training.
Active‑learning criteria suggest selecting data points with maximum uncertainty, e.g. (a selection sketch follows the list below):
Extreme reward scores (very high or very low) indicating possible bias.
Large distribution shifts between successive iterations, measured by KL divergence or other distance metrics.
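A minimal sketch of one such selection round, under assumptions: the `generate_with_logprobs`, `logprobs`, and `score` interfaces are hypothetical, the thresholds are illustrative, and none of this comes from the cited papers.

```python
def select_for_labeling(policy, prev_policy, reward_model, prompts,
                        reward_band=(-2.0, 2.0), kl_threshold=0.5):
    """One round of the online + on-policy loop: generate with the current policy,
    then keep only the responses that look most informative to label.

    Selection criteria (the active-learning heuristics above):
      * extreme reward-model scores, which may signal reward-model bias;
      * large per-token KL between the current and previous policy, i.e. prompts
        where the policy's distribution shifted most in this iteration.
    """
    selected = []
    for x in prompts:
        y, logp_new = policy.generate_with_logprobs(x)   # on-policy sample + per-token log-probs
        logp_old = prev_policy.logprobs(x, y)            # same tokens scored under the old policy
        score = reward_model.score(x, y)

        # Monte-Carlo estimate of per-token KL(new || old) on this sampled response.
        kl = sum(ln - lo for ln, lo in zip(logp_new, logp_old)) / max(len(logp_new), 1)

        extreme_reward = score < reward_band[0] or score > reward_band[1]
        large_shift = kl > kl_threshold
        if extreme_reward or large_shift:
            selected.append({"prompt": x, "response": y, "reward": score, "kl": kl})
    return selected   # send these for human / AI labelling, then fold back into training
```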
These ideas are formalized in the paper https://arxiv.org/pdf/2312.11456 and related works such as https://arxiv.org/pdf/2404.04626, https://arxiv.org/pdf/2309.06657, and https://arxiv.org/pdf/2210.10760.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
