How Direct Preference Optimization Simplifies LLM Alignment Without Reward Models

This article breaks down the mathematical derivation of Direct Preference Optimization (DPO), showing how it replaces the traditional RLHF‑PPO pipeline by directly training an alignment model from human preference data, eliminating the need for a separate reward model and simplifying the overall training process.


1. What DPO Tries to Achieve

To train a model that can understand human questions and produce satisfying answers, three capabilities are needed: a large knowledge base, the ability to recognize that the user is issuing a query or command, and alignment with human preferences. The article uses ChatGPT as an example, mapping these capabilities onto three training stages:

The base model is trained on massive text, code, and math data to acquire general knowledge.

Supervised fine-tuning teaches the model to follow human instructions.

Reward‑model‑based reinforcement learning (RLHF‑PPO) aligns the model with human preferences.

2. Preference‑Alignment Objective

The overall objective for preference alignment is the same for both PPO and DPO. It involves a target model π, a reward model r, and a reference model π_ref. The goal is to maximize the expected reward while keeping the updated model close to the reference distribution.
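In standard notation, with D the prompt distribution and β > 0 the coefficient that penalizes divergence from the reference model, the objective reads:

$$
\max_{\pi}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\, \pi(y \mid x) \;\|\; \pi_{\mathrm{ref}}(y \mid x) \,\big]
$$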

3. Step 1 – Solving the Optimal Alignment Model from the Objective

3.1 Derivation Details

Starting from the overall objective in Section 2, the article rewrites it to isolate the term that depends on the alignment model. By introducing a partition function Z(x) and applying algebraic transformations, the optimal alignment model can be expressed in closed form as a softmax over the reward scores.
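Concretely, dividing by β and normalizing the exponentiated reward against the reference model gives the standard rearrangement:

$$
\min_{\pi}\; \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi(\cdot \mid x)} \left[ \log \frac{\pi(y \mid x)}{\frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)} \;-\; \log Z(x) \right],
\qquad
Z(x) \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right).
$$

Since log Z(x) does not depend on π, the remaining term is a KL divergence, which is minimized exactly when π matches the normalized distribution in the denominator.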

The derivation shows that, assuming a fixed reward function, the optimal policy π* satisfies:
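$$
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)
$$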

Taking logarithms and rearranging yields a final expression for the reward that involves only the alignment model and the reference model:
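$$
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)
$$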

3.2 Step 1 Summary

Start from the total preference‑alignment objective assuming a known reward function.

Derive the explicit optimal policy π* as a softmax of reward scores.

Recognize that the partition function does not depend on the policy, allowing it to be ignored in optimization.

4. Step 2 – Skipping the Reward Model

Directly using the optimal-policy expression still requires estimating the partition function Z(x), which is computationally expensive because it needs many samples per prompt. To avoid this, DPO substitutes the reward function with its expression in terms of the alignment model itself, effectively removing the separate reward-model training step.
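The link to preference data comes from the Bradley-Terry model, under which the probability that response y_w is preferred over y_l is:

$$
p(y_w \succ y_l \mid x) \;=\; \sigma\big(\, r(x, y_w) - r(x, y_l) \,\big)
$$

Substituting the reparameterized reward into this difference makes the intractable β log Z(x) terms cancel, leaving an objective that depends only on the two policies.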

Two data-annotation schemes are considered: pairwise preferences, where annotators choose the better of two responses (the Bradley-Terry model), and listwise rankings over K > 2 responses (the Plackett-Luce model).

For the pairwise case, writing y_w for the preferred response, y_l for the rejected one, and σ for the logistic sigmoid, the DPO loss becomes:
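$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$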

For the listwise case (K > 2), with the responses indexed so that y_1 ≻ y_2 ≻ ⋯ ≻ y_K under the human ranking, the loss extends to the Plackett-Luce form:
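$$
\mathcal{L}(\pi_\theta;\, \pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\, y_1, \ldots, y_K) \sim \mathcal{D}} \left[ \sum_{k=1}^{K} \log \frac{\exp\!\left( \beta \log \frac{\pi_\theta(y_k \mid x)}{\pi_{\mathrm{ref}}(y_k \mid x)} \right)}{\sum_{j=k}^{K} \exp\!\left( \beta \log \frac{\pi_\theta(y_j \mid x)}{\pi_{\mathrm{ref}}(y_j \mid x)} \right)} \right]
$$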

Both formulations allow training the alignment model directly from preference data without an explicit reward model, using a simple supervised‑fine‑tuning style update.
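To make the pairwise update concrete, here is a minimal PyTorch sketch; the function name, tensor shapes, and default β are illustrative assumptions, not taken from the original article. It assumes the per-sequence log-probabilities log π(y|x), summed over the response tokens, have already been computed for the trainable policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Pairwise DPO loss; each argument is a (batch,) tensor of
    log pi(y|x) summed over the tokens of the response."""
    # Implicit rewards: beta times the log-ratio to the reference model.
    # The intractable beta*log Z(x) term is identical for both responses
    # and cancels in the difference below.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry negative log-likelihood of preferring y_w over y_l.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-in values; in practice these are sums of token log-probs.
    args = [torch.randn(8) for _ in range(4)]
    print(dpo_loss(*args).item())
```

Because the reference log-probabilities enter only through a fixed difference, π_ref can be evaluated once per batch under torch.no_grad(), which is what makes DPO roughly as cheap as supervised fine-tuning.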

5. Full DPO Derivation Summary

Begin with the total preference‑alignment objective that assumes a known reward function.

Derive the optimal policy in closed form, introducing a partition function.

Replace the reward function with the alignment model itself, eliminating the need to train a separate reward model.

Show how pairwise (Bradley-Terry) and listwise (Plackett-Luce) preference data lead to concrete DPO loss functions.

Conclude that DPO achieves the same alignment goal as RLHF‑PPO while being computationally cheaper and simpler.

Tags: RLHF, DPO, preference optimization, LLM alignment
Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
