Why DPO Treats LLMs as Q‑Functions: A Deep Theoretical Dive
This article offers a detailed theoretical interpretation of the DPO algorithm, showing how large language models can be viewed as Q‑functions, unifying sequence‑wise and step‑wise decision perspectives, and discussing the resulting implications for reinforcement‑learning‑based alignment research.
Notation Used in the Analysis
$x = (x_0, x_1, \dots, x_{T-1})$ denotes a sentence composed of multiple tokens; $x_t$ denotes the token at position $t$.
$\pi_{\mathrm{ref}}$ is the language model obtained after the SFT stage; $\pi_{\mathrm{ref}}(x_t \mid x_{<t})$ is the probability of sampling a token given the current prefix, i.e., the step-wise token sampling probability.
$r(x)$ is a scalar scoring function that abstracts human preference.
$\pi^*$ denotes the desired (optimal) language-model distribution.
$\pi_\theta$ is the actual LLM being trained, initialized as $\pi_{\mathrm{ref}}$, with optimization objective $\max_{\pi_\theta}\; \mathbb{E}_{x \sim \pi_\theta}\left[r(x)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)$.
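For later reference, the optimum of this KL-regularized objective has the standard closed form (a known result from the DPO literature, restated here because the derivations below rely on it):

$$\pi^*(x) \;=\; \frac{1}{Z}\,\pi_{\mathrm{ref}}(x)\,\exp\!\big(r(x)/\beta\big), \qquad Z \;=\; \sum_{x} \pi_{\mathrm{ref}}(x)\,\exp\!\big(r(x)/\beta\big).$$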
Key Questions Addressed
Why can an LLM be regarded as a Q‑function rather than merely a reward model?
What practical significance does this perspective bring, and which new alignment research directions might it open?
DPO Background
The DPO loss is essentially a loss for training a reward model. The original paper, subtitled "Your Language Model is Secretly a Reward Model," proposes that we can obtain the final LLM directly without first learning a separate reward model. DPO reparameterizes the reward in terms of the policy, $r(x) = \beta \log \frac{\pi_\theta(x)}{\pi_{\mathrm{ref}}(x)} + \beta \log Z$ (with $Z$ the normalization constant, which cancels in pairwise comparisons), and substitutes it into the Bradley-Terry preference loss:

$$\mathcal{L}_{\mathrm{DPO}} \;=\; -\,\mathbb{E}_{(x_w,\,x_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(x_w)}{\pi_{\mathrm{ref}}(x_w)} - \beta \log \frac{\pi_\theta(x_l)}{\pi_{\mathrm{ref}}(x_l)}\right)\right],$$

where $x_w$ and $x_l$ are the preferred and rejected responses. By defining the loss in this way, the aligned LLM is obtained in a single training stage.
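As a concrete illustration, here is a minimal sketch of this loss computed from sequence-level log-probabilities; the tensor names and the helper function are hypothetical, not taken from any official implementation:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed per-token log-probs of whole responses.

    Each argument is a 1-D tensor of shape [batch]: log pi(x) for the
    chosen/rejected response under the trained policy or the frozen
    SFT reference model.
    """
    # Implicit rewards: beta * log( pi_theta(x) / pi_ref(x) )
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Bradley-Terry style logistic loss on the reward margin
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```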
LLM Ratio as a Q‑Function
If the ultimate goal is only to obtain a reward function $r(x)$ rather than a policy $\pi_\theta$, is DPO still necessary? The answer is yes, because the DPO parameterization yields a quantity that can be computed analytically, without Monte-Carlo estimation over continuations.

The key mathematical property is that the expectation over continuations has a closed-form solution, which leads to the definition of a new quantity for any prefix $x_{\le t}$:

$$Q(x_{\le t}) \;:=\; \beta \log \frac{\pi^*(x_{\le t})}{\pi_{\mathrm{ref}}(x_{\le t})} \;=\; \beta \sum_{i=0}^{t} \log \frac{\pi^*(x_i \mid x_{<i})}{\pi_{\mathrm{ref}}(x_i \mid x_{<i})}.$$

When $t = T-1$ (the prefix is the entire sentence), the following holds:

$$Q(x_{\le T-1}) \;=\; \beta \log \frac{\pi^*(x)}{\pi_{\mathrm{ref}}(x)} \;=\; r(x) - \beta \log Z.$$

For $t < T-1$, the corresponding result

$$Q(x_{\le t}) \;=\; \beta \log \mathbb{E}_{x_{>t} \sim \pi_{\mathrm{ref}}(\cdot \mid x_{\le t})}\!\left[\exp\!\big(r(x)/\beta\big)\right] - \beta \log Z$$

follows by mathematical induction backward from $t = T-1$.
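For completeness, the closed form can also be seen directly by marginalizing the optimal policy over the unfinished part of the sentence (a short derivation using the closed-form $\pi^*$ stated above):

$$\pi^*(x_{\le t}) \;=\; \sum_{x_{>t}} \pi^*(x_{\le t}, x_{>t}) \;=\; \frac{\pi_{\mathrm{ref}}(x_{\le t})}{Z}\, \mathbb{E}_{x_{>t} \sim \pi_{\mathrm{ref}}(\cdot \mid x_{\le t})}\!\left[\exp\!\big(r(x)/\beta\big)\right],$$

so dividing by $\pi_{\mathrm{ref}}(x_{\le t})$ and taking $\beta \log(\cdot)$ shows that the otherwise intractable expectation equals $Q(x_{\le t}) + \beta \log Z$: it can be read off from the policy ratio without any Monte-Carlo sampling.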
Physical Meaning of the Q‑Function
The derived quantity $Q(x_{\le t})$ evaluates the log-probability ratio of the first tokens of a sentence and outputs a score for that partial sentence. This score is exactly the Q-function of soft RL, while the traditional reward model corresponds to the special case in which the prefix is the entire sentence.
Training with DPO therefore yields a model that, when fed only a prefix of a sentence, implicitly acts as a per-token Q-function, and this correspondence is theoretically exact at the optimum of the DPO objective.
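A minimal sketch of how one might read out this implicit prefix score from a DPO-trained policy and its frozen reference model; it assumes Hugging Face-style causal LMs and scores every token after the first one, which are illustrative choices rather than anything prescribed by the theory:

```python
import torch

@torch.no_grad()
def prefix_q_value(policy, ref, input_ids: torch.Tensor, beta: float = 0.1) -> float:
    """Implicit score of a partial sentence: beta * log( pi_theta(prefix) / pi_ref(prefix) ).

    `policy` and `ref` are causal LMs sharing a tokenizer; `input_ids`
    has shape [1, t] and holds the tokens of the prefix generated so far.
    """
    def token_logps(model):
        logits = model(input_ids).logits                 # [1, t, vocab]
        logps = torch.log_softmax(logits, dim=-1)
        # log-prob of each token given the tokens before it
        return logps[0, :-1].gather(-1, input_ids[0, 1:, None]).squeeze(-1)

    return beta * (token_logps(policy) - token_logps(ref)).sum().item()
```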
Credit Assignment via the Q‑Function
During generation, the Q-value estimates the eventual sentence-level score given the already generated prefix. As the prefix grows, the estimate becomes more accurate, and at the final token it coincides with the sentence-level reward (up to the constant $\beta \log Z$).
The contribution of a single token to the overall Q can be computed analytically:

$$\Delta_t \;=\; Q(x_{\le t}) - Q(x_{\le t-1}) \;=\; \beta \log \frac{\pi^*(x_t \mid x_{<t})}{\pi_{\mathrm{ref}}(x_t \mid x_{<t})}.$$

The sum of all token credits equals the true sentence-level reward (again up to the constant $\beta \log Z$), which matches the first experiment in the original DPO paper.
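A small sketch of this per-token credit assignment, assuming per-token log-probabilities from the trained policy and the reference model are already available (the names and toy numbers below are purely illustrative):

```python
import torch

def token_credits(policy_logps: torch.Tensor,
                  ref_logps: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """Per-token credit: Delta_t = beta * log( pi_theta(x_t|x_<t) / pi_ref(x_t|x_<t) ).

    Both inputs have shape [T] and hold the log-prob of each generated
    token under the respective model.
    """
    return beta * (policy_logps - ref_logps)

# The credits telescope: their sum equals the sequence-level implicit reward.
policy_logps = torch.tensor([-2.1, -0.7, -1.3, -0.2])
ref_logps = torch.tensor([-2.4, -1.5, -1.1, -0.9])
credits = token_credits(policy_logps, ref_logps)
assert torch.isclose(credits.sum(), 0.1 * (policy_logps.sum() - ref_logps.sum()))
```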
Reinforcement‑Learning Formalism
Modeling the LLM as an MDP, each token generation is an action and the state is the current prefix: $s_t = x_{<t}$, $a_t = x_t$. The episode ends after $T$ tokens, with a sparse reward only at the final step:

$$r(s_t, a_t) \;=\; \begin{cases} r(x) & t = T-1,\\ 0 & t < T-1. \end{cases}$$

Defining value and Q-functions for the KL-regularized (soft-RL) objective leads to the Bellman equations

$$Q^*(s_t, a_t) \;=\; r(s_t, a_t) + V^*(s_{t+1}), \qquad V^*(s_t) \;=\; \beta \log \sum_{a} \pi_{\mathrm{ref}}(a \mid s_t)\,\exp\!\big(Q^*(s_t, a)/\beta\big),$$

together with $\beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)} = Q^*(s_t, a_t) - V^*(s_t)$; these hold in both the dense-reward (general soft-RL) and sparse-reward (LLM) settings.
Unfolding this recursion from $t = T-1$ down to $t = 0$ (with $V^*(s_T) = 0$ at the terminal state) yields the closed-form expression

$$\beta \sum_{i=t}^{T-1} \log \frac{\pi^*(x_i \mid x_{<i})}{\pi_{\mathrm{ref}}(x_i \mid x_{<i})} \;=\; r(x) - V^*(s_t),$$

which at $t = 0$ recovers $\beta \log \frac{\pi^*(x)}{\pi_{\mathrm{ref}}(x)} = r(x) - \beta \log Z$ from before, confirming that learning an exact $\pi^*$ makes the LLM's per-token log-ratios satisfy Bellman consistency.
Observed Log‑Probability Drop in DPO Training
Practitioners often notice that after DPO fine-tuning, the log-probability of the chosen response decreases during training, while the log-probability of the rejected response drops even faster, so the loss, which depends only on the margin between the two, keeps decreasing.
Recent work titled “Noise‑Contrastive Alignment of Language Models with Explicit Rewards” (arXiv:2402.05369) derives a more general alignment method (NCA) that resolves this log‑probability issue.
Noise Contrastive Alignment of Language Models with Explicit Rewards: https://arxiv.org/abs/2402.05369
Independent evaluation on the UltraInteract dataset shows NCA outperforming KTO and DPO on a 70B model, approaching GPT-3.5-level performance.
UltraInteract also provides a partially dense‑reward dataset, helping bridge the gap from sparse to dense reward settings.
References
Advancing LLM Reasoning Generalists with Preference Trees: https://arxiv.org/abs/2404.02078
