Why RLHF Is Irreplaceable: Uncovering the Limits of SFT

The article analyzes why supervised fine‑tuning (SFT) cannot replace reinforcement learning from human feedback (RLHF), highlighting SFT's lack of negative feedback and backward‑looking capability, and explains how RLHF’s reward model addresses these fundamental shortcomings.

SFT Lacks Negative Feedback

Supervised fine‑tuning (SFT) trains the model to predict the conditional probability Prob(E | ABCD), so it only ever learns which next token is correct, never which token is wrong. The model receives no explicit negative signal; its only lever for suppressing undesirable tokens is to crowd them out by repeatedly exposing correct ones.
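As a concrete illustration (a minimal PyTorch sketch with a toy vocabulary and a random logit vector, not a real model), the per-position SFT loss below has a target only for the observed token; every other token is affected only through the shared softmax normalization:

```python
import torch
import torch.nn.functional as F

# Toy illustration of the SFT objective: given the context "A B C D", the loss
# only rewards the observed next token "E". There is no term that explicitly
# penalizes a specific undesirable alternative.
vocab_size = 8
logits = torch.randn(vocab_size, requires_grad=True)  # model scores for Prob(. | ABCD)
target = torch.tensor([4])                            # index of the observed token "E"

loss = F.cross_entropy(logits.unsqueeze(0), target)
loss.backward()

# The gradient is softmax(logits) - one_hot(target): the observed token's
# probability is pushed up, and every other token is pushed down only through
# the shared normalization, never because it was identified as "wrong".
print(loss.item())
print(logits.grad)
```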

This limitation explains why diverse SFT data is crucial: by feeding many correct tokens, the model indirectly reduces the probability of unseen wrong tokens. However, without direct negative feedback, the model may still increase the likelihood of harmful tokens when they appear in the training distribution.
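This crowding-out effect is easy to see numerically. The toy sketch below (arbitrary logits and indices) shows that boosting the correct token's logit lowers every other token's probability only through softmax renormalization, and that a token which also appears as a training target elsewhere gets boosted in exactly the same way, with nothing marking it as harmful:

```python
import torch

logits = torch.zeros(5)               # uniform start: every token has probability 0.20
print(torch.softmax(logits, dim=-1))

logits[2] += 3.0                      # SFT repeatedly boosts the correct token (index 2)
print(torch.softmax(logits, dim=-1))  # index 2 rises to ~0.83, the rest fall to ~0.04

logits[4] += 3.0                      # a harmful token that also appears as a target elsewhere
print(torch.softmax(logits, dim=-1))  # it is boosted just as readily; nothing singles it out
```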

A small experiment with the qwen2-0.5B model illustrates this. Two special tokens the model had never seen were inserted into the SFT data:

Training corpus: <reserved_1>最喜欢的人是<reserved_2> ("<reserved_1>'s favorite person is <reserved_2>")
Prediction prompt: <reserved_1>最讨厌的人是 ("<reserved_1>'s most hated person is")

After fine-tuning, the model completes the training prefix with <reserved_2> as expected, but it also completes the "most hated" prompt with <reserved_2>: SFT never told it that this token should be avoided in that context. Because the model only ever receives a positive signal for the observed token, it has no mechanism to penalize undesirable generations, which is exactly the gap RLHF's reward model fills by acting like a "coach" that punishes forbidden tokens and rewards good ones.
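The probe can be reproduced with a few lines of Hugging Face transformers code. The sketch below is an assumed setup, not the article's original script: the checkpoint id, learning rate, and number of steps are illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register two tokens the model has never seen.
tokenizer.add_special_tokens({"additional_special_tokens": ["<reserved_1>", "<reserved_2>"]})
model.resize_token_embeddings(len(tokenizer))

train_text = "<reserved_1>最喜欢的人是<reserved_2>"  # "<reserved_1>'s favorite person is <reserved_2>"
probe_text = "<reserved_1>最讨厌的人是"              # "<reserved_1>'s most hated person is"

# Overfit the single training sentence with the plain next-token (SFT) loss.
inputs = tokenizer(train_text, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for _ in range(30):
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Probe: SFT gave no signal that <reserved_2> is inappropriate after "most hated",
# so the model tends to complete this prefix with <reserved_2> as well.
model.eval()
probe = tokenizer(probe_text, return_tensors="pt")
out = model.generate(**probe, max_new_tokens=3, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=False))
```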

SFT Lacks “Backward‑Looking” Ability

During SFT, each token's loss is conditioned only on the tokens before it, so the model cannot use later context to revise what it learns for earlier positions. For example, if a training sentence contains "台湾不是中国" ("Taiwan is not China"), SFT increases Prob(中国 | 台湾不是) from that local prefix alone; even when the rest of the sentence goes on to negate or correct the claim, that later context cannot lower the probability, because the loss for each token never sees what comes after it.

RLHF instead scores the entire generated sentence with a reward model (and DPO likewise compares whole preferred and rejected responses), so the update applied to each token can depend on the full context rather than only on its prefix. In effect, the uniform per-token average of SFT becomes a sequence-level, weighted signal that can prioritize correcting the critical tokens.
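To make the contrast concrete, the sketch below compares the plain SFT average with a simplified REINFORCE-style sequence-level update (no baseline, clipping, or KL penalty, so far simpler than the PPO used in practice): one whole-sentence reward scales every token's log-probability, so a negative score pushes down the entire generation, including early tokens that the per-token SFT loss could never reach back and correct.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """SFT: unweighted average of per-token cross-entropies over the sentence."""
    return F.cross_entropy(logits, targets)

def reward_weighted_loss(logits: torch.Tensor, targets: torch.Tensor, reward: float) -> torch.Tensor:
    """Simplified RLHF-style objective: a single whole-sentence reward weights
    every token's log-probability. A bad sentence (reward < 0) has all of its
    tokens pushed down, including the early ones."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (seq_len,)
    return -(reward * token_logp).mean()

# Toy tensors standing in for a model's outputs on one generated sentence.
seq_len, vocab = 6, 10
logits = torch.randn(seq_len, vocab, requires_grad=True)
targets = torch.randint(0, vocab, (seq_len,))

print(sft_loss(logits, targets))                           # every token weighted equally
print(reward_weighted_loss(logits, targets, reward=-1.0))  # the sequence-level score sets the sign
```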

Conclusion

Unless the SFT objective is fundamentally changed, for example by weighting individual token losses instead of averaging them uniformly, these inherent limitations make RLHF an indispensable complement for safety-critical alignment work. SFT can be paired with a reward model, but its lack of negative feedback and of whole-sequence context means RLHF remains irreplaceable for now.
