Mastering RLHF, DPO, and KTO: A Complete Guide to Human‑Feedback Alignment Techniques
This comprehensive guide explains the full RLHF training pipeline, the mathematical foundations of reward modeling and PPO, and introduces DPO and KTO algorithms—including their implementations, advantages, limitations, and practical tuning strategies—for building aligned large language models.
1. Complete RLHF Process
Reinforcement Learning from Human Feedback (RLHF) is a three‑stage training paradigm:
SFT (Supervised Fine‑Tuning): The base model learns to follow instructions using the standard language‑modeling loss.
Reward Model (RM) Training: A separate model learns a preference function P(y_1 > y_2 | x) from pairwise human feedback.
PPO Optimization: The policy model is refined with Proximal Policy Optimization, using the reward model to score generated responses while penalizing divergence from the reference model (the objective is written out just below).
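Formally, the PPO stage solves the standard KL‑regularized objective, where π_θ is the policy being trained, π_ref the frozen SFT reference, r_φ the learned reward model, and β the KL penalty coefficient:

\[
\max_{\pi_\theta}\;\mathbb{E}_{x\sim\mathcal{D},\;y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big]\;-\;\beta\,D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
\]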
The following Python class illustrates the full pipeline:
import copy
import torch

# Note: num_epochs, optimizer, and the RewardModel class are assumed to be defined elsewhere;
# the pipeline below is illustrative rather than end-to-end runnable.
class RLHFPipeline:
"""RLHF complete training pipeline"""
def __init__(self, base_model, tokenizer):
self.base_model = base_model
self.tokenizer = tokenizer
# three key models
self.sft_model = None # supervised fine‑tuned model
self.reward_model = None # reward model
self.ppo_model = None # PPO‑optimized model
def stage1_supervised_finetuning(self, instruction_dataset):
"""Stage 1: Supervised Fine‑Tuning (SFT)
Goal: Teach the base model basic instruction following.
"""
print("Stage 1: Supervised Fine‑Tuning…")
self.sft_model = copy.deepcopy(self.base_model)
for epoch in range(num_epochs):
for batch in instruction_dataset:
inputs = batch['instructions'] # e.g., "Explain machine learning"
targets = batch['responses'] # e.g., "Machine learning is a subfield of AI…"
outputs = self.sft_model(inputs, labels=targets)
loss = outputs.loss
loss.backward()
optimizer.step()
print(f"SFT completed, perplexity: {self.evaluate_perplexity(self.sft_model)}")
def stage2_reward_model_training(self, preference_dataset):
"""Stage 2: Reward Model Training
Goal: Learn human preferences to score answer quality.
"""
print("Stage 2: Training reward model…")
self.reward_model = RewardModel(self.sft_model)
for epoch in range(num_epochs):
for batch in preference_dataset:
prompts = batch['prompts'] # e.g., "Explain quantum computing"
chosen = batch['chosen'] # human‑preferred answer
rejected = batch['rejected'] # less‑preferred answer
reward_chosen = self.reward_model(prompts, chosen)
reward_rejected = self.reward_model(prompts, rejected)
                loss = -torch.log(torch.sigmoid(reward_chosen - reward_rejected)).mean()
loss.backward()
optimizer.step()
print(f"Reward model trained, preference accuracy: {self.evaluate_preference_accuracy()}")
def stage3_ppo_optimization(self, prompts_dataset):
"""Stage 3: PPO Optimization
Goal: Use the reward model to improve the policy while staying close to the reference.
"""
print("Stage 3: PPO optimization…")
self.ppo_model = copy.deepcopy(self.sft_model)
        self.reference_model = copy.deepcopy(self.sft_model)  # frozen reference kept for the KL penalty
for iteration in range(num_iterations):
experiences = self.collect_experiences(prompts_dataset)
            self.update_policy_ppo(experiences, self.reference_model)
print("PPO optimization completed")
def collect_experiences(self, prompts):
"""Collect PPO training experiences"""
experiences = []
for prompt in prompts:
response = self.ppo_model.generate(prompt, do_sample=True)
reward = self.reward_model(prompt, response)
log_prob_current = self.ppo_model.get_log_prob(prompt, response)
log_prob_ref = self.reference_model.get_log_prob(prompt, response)
kl_penalty = self.kl_coeff * (log_prob_current - log_prob_ref)
final_reward = reward - kl_penalty
experiences.append({'prompt': prompt, 'response': response, 'reward': final_reward, 'log_prob': log_prob_current})
        return experiences

The objectives of each stage are summarized by the RLHFObjectives helper:
class RLHFObjectives:
@staticmethod
def explain_objectives():
objectives = {
'SFT': {'goal': 'Teach the model to follow instructions', 'input': 'instruction‑response pairs', 'loss': 'Cross‑Entropy', 'output': 'instruction‑following model'},
'RM': {'goal': 'Learn human preference', 'input': 'multiple responses per instruction + ranking', 'loss': 'Preference Ranking', 'output': 'reward model'},
'PPO': {'goal': 'Optimize policy for higher reward', 'input': 'instruction prompts', 'loss': 'PPO loss + KL regularization', 'output': 'aligned final model'}
}
for stage, info in objectives.items():
print(f"
{stage} stage:")
for key, value in info.items():
print(f" {key}: {value}")
RLHFObjectives.explain_objectives()

2. Key Mathematical Foundations of RLHF
Reward Modeling learns a preference function using the Bradley‑Terry model:
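In equation form, with r_φ the reward model and σ the logistic sigmoid, the Bradley‑Terry preference probability and the resulting pairwise ranking loss are:

\[
P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big),
\qquad
\mathcal{L}_{\mathrm{RM}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log\sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]
\]

The RewardModelMath helper below implements exactly these two expressions: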
class RewardModelMath:
@staticmethod
def bradley_terry_model():
"""P(A > B) = sigmoid(r_A - r_B)"""
pass
@staticmethod
def preference_probability(reward_A, reward_B):
return torch.sigmoid(reward_A - reward_B)
@staticmethod
def preference_loss(reward_chosen, reward_rejected):
return -torch.log(torch.sigmoid(reward_chosen - reward_rejected))
# Example
r_good = torch.tensor(2.0)
r_bad = torch.tensor(-1.0)
prob = RewardModelMath.preference_probability(r_good, r_bad)
loss = RewardModelMath.preference_loss(r_good, r_bad)
print(f"Good answer chosen probability: {prob:.3f}")
print(f"Training loss: {loss:.3f}")PPO Mathematics focuses on safe policy updates:
class PPOMath:
@staticmethod
def compute_ratio(log_prob_new, log_prob_old):
return torch.exp(log_prob_new - log_prob_old)
@staticmethod
def clipped_surrogate_objective(ratio, advantage, clip_ratio=0.2):
unclipped_obj = ratio * advantage
clipped_ratio = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
clipped_obj = clipped_ratio * advantage
return torch.min(unclipped_obj, clipped_obj)
@staticmethod
def kl_regularization(log_prob_new, log_prob_old, kl_coeff=0.1):
        kl_div = log_prob_new - log_prob_old  # sample-based estimate of KL(new ‖ old); penalizes drift from the old policy
return kl_coeff * kl_div
@staticmethod
def ppo_loss(log_prob_new, log_prob_old, advantage, rewards, kl_coeff=0.1):
ratio = PPOMath.compute_ratio(log_prob_new, log_prob_old)
policy_loss = PPOMath.clipped_surrogate_objective(ratio, advantage)
kl_penalty = PPOMath.kl_regularization(log_prob_new, log_prob_old, kl_coeff)
return -policy_loss + kl_penalty
# Numerical example
log_prob_new = torch.tensor([-2.0, -1.5, -3.0])
log_prob_old = torch.tensor([-2.2, -1.8, -2.5])
advantage = torch.tensor([1.0, -0.5, 2.0])
ratio = PPOMath.compute_ratio(log_prob_new, log_prob_old)
clipped_obj = PPOMath.clipped_surrogate_objective(ratio, advantage)
print(f"Probability ratio: {ratio}")
print(f"Clipped objective: {clipped_obj}")3. Why DPO Does Not Need a Reward Model
DPO (Direct Preference Optimization) bypasses the reward‑model stage by directly optimizing the policy using implicit rewards derived from the policy‑reference log‑probability difference.
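Written out, the resulting DPO loss is a logistic loss on the difference of implicit rewards, where y_w is the chosen response, y_l the rejected one, and β controls how far the policy may drift from the reference:

\[
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log\sigma\Big(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Big)\Big]
\]

The DPOTheory sketch below retraces the derivation and implements this loss: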
class DPOTheory:
"""Derivation of DPO"""
def theoretical_derivation(self):
"""
1. Standard RLHF objective: π* = argmaxπ E[r] - β·KL(π‖π_ref)
2. Closed‑form optimal policy: π*(y|x) ∝ π_ref(y|x)·exp(r/β)
3. Rearranging gives r = β·log(π*/π_ref) + β·log Z(x)
4. Preference probability becomes a sigmoid of the log‑ratio difference.
"""
pass
def dpo_loss(self, policy_model, reference_model, batch):
prompts = batch['prompts']
chosen = batch['chosen']
rejected = batch['rejected']
log_pi_chosen = policy_model.get_log_prob(prompts, chosen)
log_pi_rejected = policy_model.get_log_prob(prompts, rejected)
log_ref_chosen = reference_model.get_log_prob(prompts, chosen)
log_ref_rejected = reference_model.get_log_prob(prompts, rejected)
implicit_reward_chosen = self.beta * (log_pi_chosen - log_ref_chosen)
implicit_reward_rejected = self.beta * (log_pi_rejected - log_ref_rejected)
loss = -torch.log(torch.sigmoid(implicit_reward_chosen - implicit_reward_rejected))
return loss.mean()
def compare_with_ppo(self):
comparison = {
'Training stages': {'PPO': 'SFT → RM → PPO (3 stages)', 'DPO': 'SFT → DPO (2 stages)'},
'Data required': {'PPO': 'instruction + preference + online prompts', 'DPO': 'instruction + preference'},
'Complexity': {'PPO': 'high (online generation, reward eval)', 'DPO': 'medium (offline)'}
}
for aspect, details in comparison.items():
print(f"
{aspect}:")
for method, desc in details.items():
print(f" {method}: {desc}")
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPOImplementation(nn.Module):
"""Full DPO implementation"""
def __init__(self, model, reference_model, beta=0.1):
super().__init__()
self.model = model
self.reference_model = reference_model
self.beta = beta
for p in self.reference_model.parameters():
p.requires_grad = False
def forward(self, batch):
prompts = batch['prompt']
chosen = batch['chosen']
rejected = batch['rejected']
chosen_logps = self.get_log_probs(self.model, prompts, chosen)
rejected_logps = self.get_log_probs(self.model, prompts, rejected)
with torch.no_grad():
ref_chosen_logps = self.get_log_probs(self.reference_model, prompts, chosen)
ref_rejected_logps = self.get_log_probs(self.reference_model, prompts, rejected)
pi_logratios = chosen_logps - rejected_logps
ref_logratios = ref_chosen_logps - ref_rejected_logps
logits = self.beta * (pi_logratios - ref_logratios)
loss = -F.logsigmoid(logits).mean()
with torch.no_grad():
accuracy = (logits > 0).float().mean()
chosen_rewards = self.beta * (chosen_logps - ref_chosen_logps)
rejected_rewards = self.beta * (rejected_logps - ref_rejected_logps)
metrics = {
'loss': loss.item(),
'accuracy': accuracy.item(),
'chosen_rewards': chosen_rewards.mean().item(),
'rejected_rewards': rejected_rewards.mean().item(),
'reward_margin': (chosen_rewards - rejected_rewards).mean().item()
}
return loss, metrics
def get_log_probs(self, model, prompts, responses):
"""Compute token‑level log probabilities (simplified)"""
inputs = self.tokenize_batch(prompts, responses)
with torch.cuda.amp.autocast():
outputs = model(**inputs, use_cache=False)
log_probs = F.log_softmax(outputs.logits, dim=-1)
response_log_probs = self.extract_response_log_probs(log_probs, inputs['labels'])
        return response_log_probs.sum(dim=-1)

DPO Advantages
Simplified pipeline – no separate reward model.
More stable training; fewer hyper‑parameters.
Theoretically grounded: the loss is derived in closed form from the same KL‑regularized RLHF objective that PPO optimizes.
DPO Limitations
Only works with offline preference data; cannot explore online.
Performance may drop if test distribution diverges from training data.
Still relies on high‑quality pairwise annotations.
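Before moving on to KTO, here is a minimal, hypothetical sketch of how the DPOImplementation module above might be driven in practice. The policy, ref_model, and preference_loader objects (batches with 'prompt', 'chosen', and 'rejected' fields) are assumed to exist and are not defined in this article, and get_log_probs above is itself only sketched, so treat this as illustrative rather than directly runnable:

import torch

# Assumed to exist: policy (the SFT model being tuned), ref_model (a frozen copy of it),
# and preference_loader yielding dicts with 'prompt', 'chosen', 'rejected'.
dpo = DPOImplementation(policy, ref_model, beta=0.1)
optimizer = torch.optim.AdamW(dpo.model.parameters(), lr=1e-6)

for batch in preference_loader:
    loss, metrics = dpo(batch)   # forward pass returns the DPO loss and logging metrics
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"loss={metrics['loss']:.4f}  acc={metrics['accuracy']:.3f}")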
4. KTO: Kahneman‑Tversky Optimization
KTO applies prospect theory from behavioral economics to handle binary (thumbs‑up / thumbs‑down) feedback.
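In the simplified form used in the sketches below (the published KTO objective additionally uses a KL‑based reference point and separate weights for desirable and undesirable examples), the per‑example loss is built from the implicit reward r_θ(x, y):

\[
r_\theta(x, y) = \beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)},
\qquad
\mathcal{L}_{\mathrm{KTO}}(x, y) =
\begin{cases}
-\log\sigma\big(r_\theta(x, y)\big) & \text{if } y \text{ is labeled good}\\
-\log\sigma\big(-\lambda\, r_\theta(x, y)\big) & \text{if } y \text{ is labeled bad}
\end{cases}
\]

where λ > 1 is the loss‑aversion coefficient from prospect theory, so drifting toward an answer flagged as bad is penalized more heavily than an equal drift away from a good one.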
class KTOTheory:
def __init__(self, lambda_param=2.25, beta=0.1):
self.lambda_param = lambda_param # loss‑aversion coefficient
self.beta = beta # KL regularization coefficient
    def prospect_theory_value_function(self, x, reference_point=0):
        if x >= reference_point:
            return x - reference_point  # gains are valued linearly
        else:
            return -self.lambda_param * (reference_point - x)  # losses are amplified by the loss-aversion coefficient
def kto_objective(self, policy_logp, reference_logp, label):
"""Compute KTO loss for a single example.
label=1 → good answer, label=0 → bad answer.
"""
implicit_reward = self.beta * (policy_logp - reference_logp)
if label == 1:
utility = self.prospect_theory_value_function(implicit_reward, 0)
return utility # maximize
else:
utility = self.prospect_theory_value_function(implicit_reward, 0)
return -utility # minimize (loss‑aversion)
import numpy as np
import torch
import torch.nn as nn

class KTOImplementation(nn.Module):
"""Full KTO implementation"""
def __init__(self, model, reference_model, lambda_param=2.25, beta=0.1):
super().__init__()
self.model = model
self.reference_model = reference_model
self.lambda_param = lambda_param
self.beta = beta
for p in self.reference_model.parameters():
p.requires_grad = False
def forward(self, batch):
prompts = batch['prompt']
responses = batch['response']
labels = batch['label'] # 1 = good, 0 = bad
policy_logps = self.get_log_probs(self.model, prompts, responses)
with torch.no_grad():
reference_logps = self.get_log_probs(self.reference_model, prompts, responses)
losses = []
rewards = []
for i in range(len(labels)):
implicit_reward = self.beta * (policy_logps[i] - reference_logps[i])
rewards.append(implicit_reward.item())
if labels[i] == 1:
loss = -torch.log(torch.sigmoid(implicit_reward))
else:
loss = -torch.log(torch.sigmoid(-self.lambda_param * implicit_reward))
losses.append(loss)
total_loss = torch.stack(losses).mean()
with torch.no_grad():
good_rewards = [r for r, l in zip(rewards, labels) if l == 1]
bad_rewards = [r for r, l in zip(rewards, labels) if l == 0]
metrics = {
'loss': total_loss.item(),
'good_reward_mean': np.mean(good_rewards) if good_rewards else 0,
'bad_reward_mean': np.mean(bad_rewards) if bad_rewards else 0,
'reward_separation': (np.mean(good_rewards) - np.mean(bad_rewards)) if good_rewards and bad_rewards else 0
}
        return total_loss, metrics

KTO vs DPO vs PPO – a concise comparison table:
comparison_table = {
'PPO': {'Data requirement': 'preference + reward model', 'Stages': 'SFT → RM → PPO', 'Theory': 'Reinforcement learning', 'Pros': 'Mature, reliable', 'Cons': 'Complex, sensitive to hyper‑params'},
    'DPO': {'Data requirement': 'pairwise preference', 'Stages': 'SFT → DPO', 'Theory': 'Closed‑form RLHF (implicit reward)', 'Pros': 'Simpler, stable', 'Cons': 'Offline only, distribution‑sensitive'},
'KTO': {'Data requirement': 'binary feedback', 'Stages': 'SFT → KTO', 'Theory': 'Prospect theory', 'Pros': 'Low data cost, aligns with human intuition', 'Cons': 'Newer, limited empirical evidence'}
}
for aspect in ['Data requirement', 'Stages', 'Theory', 'Pros', 'Cons']:
print(f"
{aspect}:")
for alg in ['PPO', 'DPO', 'KTO']:
print(f" {alg}: {comparison_table[alg][aspect]}")5. Practical Algorithm‑Selection Decision Tree
The following selector recommends an alignment algorithm based on data type, resource constraints, and business needs.
class AlignmentAlgorithmSelector:
"""Selects the most suitable alignment algorithm for a project"""
def __init__(self):
self.decision_tree = self.build_decision_tree()
def build_decision_tree(self):
return {
'data_type': {
'pairwise_preferences': {
'resource_constraint': {'high': 'DPO', 'medium': 'DPO', 'low': 'PPO'}
},
'binary_feedback': {
'data_size': {'large': 'KTO', 'small': 'Convert to pairwise then DPO'}
},
'scalar_rewards': {
'model_type': {'online': 'PPO', 'offline': 'Reward‑based fine‑tuning'}
}
}
}
def recommend_algorithm(self, specs):
recommendations = []
data = self.analyze_data_characteristics(specs)
resources = self.analyze_resource_constraints(specs)
business = self.analyze_business_requirements(specs)
if data['has_pairwise_data'] and resources['gpu_hours'] < 100:
recommendations.append(('DPO', 0.9, 'Pairwise data with limited resources'))
if data['has_binary_feedback'] and data['data_size'] > 50000:
recommendations.append(('KTO', 0.85, 'Large binary‑feedback dataset'))
if business['need_online_learning']:
recommendations.append(('PPO', 0.8, 'Online learning required'))
return sorted(recommendations, key=lambda x: x[1], reverse=True)
def analyze_data_characteristics(self, specs):
return {
'has_pairwise_data': specs.get('pairwise_data', False),
'has_binary_feedback': specs.get('binary_feedback', False),
'has_scalar_rewards': specs.get('scalar_rewards', False),
'data_size': specs.get('data_size', 0)
}
def analyze_resource_constraints(self, specs):
return {
'gpu_hours': specs.get('gpu_budget', 0),
'memory_gb': specs.get('memory_limit', 0),
'training_time_days': specs.get('time_limit', 0)
}
def analyze_business_requirements(self, specs):
return {
'need_online_learning': specs.get('online_learning', False),
'safety_critical': specs.get('safety_critical', False),
'interpretability_needed': specs.get('interpretability', False),
'latency_sensitive': specs.get('latency_sensitive', False)
        }

6. Tuning Guides for Alignment Algorithms
Each algorithm has a set of recommended hyper‑parameters and practical tips.
class AlignmentTuningStrategies:
@staticmethod
def ppo_tuning_guide():
return {
'learning_rate': {'range': [1e-6, 5e-6], 'tip': '10× smaller than SFT', 'adaptive': 'adjust based on KL'},
'kl_coeff': {'range': [0.01, 0.2], 'tip': 'balance over‑fitting vs under‑fitting', 'adaptive': 'KL target'},
'batch_size': {'range': [16, 128], 'tip': 'smaller batches are more stable', 'constraint': 'GPU memory'},
'ppo_epochs': {'range': [2, 8], 'tip': 'too many epochs cause over‑fitting', 'early_stop': 'monitor KL'}
}
@staticmethod
def dpo_tuning_guide():
return {
'beta': {'range': [0.1, 0.5], 'tip': 'controls deviation from reference', 'effect': 'higher → conservative'},
'learning_rate': {'range': [5e-7, 2e-6], 'tip': 'DPO is very sensitive to LR', 'schedule': 'cosine decay'},
'data_ratio': {'chosen_rejected': '1:1', 'tip': 'keep preference data balanced', 'augmentation': 'can augment rejected samples'}
}
@staticmethod
def kto_tuning_guide():
return {
'lambda_param': {'range': [2.0, 3.0], 'tip': 'loss‑aversion coefficient, typical 2.25'},
'beta': {'range': [0.05, 0.2], 'tip': 'smaller than DPO because binary signal is weaker'},
'data_balance': {'good_bad_ratio': 'prefer balanced', 'minimum': '≥1000 samples per class', 'quality': 'quality > quantity'}
        }

7. Automated Hyper‑Parameter Search
A lightweight Bayesian‑style tuner demonstrates how to explore the hyper‑parameter space for DPO.
import numpy as np

class HyperparameterTuner:
def __init__(self, algorithm='DPO'):
self.algorithm = algorithm
self.best_params = {}
self.search_history = []
def bayesian_search(self, param_space, eval_function, n_trials=50):
for trial in range(n_trials):
if trial == 0:
params = self.get_default_params()
else:
params = self.suggest_params(param_space, trial)
score = eval_function(params)
self.search_history.append((params, score))
if not self.best_params or score > self.best_params['score']:
self.best_params = {'params': params, 'score': score}
print(f"Trial {trial}: Score={score:.4f}, Params={params}")
return self.best_params
def suggest_params(self, param_space, trial):
best_so_far = max(self.search_history, key=lambda x: x[1])[0]
new_params = {}
for key, (low, high) in param_space.items():
noise = np.random.normal(0, (high - low) * 0.1)
new_params[key] = float(np.clip(best_so_far[key] + noise, low, high))
return new_params
def get_default_params(self):
defaults = {
'DPO': {'beta': 0.1, 'lr': 1e-6},
'PPO': {'kl_coeff': 0.1, 'lr': 2e-6, 'ppo_epochs': 4},
'KTO': {'lambda_param': 2.25, 'beta': 0.1, 'lr': 1e-6}
}
return defaults.get(self.algorithm, {})
# Example evaluation function (simulated)
def eval_dpo_params(params):
beta, lr = params['beta'], params['lr']
base_score = 0.75
beta_bonus = max(0, 0.1 - abs(beta - 0.1)) * 0.1
    lr_bonus = max(0, 1e-6 - abs(lr - 1e-6)) * 1e6 * 0.05  # peaks when lr is close to 1e-6
score = base_score + beta_bonus + lr_bonus + np.random.normal(0, 0.02)
return max(0, min(1, score))
param_space = {'beta': (0.05, 0.5), 'lr': (1e-7, 5e-6)}
tuner = HyperparameterTuner('DPO')
best_result = tuner.bayesian_search(param_space, eval_dpo_params, n_trials=20)
print("
Best hyper‑parameter combination:")
print(f" Params: {best_result['params']}")
print(f" Score: {best_result['score']:.4f}")8. Summary of RLHF Stack Practical Takeaways
Algorithm‑Selection Principles
Choose the algorithm that matches the available data (pairwise → DPO/PPO, binary → KTO).
Prioritize resource‑friendly methods when compute budget is limited.
Align the algorithm with specific business requirements such as online learning, safety‑critical use, or latency constraints (see the selector sketch after this list).
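As a concrete illustration, here is a hypothetical call to the AlignmentAlgorithmSelector from section 5; the specs keys simply match the analyze_* helpers defined there:

selector = AlignmentAlgorithmSelector()
specs = {
    'pairwise_data': True,    # chosen/rejected pairs are available
    'binary_feedback': False,
    'data_size': 20000,
    'gpu_budget': 60,         # GPU hours available
    'online_learning': False,
}
for algo, confidence, reason in selector.recommend_algorithm(specs):
    print(f"{algo} (confidence {confidence}): {reason}")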
Experience Highlights
PPO: Most mature and battle‑tested but hardest to tune; best for resource‑rich scenarios.
DPO: Engineering‑friendly and highly cost‑effective; currently the mainstream choice.
KTO: Lowest data barrier, ideal for quick validation; emerging, with strong potential.
Engineering Recommendations
Start with a solid SFT model to ensure basic instruction following.
Focus on data quality; high‑quality annotations outweigh sheer quantity.
Iterate gradually—avoid attempting to solve everything in a single training run.
Build a comprehensive evaluation suite covering offline metrics and online user experience.
Common Pitfalls to Avoid
Ignoring distribution shift between training and production data.
Rushing hyper‑parameter tuning without systematic experimentation.
Assuming offline metrics directly translate to online performance.
Confusing alignment with usefulness; always validate real user impact.
In short, RLHF is not a silver bullet; it requires careful system design, data engineering, and iterative tuning to achieve reliable, user‑aligned language models.
Wu Shixiong's Large Model Academy
We continuously share large‑model know‑how, helping you master core skills (LLM, RAG, fine‑tuning, deployment) from zero to job offer, whether you are switching careers, going through autumn campus recruiting, or looking for a stable large‑model role.