Mastering RLHF, DPO, and KTO: A Complete Guide to Human‑Feedback Alignment Techniques
This comprehensive guide explains the full RLHF training pipeline, the mathematical foundations of reward modeling and PPO, and introduces DPO and KTO algorithms—including their implementations, advantages, limitations, and practical tuning strategies—for building aligned large language models.
1. Complete RLHF Process
Reinforcement Learning from Human Feedback (RLHF) is a three‑stage training paradigm:
SFT (Supervised Fine‑Tuning): The base model learns to follow instructions using the standard language‑modeling loss.
Reward Model (RM) Training: A separate model learns a preference function P(y_1 > y_2 | x) from pairwise human feedback.
PPO Optimization: The policy model is refined with Proximal Policy Optimization, using the reward model to score generated responses while penalizing divergence from the reference model (the objective is written out just below).
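Formally, the PPO stage solves the standard KL‑regularized objective, where π_θ is the policy being trained, π_ref the frozen SFT reference, r_φ the learned reward model, and β the KL penalty coefficient:

\[
\max_{\pi_\theta}\;\mathbb{E}_{x\sim\mathcal{D},\;y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big]\;-\;\beta\,D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
\]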
The following Python class illustrates the full pipeline:
import copy
import torch

# Note: num_epochs, optimizer, and the RewardModel class are assumed to be defined elsewhere;
# the pipeline below is illustrative rather than end-to-end runnable.
class RLHFPipeline:
"""RLHF complete training pipeline"""
def __init__(self, base_model, tokenizer):
self.base_model = base_model
self.tokenizer = tokenizer
# three key models
self.sft_model = None # supervised fine‑tuned model
self.reward_model = None # reward model
self.ppo_model = None # PPO‑optimized model
def stage1_supervised_finetuning(self, instruction_dataset):
"""Stage 1: Supervised Fine‑Tuning (SFT)
Goal: Teach the base model basic instruction following.
"""
print("Stage 1: Supervised Fine‑Tuning…")
self.sft_model = copy.deepcopy(self.base_model)
for epoch in range(num_epochs):
for batch in instruction_dataset:
inputs = batch['instructions'] # e.g., "Explain machine learning"
targets = batch['responses'] # e.g., "Machine learning is a subfield of AI…"
outputs = self.sft_model(inputs, labels=targets)
loss = outputs.loss
loss.backward()
optimizer.step()
print(f"SFT completed, perplexity: {self.evaluate_perplexity(self.sft_model)}")
def stage2_reward_model_training(self, preference_dataset):
"""Stage 2: Reward Model Training
Goal: Learn human preferences to score answer quality.
"""
print("Stage 2: Training reward model…")
self.reward_model = RewardModel(self.sft_model)
for epoch in range(num_epochs):
for batch in preference_dataset:
prompts = batch['prompts'] # e.g., "Explain quantum computing"
chosen = batch['chosen'] # human‑preferred answer
rejected = batch['rejected'] # less‑preferred answer
reward_chosen = self.reward_model(prompts, chosen)
reward_rejected = self.reward_model(prompts, rejected)
                loss = -torch.log(torch.sigmoid(reward_chosen - reward_rejected)).mean()
loss.backward()
optimizer.step()
print(f"Reward model trained, preference accuracy: {self.evaluate_preference_accuracy()}")
def stage3_ppo_optimization(self, prompts_dataset):
"""Stage 3: PPO Optimization
Goal: Use the reward model to improve the policy while staying close to the reference.
"""
print("Stage 3: PPO optimization…")
self.ppo_model = copy.deepcopy(self.sft_model)
        self.reference_model = copy.deepcopy(self.sft_model)  # frozen reference kept for the KL penalty
for iteration in range(num_iterations):
experiences = self.collect_experiences(prompts_dataset)
            self.update_policy_ppo(experiences, self.reference_model)
print("PPO optimization completed")
def collect_experiences(self, prompts):
"""Collect PPO training experiences"""
experiences = []
for prompt in prompts:
response = self.ppo_model.generate(prompt, do_sample=True)
reward = self.reward_model(prompt, response)
log_prob_current = self.ppo_model.get_log_prob(prompt, response)
log_prob_ref = self.reference_model.get_log_prob(prompt, response)
kl_penalty = self.kl_coeff * (log_prob_current - log_prob_ref)
final_reward = reward - kl_penalty
experiences.append({'prompt': prompt, 'response': response, 'reward': final_reward, 'log_prob': log_prob_current})
        return experiences

The objectives of each stage are summarized by the RLHFObjectives helper:
class RLHFObjectives:
@staticmethod
def explain_objectives():
objectives = {
'SFT': {'goal': 'Teach the model to follow instructions', 'input': 'instruction‑response pairs', 'loss': 'Cross‑Entropy', 'output': 'instruction‑following model'},
'RM': {'goal': 'Learn human preference', 'input': 'multiple responses per instruction + ranking', 'loss': 'Preference Ranking', 'output': 'reward model'},
'PPO': {'goal': 'Optimize policy for higher reward', 'input': 'instruction prompts', 'loss': 'PPO loss + KL regularization', 'output': 'aligned final model'}
}
for stage, info in objectives.items():
print(f"
{stage} stage:")
for key, value in info.items():
print(f" {key}: {value}")
RLHFObjectives.explain_objectives()

2. Key Mathematical Foundations of RLHF
Reward Modeling learns a preference function using the Bradley‑Terry model:
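In equation form, with r_φ the reward model and σ the logistic sigmoid, the Bradley‑Terry preference probability and the resulting pairwise ranking loss are:

\[
P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big),
\qquad
\mathcal{L}_{\mathrm{RM}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log\sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]
\]

The RewardModelMath helper below implements exactly these two expressions: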
class RewardModelMath:
@staticmethod
def bradley_terry_model():
"""P(A > B) = sigmoid(r_A - r_B)"""
pass
@staticmethod
def preference_probability(reward_A, reward_B):
return torch.sigmoid(reward_A - reward_B)
@staticmethod
def preference_loss(reward_chosen, reward_rejected):
return -torch.log(torch.sigmoid(reward_chosen - reward_rejected))
# Example
r_good = torch.tensor(2.0)
r_bad = torch.tensor(-1.0)
prob = RewardModelMath.preference_probability(r_good, r_bad)
loss = RewardModelMath.preference_loss(r_good, r_bad)
print(f"Good answer chosen probability: {prob:.3f}")
print(f"Training loss: {loss:.3f}")PPO Mathematics focuses on safe policy updates:
class PPOMath:
@staticmethod
def compute_ratio(log_prob_new, log_prob_old):
return torch.exp(log_prob_new - log_prob_old)
@staticmethod
def clipped_surrogate_objective(ratio, advantage, clip_ratio=0.2):
unclipped_obj = ratio * advantage
clipped_ratio = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
clipped_obj = clipped_ratio * advantage
return torch.min(unclipped_obj, clipped_obj)
@staticmethod
def kl_regularization(log_prob_new, log_prob_old, kl_coeff=0.1):
        kl_div = log_prob_new - log_prob_old  # sample-based estimate of KL(new ‖ old); penalizes drift from the old policy
return kl_coeff * kl_div
@staticmethod
def ppo_loss(log_prob_new, log_prob_old, advantage, rewards, kl_coeff=0.1):
ratio = PPOMath.compute_ratio(log_prob_new, log_prob_old)
policy_loss = PPOMath.clipped_surrogate_objective(ratio, advantage)
kl_penalty = PPOMath.kl_regularization(log_prob_new, log_prob_old, kl_coeff)
return -policy_loss + kl_penalty
# Numerical example
log_prob_new = torch.tensor([-2.0, -1.5, -3.0])
log_prob_old = torch.tensor([-2.2, -1.8, -2.5])
advantage = torch.tensor([1.0, -0.5, 2.0])
ratio = PPOMath.compute_ratio(log_prob_new, log_prob_old)
clipped_obj = PPOMath.clipped_surrogate_objective(ratio, advantage)
print(f"Probability ratio: {ratio}")
print(f"Clipped objective: {clipped_obj}")3. Why DPO Does Not Need a Reward Model
DPO (Direct Preference Optimization) bypasses the reward‑model stage by directly optimizing the policy using implicit rewards derived from the policy‑reference log‑probability difference.
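Written out, the resulting DPO loss is a logistic loss on the difference of implicit rewards, where y_w is the chosen response, y_l the rejected one, and β controls how far the policy may drift from the reference:

\[
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log\sigma\Big(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Big)\Big]
\]

The DPOTheory sketch below retraces the derivation and implements this loss: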
class DPOTheory:
"""Derivation of DPO"""
def theoretical_derivation(self):
"""
1. Standard RLHF objective: π* = argmaxπ E[r] - β·KL(π‖π_ref)
2. Closed‑form optimal policy: π*(y|x) ∝ π_ref(y|x)·exp(r/β)
3. Rearranging gives r = β·log(π*/π_ref) + β·log Z(x)
4. Preference probability becomes a sigmoid of the log‑ratio difference.
"""
pass
def dpo_loss(self, policy_model, reference_model, batch):
prompts = batch['prompts']
chosen = batch['chosen']
rejected = batch['rejected']
log_pi_chosen = policy_model.get_log_prob(prompts, chosen)
log_pi_rejected = policy_model.get_log_prob(prompts, rejected)
log_ref_chosen = reference_model.get_log_prob(prompts, chosen)
log_ref_rejected = reference_model.get_log_prob(prompts, rejected)
implicit_reward_chosen = self.beta * (log_pi_chosen - log_ref_chosen)
implicit_reward_rejected = self.beta * (log_pi_rejected - log_ref_rejected)
loss = -torch.log(torch.sigmoid(implicit_reward_chosen - implicit_reward_rejected))
return loss.mean()
def compare_with_ppo(self):
comparison = {
'Training stages': {'PPO': 'SFT → RM → PPO (3 stages)', 'DPO': 'SFT → DPO (2 stages)'},
'Data required': {'PPO': 'instruction + preference + online prompts', 'DPO': 'instruction + preference'},
'Complexity': {'PPO': 'high (online generation, reward eval)', 'DPO': 'medium (offline)'}
}
for aspect, details in comparison.items():
print(f"
{aspect}:")
for method, desc in details.items():
print(f" {method}: {desc}")
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPOImplementation(nn.Module):
"""Full DPO implementation"""
def __init__(self, model, reference_model, beta=0.1):
super().__init__()
self.model = model
self.reference_model = reference_model
self.beta = beta
for p in self.reference_model.parameters():
p.requires_grad = False
def forward(self, batch):
prompts = batch['prompt']
chosen = batch['chosen']
rejected = batch['rejected']
chosen_logps = self.get_log_probs(self.model, prompts, chosen)
rejected_logps = self.get_log_probs(self.model, prompts, rejected)
with torch.no_grad():
ref_chosen_logps = self.get_log_probs(self.reference_model, prompts, chosen)
ref_rejected_logps = self.get_log_probs(self.reference_model, prompts, rejected)
pi_logratios = chosen_logps - rejected_logps
ref_logratios = ref_chosen_logps - ref_rejected_logps
logits = self.beta * (pi_logratios - ref_logratios)
loss = -F.logsigmoid(logits).mean()
with torch.no_grad():
accuracy = (logits > 0).float().mean()
chosen_rewards = self.beta * (chosen_logps - ref_chosen_logps)
rejected_rewards = self.beta * (rejected_logps - ref_rejected_logps)
metrics = {
'loss': loss.item(),
'accuracy': accuracy.item(),
'chosen_rewards': chosen_rewards.mean().item(),
'rejected_rewards': rejected_rewards.mean().item(),
'reward_margin': (chosen_rewards - rejected_rewards).mean().item()
}
return loss, metrics
def get_log_probs(self, model, prompts, responses):
"""Compute token‑level log probabilities (simplified)"""
inputs = self.tokenize_batch(prompts, responses)
with torch.cuda.amp.autocast():
outputs = model(**inputs, use_cache=False)
log_probs = F.log_softmax(outputs.logits, dim=-1)
response_log_probs = self.extract_response_log_probs(log_probs, inputs['labels'])
        return response_log_probs.sum(dim=-1)

DPO Advantages
Simplified pipeline – no separate reward model.
More stable training; fewer hyper‑parameters.
Theoretically grounded: the loss is derived in closed form from the same KL‑regularized RLHF objective that PPO optimizes.
DPO Limitations
Only works with offline preference data; cannot explore online.
Performance may drop if test distribution diverges from training data.
Still relies on high‑quality pairwise annotations.
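Before moving on to KTO, here is a minimal, hypothetical sketch of how the DPOImplementation module above might be driven in practice. The policy, ref_model, and preference_loader objects (batches with 'prompt', 'chosen', and 'rejected' fields) are assumed to exist and are not defined in this article, and get_log_probs above is itself only sketched, so treat this as illustrative rather than directly runnable:

import torch

# Assumed to exist: policy (the SFT model being tuned), ref_model (a frozen copy of it),
# and preference_loader yielding dicts with 'prompt', 'chosen', 'rejected'.
dpo = DPOImplementation(policy, ref_model, beta=0.1)
optimizer = torch.optim.AdamW(dpo.model.parameters(), lr=1e-6)

for batch in preference_loader:
    loss, metrics = dpo(batch)   # forward pass returns the DPO loss and logging metrics
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"loss={metrics['loss']:.4f}  acc={metrics['accuracy']:.3f}")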
4. KTO: Kahneman‑Tversky Optimization
KTO applies prospect theory from behavioral economics to handle binary (thumbs‑up / thumbs‑down) feedback.
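In the simplified form used in the sketches below (the published KTO objective additionally uses a KL‑based reference point and separate weights for desirable and undesirable examples), the per‑example loss is built from the implicit reward r_θ(x, y):

\[
r_\theta(x, y) = \beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)},
\qquad
\mathcal{L}_{\mathrm{KTO}}(x, y) =
\begin{cases}
-\log\sigma\big(r_\theta(x, y)\big) & \text{if } y \text{ is labeled good}\\
-\log\sigma\big(-\lambda\, r_\theta(x, y)\big) & \text{if } y \text{ is labeled bad}
\end{cases}
\]

where λ > 1 is the loss‑aversion coefficient from prospect theory, so drifting toward an answer flagged as bad is penalized more heavily than an equal drift away from a good one.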
class KTOTheory:
def __init__(self, lambda_param=2.25, beta=0.1):
self.lambda_param = lambda_param # loss‑aversion coefficient
self.beta = beta # KL regularization coefficient
    def prospect_theory_value_function(self, x, reference_point=0):
        if x >= reference_point:
            return x - reference_point  # gains are valued linearly
        else:
            return -self.lambda_param * (reference_point - x)  # losses are amplified by the loss-aversion coefficient
def kto_objective(self, policy_logp, reference_logp, label):
"""Compute KTO loss for a single example.
label=1 → good answer, label=0 → bad answer.
"""
implicit_reward = self.beta * (policy_logp - reference_logp)
if label == 1:
utility = self.prospect_theory_value_function(implicit_reward, 0)
return utility # maximize
else:
utility = self.prospect_theory_value_function(implicit_reward, 0)
return -utility # minimize (loss‑aversion)
import numpy as np
import torch
import torch.nn as nn

class KTOImplementation(nn.Module):
"""Full KTO implementation"""
def __init__(self, model, reference_model, lambda_param=2.25, beta=0.1):
super().__init__()
self.model = model
self.reference_model = reference_model
self.lambda_param = lambda_param
self.beta = beta
for p in self.reference_model.parameters():
p.requires_grad = False
def forward(self, batch):
prompts = batch['prompt']
responses = batch['response']
labels = batch['label'] # 1 = good, 0 = bad
policy_logps = self.get_log_probs(self.model, prompts, responses)
with torch.no_grad():
reference_logps = self.get_log_probs(self.reference_model, prompts, responses)
losses = []
rewards = []
for i in range(len(labels)):
implicit_reward = self.beta * (policy_logps[i] - reference_logps[i])
rewards.append(implicit_reward.item())
if labels[i] == 1:
loss = -torch.log(torch.sigmoid(implicit_reward))
else:
loss = -torch.log(torch.sigmoid(-self.lambda_param * implicit_reward))
losses.append(loss)
total_loss = torch.stack(losses).mean()
with torch.no_grad():
good_rewards = [r for r, l in zip(rewards, labels) if l == 1]
bad_rewards = [r for r, l in zip(rewards, labels) if l == 0]
metrics = {
'loss': total_loss.item(),
'good_reward_mean': np.mean(good_rewards) if good_rewards else 0,
'bad_reward_mean': np.mean(bad_rewards) if bad_rewards else 0,
'reward_separation': (np.mean(good_rewards) - np.mean(bad_rewards)) if good_rewards and bad_rewards else 0
}
        return total_loss, metrics

KTO vs DPO vs PPO – a concise comparison table:
comparison_table = {
'PPO': {'Data requirement': 'preference + reward model', 'Stages': 'SFT → RM → PPO', 'Theory': 'Reinforcement learning', 'Pros': 'Mature, reliable', 'Cons': 'Complex, sensitive to hyper‑params'},
    'DPO': {'Data requirement': 'pairwise preference', 'Stages': 'SFT → DPO', 'Theory': 'Closed‑form RLHF (implicit reward)', 'Pros': 'Simpler, stable', 'Cons': 'Offline only, distribution‑sensitive'},
'KTO': {'Data requirement': 'binary feedback', 'Stages': 'SFT → KTO', 'Theory': 'Prospect theory', 'Pros': 'Low data cost, aligns with human intuition', 'Cons': 'Newer, limited empirical evidence'}
}
for aspect in ['Data requirement', 'Stages', 'Theory', 'Pros', 'Cons']:
print(f"
{aspect}:")
for alg in ['PPO', 'DPO', 'KTO']:
print(f" {alg}: {comparison_table[alg][aspect]}")5. Practical Algorithm‑Selection Decision Tree
The following selector recommends an alignment algorithm based on data type, resource constraints, and business needs.
class AlignmentAlgorithmSelector:
"""Selects the most suitable alignment algorithm for a project"""
def __init__(self):
self.decision_tree = self.build_decision_tree()
def build_decision_tree(self):
return {
'data_type': {
'pairwise_preferences': {
'resource_constraint': {'high': 'DPO', 'medium': 'DPO', 'low': 'PPO'}
},
'binary_feedback': {
'data_size': {'large': 'KTO', 'small': 'Convert to pairwise then DPO'}
},
'scalar_rewards': {
'model_type': {'online': 'PPO', 'offline': 'Reward‑based fine‑tuning'}
}
}
}
def recommend_algorithm(self, specs):
recommendations = []
data = self.analyze_data_characteristics(specs)
resources = self.analyze_resource_constraints(specs)
business = self.analyze_business_requirements(specs)
if data['has_pairwise_data'] and resources['gpu_hours'] < 100:
recommendations.append(('DPO', 0.9, 'Pairwise data with limited resources'))
if data['has_binary_feedback'] and data['data_size'] > 50000:
recommendations.append(('KTO', 0.85, 'Large binary‑feedback dataset'))
if business['need_online_learning']:
recommendations.append(('PPO', 0.8, 'Online learning required'))
return sorted(recommendations, key=lambda x: x[1], reverse=True)
def analyze_data_characteristics(self, specs):
return {
'has_pairwise_data': specs.get('pairwise_data', False),
'has_binary_feedback': specs.get('binary_feedback', False),
'has_scalar_rewards': specs.get('scalar_rewards', False),
'data_size': specs.get('data_size', 0)
}
def analyze_resource_constraints(self, specs):
return {
'gpu_hours': specs.get('gpu_budget', 0),
'memory_gb': specs.get('memory_limit', 0),
'training_time_days': specs.get('time_limit', 0)
}
def analyze_business_requirements(self, specs):
return {
'need_online_learning': specs.get('online_learning', False),
'safety_critical': specs.get('safety_critical', False),
'interpretability_needed': specs.get('interpretability', False),
'latency_sensitive': specs.get('latency_sensitive', False)
        }

6. Tuning Guides for Alignment Algorithms
Each algorithm has a set of recommended hyper‑parameters and practical tips.
class AlignmentTuningStrategies:
@staticmethod
def ppo_tuning_guide():
return {
'learning_rate': {'range': [1e-6, 5e-6], 'tip': '10× smaller than SFT', 'adaptive': 'adjust based on KL'},
'kl_coeff': {'range': [0.01, 0.2], 'tip': 'balance over‑fitting vs under‑fitting', 'adaptive': 'KL target'},
'batch_size': {'range': [16, 128], 'tip': 'smaller batches are more stable', 'constraint': 'GPU memory'},
'ppo_epochs': {'range': [2, 8], 'tip': 'too many epochs cause over‑fitting', 'early_stop': 'monitor KL'}
}
@staticmethod
def dpo_tuning_guide():
return {
'beta': {'range': [0.1, 0.5], 'tip': 'controls deviation from reference', 'effect': 'higher → conservative'},
'learning_rate': {'range': [5e-7, 2e-6], 'tip': 'DPO is very sensitive to LR', 'schedule': 'cosine decay'},
'data_ratio': {'chosen_rejected': '1:1', 'tip': 'keep preference data balanced', 'augmentation': 'can augment rejected samples'}
}
@staticmethod
def kto_tuning_guide():
return {
'lambda_param': {'range': [2.0, 3.0], 'tip': 'loss‑aversion coefficient, typical 2.25'},
'beta': {'range': [0.05, 0.2], 'tip': 'smaller than DPO because binary signal is weaker'},
'data_balance': {'good_bad_ratio': 'prefer balanced', 'minimum': '≥1000 samples per class', 'quality': 'quality > quantity'}
        }

7. Automated Hyper‑Parameter Search
A lightweight Bayesian‑style tuner demonstrates how to explore the hyper‑parameter space for DPO.
import numpy as np

class HyperparameterTuner:
def __init__(self, algorithm='DPO'):
self.algorithm = algorithm
self.best_params = {}
self.search_history = []
def bayesian_search(self, param_space, eval_function, n_trials=50):
for trial in range(n_trials):
if trial == 0:
params = self.get_default_params()
else:
params = self.suggest_params(param_space, trial)
score = eval_function(params)
self.search_history.append((params, score))
if not self.best_params or score > self.best_params['score']:
self.best_params = {'params': params, 'score': score}
print(f"Trial {trial}: Score={score:.4f}, Params={params}")
return self.best_params
def suggest_params(self, param_space, trial):
best_so_far = max(self.search_history, key=lambda x: x[1])[0]
new_params = {}
for key, (low, high) in param_space.items():
noise = np.random.normal(0, (high - low) * 0.1)
new_params[key] = float(np.clip(best_so_far[key] + noise, low, high))
return new_params
def get_default_params(self):
defaults = {
'DPO': {'beta': 0.1, 'lr': 1e-6},
'PPO': {'kl_coeff': 0.1, 'lr': 2e-6, 'ppo_epochs': 4},
'KTO': {'lambda_param': 2.25, 'beta': 0.1, 'lr': 1e-6}
}
return defaults.get(self.algorithm, {})
# Example evaluation function (simulated)
def eval_dpo_params(params):
beta, lr = params['beta'], params['lr']
base_score = 0.75
beta_bonus = max(0, 0.1 - abs(beta - 0.1)) * 0.1
    lr_bonus = max(0, 1e-6 - abs(lr - 1e-6)) * 1e6 * 0.05  # peaks when lr is close to 1e-6
score = base_score + beta_bonus + lr_bonus + np.random.normal(0, 0.02)
return max(0, min(1, score))
param_space = {'beta': (0.05, 0.5), 'lr': (1e-7, 5e-6)}
tuner = HyperparameterTuner('DPO')
best_result = tuner.bayesian_search(param_space, eval_dpo_params, n_trials=20)
print("
Best hyper‑parameter combination:")
print(f" Params: {best_result['params']}")
print(f" Score: {best_result['score']:.4f}")8. Summary of RLHF Stack Practical Takeaways
Algorithm‑Selection Principles
Choose the algorithm that matches the available data (pairwise → DPO/PPO, binary → KTO).
Prioritize resource‑friendly methods when compute budget is limited.
Align the algorithm with specific business requirements such as online learning, safety‑critical use, or latency constraints (see the selector sketch after this list).
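As a concrete illustration, here is a hypothetical call to the AlignmentAlgorithmSelector from section 5; the specs keys simply match the analyze_* helpers defined there:

selector = AlignmentAlgorithmSelector()
specs = {
    'pairwise_data': True,    # chosen/rejected pairs are available
    'binary_feedback': False,
    'data_size': 20000,
    'gpu_budget': 60,         # GPU hours available
    'online_learning': False,
}
for algo, confidence, reason in selector.recommend_algorithm(specs):
    print(f"{algo} (confidence {confidence}): {reason}")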
Experience Highlights
PPO: Most mature and battle‑tested but hardest to tune; best for resource‑rich scenarios.
DPO: Engineering‑friendly and highly cost‑effective; currently the mainstream choice.
KTO: Lowest data barrier, ideal for quick validation; emerging, with strong potential.
Engineering Recommendations
Start with a solid SFT model to ensure basic instruction following.
Focus on data quality; high‑quality annotations outweigh sheer quantity.
Iterate gradually—avoid attempting to solve everything in a single training run.
Build a comprehensive evaluation suite covering offline metrics and online user experience.
Common Pitfalls to Avoid
Ignoring distribution shift between training and production data.
Rushing hyper‑parameter tuning without systematic experimentation.
Assuming offline metrics directly translate to online performance.
Confusing alignment with usefulness; always validate real user impact.
In short, RLHF is not a silver bullet; it requires careful system design, data engineering, and iterative tuning to achieve reliable, user‑aligned language models.
Wu Shixiong's Large Model Academy
We continuously share large‑model know‑how, helping you master core skills (LLM, RAG, fine‑tuning, deployment) from zero to job offer, whether you are switching careers, going through autumn campus recruiting, or looking for a stable large‑model role.