A Comprehensive Guide to LLM Post‑Training: From RLHF and GRPO to Agentic RL
This article systematically explains the post‑training pipeline for large language models, covering supervised fine‑tuning, RLHF, PPO, GRPO, RLVR, DPO and emerging Agentic RL, while illustrating each method with analogies, detailed workflows, tables, and recent research findings.
1. Introduction: What Is Post‑Training?
Large language model (LLM) training consists of two stages: pre‑training on massive unlabeled text to learn basic language patterns and world knowledge, and post‑training, which refines the "raw" model into a useful product by teaching it to follow instructions, align with human preferences, reason, and use tools.
Since the release of ChatGPT in 2022, post‑training techniques have exploded. In short, SFT teaches the model "what to say", preference optimization teaches "what to choose", and reinforcement learning teaches the model "how to think".
2. Intuition: A Restaurant Analogy
Imagine opening a restaurant and hiring a talented chef (the pre‑trained model). The chef has read every cookbook (pre‑training data) but has never cooked for guests.
SFT is like a senior chef showing the new chef how to prepare a few signature dishes – the chef learns the correct format and style.
RLHF lets diners taste several dishes and rank them; the ranking becomes a reward model, and the chef improves via PPO/GRPO based on those scores.
DPO skips the separate reward model and directly learns from pairwise preferences.
RLVR is akin to a cooking competition judged by objective criteria (e.g., cake baked in 30 minutes), removing human bias.
Agentic RL turns the chef into a full‑blown head chef who can read recipes, shop for ingredients, and coordinate the kitchen.
3. Technical Deep Dive
3.1 Supervised Fine‑Tuning (SFT) – The Starting Point
SFT fine‑tunes a pre‑trained model on high‑quality (prompt, response) pairs using the standard cross‑entropy loss. Data sources include instruction‑following datasets (Alpaca, ShareGPT), domain‑specific corpora, and multi‑turn dialogues. Synthetic data generated by stronger models (e.g., GPT‑4) is increasingly common; training on it is effectively a form of knowledge distillation.
Implementation variants:
Full‑parameter fine‑tuning.
Parameter‑efficient fine‑tuning (PEFT) such as LoRA or QLoRA, which injects trainable low‑rank adapter matrices alongside the frozen weight matrices and typically updates only 0.1%–1% of the original parameter count.
Key insight: SFT teaches the model the desired output format and style but cannot make the model surpass the capabilities present in the training data, which is why reinforcement learning is required.
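To make the loss concrete, here is a minimal PyTorch sketch of SFT with prompt masking: cross‑entropy is computed only on response tokens. The function name and the single shared prompt_len are illustrative simplifications; real implementations carry per‑example label masks.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Cross-entropy on response tokens only; prompt tokens are masked.

    logits:     (batch, seq_len, vocab) from a causal LM forward pass
    input_ids:  (batch, seq_len) prompt + response token ids
    prompt_len: number of prompt tokens to exclude from the loss
    """
    # Shift so that the logits at position t predict token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Mask prompt tokens with the -100 sentinel that cross_entropy ignores.
    shift_labels[:, : prompt_len - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```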
3.2 RLHF – Human‑Feedback Reinforcement Learning
RLHF, the core technique behind InstructGPT and ChatGPT, follows a three‑step pipeline (OpenAI 2022):
Step 1 – SFT: Fine‑tune the base model on high‑quality human‑written answers.
Step 2 – Reward Model Training: For each prompt, generate multiple responses (4–9 in InstructGPT) and have human annotators rank them. The rankings train a reward model that assigns a scalar score to any response.
Step 3 – PPO Optimization: Use the reward model as the reward signal and apply Proximal Policy Optimization (PPO) to further improve the policy.
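In Step 2, a ranking of K responses is typically decomposed into all pairwise comparisons, and the reward model is trained with a Bradley‑Terry style loss. A minimal sketch, with illustrative names:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise Bradley-Terry loss for reward model training.

    r_chosen / r_rejected: scalar reward-model scores, shape (batch,),
    for the annotator-preferred response vs. the rejected one.
    Minimizing this pushes r_chosen above r_rejected.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example: scores for a batch of two preference pairs.
loss = reward_model_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
```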
A notable variant, RLAIF, replaces human annotators with an AI model to provide preference feedback, dramatically reducing labeling cost. Anthropic’s Constitutional AI exemplifies this approach.
3.3 PPO – The Workhorse RL Algorithm
PPO treats the LLM as a policy that maps a prompt (state) to a token sequence (action). The objective maximizes expected reward while constraining policy updates via a clipping mechanism to ensure stability.
PPO in LLM training maintains four models:
Policy Model: the trainable LLM that generates answers.
Reference Model: a frozen copy of the initial policy, used to compute KL‑divergence.
Reward Model: scores each answer.
Value Model (Critic): estimates the state‑value for advantage calculation.
The clipped surrogate objective is L(θ) = E[min(r(θ)·A, clip(r(θ), 1−ε, 1+ε)·A)], where r(θ) is the probability ratio between the new and old policies, A is the advantage estimate, and ε (typically 0.1–0.2) defines the clipping range. This clipping is what gives PPO its stability.
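The objective translates almost line for line into PyTorch. A minimal sketch (the KL penalty against the reference model, usually added separately, is omitted):

```python
import torch

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Per-token PPO clipped surrogate (to be maximized; negate for a loss).

    logp_new / logp_old: log-probs of the sampled tokens under the
    current and rollout-time policies; advantages: advantage estimates A.
    """
    ratio = torch.exp(logp_new - logp_old)               # r(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return torch.min(unclipped, clipped).mean()
```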
Main drawbacks: PPO requires keeping four models in memory simultaneously, which is costly; the training loop alternates between rollout generation and parameter updates, adding engineering complexity; and performance is sensitive to hyper‑parameter choices.
3.4 GRPO – Group‑Relative Policy Optimization
GRPO (Group Relative Policy Optimization), introduced by the DeepSeek team in 2024, replaces the value model with a group‑wise ranking mechanism. For each prompt, G responses (typically 8–64) are sampled and their rewards are normalized within the group. Answers above the group mean receive positive advantage, while those below receive negative advantage, eliminating the need for a separate critic.
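The group‑relative advantage fits in a few lines. A minimal sketch, assuming one prompt with G scalar rewards:

```python
import torch

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for G sampled answers to one prompt.

    Each answer's advantage is its reward standardized within the
    group, so no learned critic is required.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 samples scored by a binary verifier (1 = correct).
adv = grpo_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0]))
# Correct answers get positive advantage, wrong ones negative.
```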
Compared with PPO, GRPO cuts the number of models held in memory from four to two or three: the critic is always gone, and with rule‑based rewards the separate reward model goes too. Its rollout step is also simpler to implement: sample a group of answers per prompt and score them, rather than running critic‑assisted single rollouts.
3.5 RLVR – Reinforcement Learning with Verifiable Rewards
RLVR, highlighted as a major 2025 trend, substitutes learned reward models with deterministic rule‑based verifiers. It is suited for domains where answers can be objectively checked:
Mathematics – string matching or specialized math‑verify tools.
Code – sandbox execution with test‑case validation.
Logical reasoning – rule‑based consistency checks.
Scientific questions – LLM judges evaluate answer equivalence.
Reward design typically combines an accuracy component (correctness) with a format component (e.g., enforcing <think>…</think><answer>…</answer> tags). DeepSeek‑R1, trained with GRPO + RLVR, is the representative model of this approach.
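A minimal sketch of such a combined reward, assuming exact string matching and illustrative weights (production systems substitute math‑verify tools or sandboxed test execution for the accuracy term):

```python
import re

def rlvr_reward(response: str, gold_answer: str) -> float:
    """Rule-based reward: format compliance plus answer correctness."""
    m = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>",
                  response, re.DOTALL)
    if m is None:
        return 0.0                      # malformed output: no reward
    format_reward = 0.1                 # small bonus for correct tags
    answer = m.group(1).strip()
    accuracy_reward = 1.0 if answer == gold_answer.strip() else 0.0
    return format_reward + accuracy_reward
```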
Key insight: RLVR sidesteps two major pain points of RLHF, reward hacking and high annotation cost, because in verifiable settings the checking rule itself acts as an essentially noise‑free reward function.
3.6 DPO and Its Variants – Preference Optimization Without RL
DPO (Direct Preference Optimization) emerged in 2023 and reframes the optimal RLHF solution as a simple classification loss. Given a pair (preferred, rejected), DPO maximizes the log‑probability ratio of the preferred response while regularizing against a reference model.
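The resulting loss is a logistic regression over log‑probability ratios. A minimal PyTorch sketch, where each argument is the summed log‑probability of a full response under the trainable policy (pi) or the frozen reference model (ref):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pi_w, logp_pi_l, logp_ref_w, logp_ref_l, beta=0.1):
    """DPO loss over a batch of (preferred w, rejected l) pairs."""
    ratio_w = logp_pi_w - logp_ref_w    # implicit reward of the winner
    ratio_l = logp_pi_l - logp_ref_l    # implicit reward of the loser
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
```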
Subsequent variants address DPO’s limitations:
SimPO: drops the reference model and scores responses by length‑normalized log‑probability, giving simpler and more stable gradients; reported to work well with noisy or crowdsourced pairwise data.
ORPO: optimizes in odds‑space to handle class imbalance, especially for multilingual or long‑tail data.
KTO: introduces an asymmetric loss based on prospect theory and learns from unpaired "desirable/undesirable" labels, targeting high‑risk domains such as law or medicine.
All these methods are offline: they train on a static preference dataset and do not require the model to generate new answers during training, making them simpler and more stable than online RL methods.
3.7 DeepSeek‑R1 – Pure RL Training of a Reasoning Model
DeepSeek‑R1 (early 2025) demonstrated that pure RL training, without any SFT warm‑start, can yield strong reasoning abilities. Its report describes two training routes:
R1‑Zero (pure RL route): Directly applies GRPO + RLVR to the pre‑trained DeepSeek‑V3 base model, skipping SFT.
R1 (full route): Starts from R1‑Zero, adds SFT data for a cold‑start, then continues RL training. This improves format compliance and readability while preserving reasoning power.
During training, the model spontaneously exhibits behaviors such as self‑reflection ("Wait, let me reconsider…"), problem decomposition, and multi‑path exploration—phenomena described as "Aha moments" that emerge without explicit programming.
Another observation is that answer length naturally grows as RL progresses, indicating that the model learns to "think longer" for harder problems, effectively reflecting inference‑time scaling on the training side.
3.8 GRPO Improvements: DAPO, Dr.GRPO, and Engineering Tricks
Original GRPO suffers from entropy collapse: as training proceeds, the policy’s entropy drops sharply, and sampled answers become nearly identical, harming exploration.
DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) proposes four key fixes:
Clip‑Higher: Raises the upper clipping bound for positive advantage, encouraging bolder exploration while still constraining negative updates.
Dynamic Sampling: Filters out prompts whose G sampled answers are all correct (too easy) or all wrong (too hard), keeping only prompts that provide a discriminative signal; see the sketch after this list.
Overlong Filtering: Masks answers that hit the maximum length out of the loss instead of assigning them a negative penalty, so the model is not punished for otherwise sound reasoning that happens to run long.
Token‑level Loss: Computes loss per token rather than per sequence, avoiding over‑weighting of long sequences.
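A minimal sketch of the Dynamic Sampling filter, assuming binary verifier rewards (names illustrative):

```python
def keep_prompt(group_rewards):
    """Drop prompts whose G sampled answers are all correct or all
    wrong: their group-relative advantages are zero everywhere and
    contribute no gradient signal."""
    return 0 < sum(group_rewards) < len(group_rewards)

keep_prompt([1, 1, 1, 1])   # False: too easy
keep_prompt([0, 0, 0, 0])   # False: too hard
keep_prompt([1, 0, 1, 0])   # True: informative
```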
Dr.GRPO identifies a length‑normalization bias in GRPO: dividing each answer's loss by its own length mutes the penalty on long incorrect answers, so responses drift longer without becoming more accurate. Removing the normalization step curbs this artificial length inflation.
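The difference between the two aggregation schemes, in an illustrative sketch:

```python
import torch

def per_answer_loss(token_losses, mode="grpo"):
    """Aggregate one sampled answer's per-token losses.

    GRPO divides by the answer's own length, which shrinks the penalty
    on long incorrect answers; Dr.GRPO simply sums, so every token
    carries equal weight regardless of sequence length.
    """
    if mode == "grpo":
        return token_losses.sum() / token_losses.numel()
    return token_losses.sum()           # Dr.GRPO: no length division
```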
4. Global View: How the Components Work Together
A typical post‑training pipeline consists of four stages:
Stage 1 – SFT Cold‑Start: Fine‑tune the base model on high‑quality instruction‑following and chain‑of‑thought data to learn output format and basic reasoning.
Stage 2 – RL Reasoning Training (RLVR): Apply large‑scale RL (GRPO or its variants) in verifiable domains such as mathematics and code to boost reasoning capabilities.
Stage 3 – Preference Alignment: Use DPO or RLHF to align the model’s style, safety, and usefulness with human preferences.
Stage 4 – Rejection Sampling + Distillation (optional): Generate high‑quality reasoning data with the large model and distill it into smaller models; DeepSeek‑R1 follows this approach to produce 1.5B–70B models.
5. Frontiers (2025‑2026)
5.1 Agentic RL – From Answering Questions to Completing Tasks
Traditional RLHF/RLVR focuses on single‑turn QA. Agentic RL trains models to interleave reasoning with tool use (search engines, calculators, code interpreters) across multi‑step tasks. Challenges include credit assignment across steps, sparse rewards (only at task completion), and resource competition between reasoning and tool invocation.
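To fix ideas, here is a minimal reason‑act episode loop; every name (policy, tools, final_answer) is illustrative rather than a real API:

```python
def agent_episode(policy, tools, task, max_steps=10):
    """Roll out one multi-step episode: the policy alternates free-form
    reasoning with tool calls, and reward arrives only when the task is
    judged complete -- exactly the sparse-reward, credit-assignment
    setting described above."""
    trajectory, observation = [], task
    for _ in range(max_steps):
        thought, action, arg = policy(observation)
        trajectory.append((observation, thought, action, arg))
        if action == "final_answer":
            break
        observation = tools[action](arg)   # e.g. search, calculator
    return trajectory   # scored afterwards by a task-level verifier
```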
5.2 Reward Model Evolution
Reward models are moving beyond simple scalar scorers:
Process Reward Model (PRM): Scores each reasoning step.
Generative Reward Model: Uses a separate LLM as a judge.
Multi‑objective Reward: Simultaneously optimizes accuracy, safety, brevity, etc.
5.3 Synthetic Data’s Growing Role
The prevailing practice is a generate‑verify‑train loop: a strong model produces many candidate answers, a verifier filters correct ones, and the curated set is used for SFT or as warm‑up data for RL. This loop is becoming the standard paradigm for scaling post‑training.
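A minimal sketch of one round of that loop, with every name (generator, verifier, trainer) illustrative:

```python
def generate_verify_train(generator, verifier, trainer, prompts, k=8):
    """One round of the generate-verify-train loop.

    generator: strong model that produces k candidates per prompt
    verifier:  returns True for candidates that pass checking
    trainer:   consumes curated (prompt, answer) pairs for SFT / RL warm-up
    """
    curated = []
    for prompt in prompts:
        candidates = [generator(prompt) for _ in range(k)]
        curated += [(prompt, c) for c in candidates if verifier(prompt, c)]
    trainer(curated)
```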
6. Key Takeaways
SFT is the foundation but cannot alone endow models with advanced reasoning.
RL (PPO → GRPO → DAPO) is the primary driver of reasoning gains; removing the critic dramatically reduces resource demands, and fixes for entropy collapse keep exploration alive.
Reward‑signal design is pivotal: the field has progressed from human‑feedback RLHF to AI‑feedback RLAIF and now to rule‑based RLVR.
Online RL excels at learning from exploration, while offline preference methods (DPO, SimPO, ORPO, KTO) offer simplicity and stability for alignment.
Agentic RL, which equips models with tool‑use and multi‑step planning, represents the next frontier of LLM post‑training.