What’s Driving the Next Wave of LLM Post‑Training? A Deep Dive into SFT, RLHF, GRPO and Emerging Trends

This article systematically reviews the core post‑training techniques for large language models—including supervised fine‑tuning, RLHF, PPO, GRPO, DPO, RLVR and Agentic RL—explains their evolution, compares their trade‑offs, and highlights the most promising research directions for 2025‑2026.

Data Party THU

Introduction

Large language model (LLM) training consists of two stages. Pre‑training learns language patterns from massive unlabeled text, producing a generic base model. Post‑training refines this base model so that it follows instructions, aligns with human preferences, and acquires reasoning and tool‑use capabilities.

Core Post‑training Methods

Supervised Fine‑Tuning (SFT)

SFT fine‑tunes a pre‑trained model on high‑quality (prompt, response) pairs using the cross‑entropy loss. Typical data sources are instruction‑following datasets (e.g., Alpaca, ShareGPT), domain‑specific corpora, and multi‑turn dialogues. Synthetic data generated by stronger models (knowledge distillation) is increasingly used to augment SFT. Implementations include:

Full‑parameter fine‑tuning.

Parameter‑efficient fine‑tuning (PEFT), such as LoRA or QLoRA, which updates only about 0.1–1% of the parameters by adding trainable low‑rank adapters.
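As a concrete illustration of the SFT objective, here is a minimal PyTorch‑style sketch: cross‑entropy computed only on response tokens, with prompt positions masked via the conventional -100 label. The HF‑style `model(...).logits` access is an assumption, not a guarantee about any particular library:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Cross-entropy on (prompt, response) pairs.

    `labels` equals `input_ids` except that prompt tokens are
    set to -100 so only response tokens contribute to the loss.
    Assumes an HF-style model whose output exposes `.logits`.
    """
    logits = model(input_ids).logits              # (B, T, V)
    # Shift so each position predicts the next token.
    logits = logits[:, :-1, :].contiguous()
    targets = labels[:, 1:].contiguous()
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        ignore_index=-100,                        # mask prompt tokens
    )
```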

Reinforcement Learning from Human Feedback (RLHF)

RLHF follows a three‑step pipeline:

SFT: Train an initial instruction‑following model.

Reward Model (RM): For each prompt, generate multiple responses, collect human rankings, and train a scalar reward model (a loss sketch follows this list).

PPO Optimization: Use Proximal Policy Optimization (PPO) to improve the policy model against the RM while keeping a frozen reference model for a KL penalty.
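A minimal sketch of the pairwise (Bradley–Terry) ranking loss used in step 2, assuming a reward model `rm` that maps a tokenized sequence to one scalar per example:

```python
import torch.nn.functional as F

def reward_model_loss(rm, chosen_ids, rejected_ids):
    """Bradley-Terry pairwise loss: prefer the human-chosen response.

    `rm` is assumed to map token ids to a single scalar reward
    per sequence, shape (B,).
    """
    r_chosen = rm(chosen_ids)      # (B,)
    r_rejected = rm(rejected_ids)  # (B,)
    # -log sigmoid(r_chosen - r_rejected): maximize the margin
    # between the chosen and rejected responses.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```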

A variant, RLAIF, replaces human annotators with an AI judge (e.g., Constitutional AI) to reduce labeling cost.

Proximal Policy Optimization (PPO)

PPO treats the LLM as a policy mapping prompts (states) to token sequences (actions). It maximizes expected reward while clipping policy updates to ensure stability. Four models are maintained during training:

Policy Model – the trainable LLM.

Reference Model – a frozen copy used for KL regularization.

Reward Model – provides the scalar reward.

Value Model (Critic) – estimates the advantage.
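At its core, PPO maximizes a clipped surrogate objective over sampled tokens. A minimal sketch, assuming per‑token log‑probabilities and advantages are precomputed; the KL penalty against the reference model is often folded into the reward and is omitted here:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss over response tokens.

    logp_new / logp_old: per-token log-probs under the current
    and behavior policies; advantages: critic-based estimates.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (minimum) objective, negate for a loss.
    return -torch.min(unclipped, clipped).mean()
```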

Drawbacks: high memory consumption (four models) and sensitivity to hyper‑parameters.

Group Relative Policy Optimization (GRPO)

GRPO removes the critic by normalizing rewards within a sampled group of responses. For each prompt, 8–64 answers are sampled and their rewards are normalized by the group mean and standard deviation, so answers above the group mean receive a positive advantage and below‑mean answers a negative one. This reduces memory usage to 2–3 models and simplifies training.
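Concretely, the group‑relative advantage is just a z‑score of each answer's reward within its group; a minimal sketch:

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: (G,) rewards for G sampled answers to one prompt.

    Answers above the group mean get a positive advantage,
    answers below it a negative one; no learned critic is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```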

Reinforcement Learning with Verifiable Rewards (RLVR)

RLVR replaces learned reward models with deterministic, rule‑based validators, making it suitable for domains where answers can be objectively checked. Typical validators:

Mathematics – exact string match or Math‑Verify.

Code – sandbox execution with test cases.

Logical reasoning – rule‑based consistency checks.

Scientific questions – LLM‑based judges for answer equivalence.

Rewards combine an accuracy component (correctness) and a format component (e.g., enclosing output in <think>…</think><answer>…</answer> tags). RLVR mitigates reward hacking and lowers annotation cost.
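A verifiable reward for a math‑style task can be as simple as a format check plus an exact‑match accuracy check. A minimal sketch, with the tag template from above and the 0.1 format weight as illustrative assumptions:

```python
import re

def rlvr_reward(response: str, gold_answer: str) -> float:
    """Rule-based reward: format component + accuracy component."""
    m = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>",
                  response, flags=re.DOTALL)
    if m is None:
        return 0.0                      # malformed output: no reward
    format_reward = 0.1                 # followed the template
    answer = m.group(1).strip()
    accuracy_reward = 1.0 if answer == gold_answer.strip() else 0.0
    return format_reward + accuracy_reward
```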

Direct Preference Optimization (DPO) and Variants

DPO reformulates the RLHF objective as a binary classification loss over preferred vs. rejected responses, eliminating the need for a separate reward model or RL loop. Extensions improve stability and data handling:

SimPO – removes the reference log‑ratio for smoother gradients.

ORPO – optimizes in odds‑space to address class imbalance.

KTO – asymmetric loss based on prospect theory for high‑risk domains.

DPO methods are offline: they train on static pairwise preference data, offering simplicity and stability at the expense of online exploration benefits.
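For reference, the DPO loss is a logistic loss on the difference of policy‑vs‑reference log‑ratios between the chosen and rejected responses. A minimal sketch, assuming sequence‑level log‑probabilities have been precomputed:

```python
import torch.nn.functional as F

def dpo_loss(logp_pi_chosen, logp_pi_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """DPO: classify the preferred response via implicit rewards.

    Each argument is the summed log-prob of a full response under
    the policy (pi) or the frozen reference model (ref).
    """
    chosen_logratio = logp_pi_chosen - logp_ref_chosen
    rejected_logratio = logp_pi_rejected - logp_ref_rejected
    return -F.logsigmoid(
        beta * (chosen_logratio - rejected_logratio)
    ).mean()
```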

DeepSeek‑R1: Pure RL Training

DeepSeek‑R1 demonstrated that pure RL (GRPO + RLVR) without any SFT can yield strong reasoning abilities. Two training routes are reported:

R1‑Zero : Apply GRPO + RLVR directly on the pre‑trained DeepSeek‑V3 base model.

R1 (full) : Warm‑start with a small amount of SFT data before RL, improving formatting while preserving reasoning power.

During training the model spontaneously generates longer, reflective answers (“Aha moments”) and exhibits inference‑time scaling.

GRPO Improvements (DAPO, Dr‑GRPO)

The original GRPO suffered from entropy collapse (loss of output diversity). DAPO introduces four fixes; the first is sketched after this list:

Clip‑Higher – loosens the upper clipping bound for positive advantage.

Dynamic Sampling – discards prompts where all sampled answers are uniformly correct or incorrect.

Overlong Filtering – masks out truncated, overly long answers from the loss instead of penalizing them with negative reward.

Token‑level Loss – computes loss per token to avoid over‑weighting long sequences.
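Clip‑Higher, for instance, is a one‑line change to the PPO/GRPO clipping range: the upper bound is loosened so that low‑probability tokens with positive advantage can grow faster. A minimal sketch, using the asymmetric bounds reported in the DAPO paper (ε_low = 0.2, ε_high = 0.28):

```python
import torch

def dapo_clip_loss(logp_new, logp_old, advantages,
                   clip_low=0.2, clip_high=0.28):
    """Asymmetric clipping (Clip-Higher) from DAPO.

    clip_high > clip_low lets positively-advantaged,
    low-probability tokens increase more per update,
    which counteracts entropy collapse.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_low, 1 + clip_high)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```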

Dr‑GRPO removes the length‑normalization and reward‑standard‑deviation biases present in the original GRPO objective.

Typical Post‑training Pipeline

SFT cold‑start with high‑quality instruction and chain‑of‑thought data.

RL reasoning training (RLVR) using GRPO or DAPO on verifiable tasks.

Preference alignment with DPO or RLHF to polish style, safety, and usefulness.

Optional rejection sampling and distillation to compress reasoning ability into smaller models.

Emerging Directions (2025‑2026)

Agentic RL

Beyond single‑turn Q&A, Agentic RL trains models to interleave reasoning with tool use (search engines, calculators, code interpreters). Key challenges include credit assignment across multi‑step episodes, sparse reward signals, and resource competition between reasoning and tool execution.

Reward Model Evolution

Reward models are moving from scalar scorers to richer forms:

Process Reward Models – score each reasoning step.

Generative Reward Models – let an LLM act as a judge.

Multi‑objective Reward Models – balance accuracy, safety, and conciseness.

Synthetic Data Loops

Strong models generate candidate answers, deterministic verifiers filter out the correct ones, and the curated set fuels further SFT or RL warm‑up. This generate‑verify‑train cycle is becoming standard practice.
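A minimal sketch of one round of such a loop; the `generator.sample` and `verifier.check` interfaces are hypothetical placeholders, not a real library API:

```python
def generate_verify_train_round(generator, verifier, prompts, k=8):
    """One round of the generate-verify-train cycle.

    generator: strong model proposing k candidates per prompt
               (hypothetical interface).
    verifier:  deterministic checker, e.g. sandboxed test cases
               or exact-match grading (hypothetical interface).
    Returns verified (prompt, response) pairs for SFT or RL warm-up.
    """
    curated = []
    for prompt in prompts:
        candidates = generator.sample(prompt, n=k)
        for response in candidates:
            if verifier.check(prompt, response):  # keep only verified
                curated.append((prompt, response))
    return curated
```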

Key Takeaways

SFT establishes format and basic capability but cannot alone produce deep reasoning.

Online RL (PPO → GRPO → DAPO) is the primary driver of reasoning improvements.

Reward design (RLHF → RLAIF → RLVR) critically influences alignment quality and annotation cost.

Offline preference methods (DPO and its variants) offer stability; online RL provides exploration‑driven gains. In practice they are combined.

Agentic RL represents the next frontier, turning LLMs into autonomous agents capable of multi‑step tool‑augmented tasks.

References

https://arxiv.org/abs/2503.06072

https://medium.com/@fahey_james/dpo-isnt-enough-the-modern-post-training-stack-simpo-orpo-kto-and-bey

https://icml.cc/virtual/2025/poster/44492

https://arxiv.org/abs/2203.02155

https://arxiv.org/abs/2309.00267

https://arxiv.org/abs/1707.06347

https://magazine.sebastianraschka.com/p/the-state-of-llm-reasoning-model-training

https://arxiv.org/abs/2402.03300

https://www.emergentmind.com/topics/reinforcement-learning-with-verified-rewards-rlvr

https://arxiv.org/abs/2501.12948

https://arxiv.org/abs/2305.18290

https://arxiv.org/abs/2503.14476

https://cameronrwolfe.substack.com/p/grpo-tricks

https://arxiv.org/abs/2602.00994

https://huggingface.co/blog/LinkedIn/gpt-oss-agentic-rl
