Why Post‑Training Is Redefining LLMs: DPO vs PPO, Synthetic Data, and Scaling Strategies
This article analyzes recent post‑training trends in large language models, comparing DPO and PPO, examining the scarcity of open‑source preference data, the iterative training process, the rise of synthetic data pipelines, and emerging methods for improving math and reasoning capabilities.
DPO vs PPO
The 93‑page Llama 3.1 technical report devotes far more space to post‑training than its predecessor, yet it never thoroughly explains the shift from the PPO‑based pipeline of Llama 2 to the SFT‑plus‑DPO recipe used in Llama 3. The authors note that PPO demands substantially more compute at this scale, whereas DPO is simpler to run and performed better on instruction‑following benchmarks such as IFEval. Although academic results suggest PPO can reach a higher ceiling, large‑scale open‑source PPO implementations are rare; most open models (e.g., Qwen, Llama) rely on DPO or its online variants, while closed‑source labs likely still run mature, legacy PPO infrastructure.
Practitioners favor DPO for its practicality and lower resource consumption: the reference model's log‑probabilities can be precomputed and cached, so training a 405B‑parameter model needs only the policy in GPU memory, whereas PPO keeps four models (actor, critic, reward model, and reference model) alive at once.
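For concreteness, here is a minimal sketch of the standard DPO loss under that setup, assuming the reference model's per‑sequence log‑probabilities were computed offline and cached; the tensor names and the beta value are illustrative rather than taken from the Llama 3 recipe.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over a batch of preference pairs.

    All arguments are summed per-sequence log-probabilities of shape
    (batch,). The ref_* tensors can be precomputed once and cached, so
    only the policy model has to stay in memory during training.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * log-ratio margin), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```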
Preference Data
Open‑source preference datasets are scarce, often treated more like free‑use software than truly open data, because collecting high‑quality human preferences is costly. Llama 3 largely relies on synthetic data, raising the question of whether human‑annotated preferences still add value, or if LLM‑as‑a‑judge and reward‑model outputs could replace them.
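To make the LLM‑as‑a‑judge alternative concrete, here is a rough sketch of pairwise preference labeling; the prompt template and the `call_judge` callable are placeholders for whatever judge model and inference client are available, not anything described in the Llama 3 report.

```python
import random

JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Reply with exactly "A" or "B" for the better answer."""

def label_preference(question, resp_1, resp_2, call_judge):
    """Return (chosen, rejected) using an LLM judge.

    call_judge maps a prompt string to the judge's text output
    (hypothetical; plug in your own inference client). Randomizing
    the A/B order reduces position bias.
    """
    if random.random() < 0.5:
        a, b, flipped = resp_1, resp_2, False
    else:
        a, b, flipped = resp_2, resp_1, True
    verdict = call_judge(JUDGE_PROMPT.format(
        question=question, answer_a=a, answer_b=b)).strip().upper()
    prefers_a = verdict.startswith("A")
    chosen_is_1 = prefers_a ^ flipped
    return (resp_1, resp_2) if chosen_is_1 else (resp_2, resp_1)
```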
Iteration
Since Llama 2, RLHF has become an iterative process: Llama 2 and Nemotron went through five training rounds, and Llama 3 used six. Two main reasons drive this iteration: (1) preference data arrives in batches from external annotation providers, so training proceeds incrementally as new data lands and engineering adjustments are folded in between rounds; (2) refreshing the reward model each round limits reward hacking, since one long run against a fixed reward model tends to over‑optimize it.
Open questions remain about the upper bound of iteration cycles and whether scaling can be achieved by injecting preference data during pre‑training, a concept similar to “mid‑training” proposed by OpenAI.
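A schematic of such a round‑based loop, with the data‑collection, reward‑model, and optimization stages passed in as callables; these stand in for whole subsystems, and the default round count is only the figure reported for Llama 3.

```python
def iterative_post_training(policy, collect_prefs, fit_reward_model,
                            optimize, num_rounds=6):
    """Round-based post-training in the spirit of Llama 2/3 and Nemotron.

    collect_prefs(policy)       -> preference batch sampled from the
                                   *current* policy's outputs
    fit_reward_model(prefs)     -> reward model refit on fresh data,
                                   which limits reward hacking
    optimize(policy, rm, prefs) -> policy after a bounded DPO/PPO run
    """
    for _ in range(num_rounds):
        prefs = collect_prefs(policy)
        reward_model = fit_reward_model(prefs)
        # Keeping each round short avoids over-optimizing a reward model
        # that is about to go stale.
        policy = optimize(policy, reward_model, prefs)
    return policy
```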
Synthetic Data
The most striking aspect of the Llama 3 report is its extensive synthetic‑data pipeline, which manages data collection, generation, and filtering across multiple domains; pipelines like this are likely to become a competitive moat. The article also pushes back on a recent Nature cover paper on model collapse, arguing its assumptions are unrealistic: old data is discarded entirely, the dataset size stays constant, and there is no external feedback or filtering. Recent research suggests collapse can be mitigated once those assumptions are relaxed.
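At its core, such a pipeline is a generate‑then‑filter loop (rejection sampling against a reward model or verifier). The sketch below is a minimal version of that pattern; the `generate` and `score` callables, sample counts, and threshold are illustrative, not details from the report.

```python
def synthesize_sft_examples(prompts, generate, score,
                            n_samples=8, keep_top_k=1, min_score=0.0):
    """Generate candidate responses per prompt and keep only the best.

    generate(prompt, n)       -> list of n candidate responses
    score(prompt, response)   -> float quality score (reward model,
                                 verifier, or execution feedback)
    """
    dataset = []
    for prompt in prompts:
        candidates = generate(prompt, n_samples)
        scored = sorted(((score(prompt, c), c) for c in candidates),
                        reverse=True)
        for s, c in scored[:keep_top_k]:
            if s >= min_score:  # drop prompts where every sample is weak
                dataset.append({"prompt": prompt, "response": c, "score": s})
    return dataset
```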
Math and Reasoning
Recent work focuses on enhancing mathematical and reasoning abilities during post‑training by propagating reward signals throughout the reasoning process or decomposing sparse rewards into dense ones. Three main approaches are identified:
1. Generate data with Monte‑Carlo Tree Search (MCTS) guided by a value/reward model, then fine‑tune the LLM with step‑wise DPO.
2. Train a high‑quality Process Reward Model (PRM) and use it to optimize the policy with critic‑free PPO alternatives such as GRPO or MDPO (a minimal group‑relative advantage sketch follows this list).
3. Leverage formalized problems (e.g., Lean‑STaR) and learn from the feedback the proof checker provides.
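As a rough illustration of the critic‑free idea behind GRPO, the snippet below computes group‑relative advantages by standardizing reward scores within the group of completions sampled for the same prompt; the shapes and epsilon are illustrative, and the clipping and KL terms of the full objective are omitted.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages without a learned critic.

    rewards: tensor of shape (num_prompts, group_size) holding one scalar
    reward per sampled completion. Each completion's advantage is its
    reward standardized against the other samples for the same prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```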
These methods aim to move beyond the traditional KL‑constrained bandit learning of RLHF, introducing planning‑like components that improve performance on benchmarks. Repeated sampling and LLM‑plus‑search techniques have shown substantial gains in correctness for reasoning tasks.
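In its simplest form, repeated sampling is best‑of‑N selection with a verifier, i.e., the test‑time analogue of the filtering pipeline sketched above; `generate` and `verify` are again placeholder callables rather than APIs from any specific paper.

```python
def best_of_n(problem, generate, verify, n=64):
    """Sample n candidate solutions and return the highest-scoring one.

    generate(problem)          -> one candidate solution string
    verify(problem, solution)  -> float score (reward model, unit tests,
                                  or an exact-match checker for math)
    Accuracy typically rises with n, at the cost of n forward passes.
    """
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda sol: verify(problem, sol))
```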
Post‑training Role and Outlook
Initially, many companies questioned the value of RLHF, finding that supervised fine‑tuning (SFT) often sufficed. Over time, with the emergence of DPO and community‑driven insights, RLHF has gained acceptance as a crucial step for boosting leaderboard scores, customizing chat styles, and enabling smaller models to compete with larger ones.
The future focus will be on building robust pipelines for generating preference and synthetic data, as illustrated by recent architectures. Successful pipelines will integrate reward models throughout data creation, making them essential components of LLM development.
Additionally, as edge‑side small models become more important, distillation from large‑model synthetic data will be a key research direction, potentially spurring new studies on fine‑tuning small models.
Finally, the article suggests that true reinforcement learning may find its niche in LLM agents operating in well‑defined environments (e.g., WebArena, WebShop), where explicit reward functions can be specified.
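To make "explicit reward function" concrete, here is a bare‑bones episode‑collection loop of the kind such environments allow; the reset/step interface follows the usual RL convention and `agent_act` stands in for an LLM policy, neither copied from WebArena's or WebShop's actual APIs.

```python
def collect_episode(env, agent_act, max_steps=30):
    """Roll out one episode and return (trajectory, total_reward).

    env.reset()        -> observation (e.g., page text / DOM snapshot)
    env.step(action)   -> (observation, reward, done); reward is explicit,
                          e.g., task success in a web-shopping environment
    agent_act(obs, history) -> action string proposed by the LLM
    """
    obs, history, trajectory, total_reward = env.reset(), [], [], 0.0
    for _ in range(max_steps):
        action = agent_act(obs, history)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        history.append(action)
        total_reward += reward
        obs = next_obs
        if done:
            break
    return trajectory, total_reward
```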
