Why LLM RL Training Crashes While SFT Stays Stable: Insights & Tricks
The article examines the fundamental similarity between SFT and RL loss functions for large language models, explains why RL training is prone to instability, discusses infrastructure and data quality challenges, and reviews practical tricks and reward‑model considerations for more reliable RL fine‑tuning.
SFT vs. RL Loss Functions
Both Supervised Fine‑Tuning (SFT) and Reinforcement Learning (RL) for large language models ultimately optimize a cross‑entropy‑like loss. In SFT the target distribution is a one‑hot vector derived from the reference answer, while RL replaces SFT's constant advantage of 1 with a policy‑dependent advantage A_t(\pi) estimated from rollouts. When the advantage is positive, the gradient pushes the selected token's logit upward and all other logits downward; a negative advantage reverses the direction.
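A minimal PyTorch sketch of this unified view (not from the article; shapes and values are illustrative): the same per‑token loss, minus advantage times token log‑probability, reduces to SFT cross‑entropy when the advantage is fixed at 1 and becomes the REINFORCE‑style RL objective when the advantage is estimated.

```python
import torch
import torch.nn.functional as F

def token_loss(logits, target_ids, advantages):
    # -A_t * log pi(target_t): with A_t = 1 this is exactly the SFT
    # cross-entropy; with an estimated A_t it is the RL objective.
    logprobs = F.log_softmax(logits, dim=-1)                         # [T, V]
    picked = logprobs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # [T]
    return -(advantages * picked).mean()

T, V = 8, 32000                          # sequence length, vocab size (dummy)
logits = torch.randn(T, V, requires_grad=True)
targets = torch.randint(0, V, (T,))

sft_loss = token_loss(logits, targets, torch.ones(T))   # SFT: A_t = 1
rl_loss = token_loss(logits, targets, torch.randn(T))   # RL: estimated A_t
```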
Why RL Is Less Stable Than SFT
RL demands more sophisticated infrastructure than SFT: bugs in the training and inference stack (e.g., Megatron, vLLM, sglang) around temperature, top‑p/top‑k handling, reward shaping, or rollout token processing can crash training or silently corrupt it. Moreover, RL data pipelines rarely receive the rigorous cleaning and human review that SFT pipelines enjoy, leading to noisy reward signals and unstable optimization.
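To make the infrastructure pitfall concrete, here is a hypothetical sanity check (function name and tolerance are assumptions, not from any named framework) that compares per‑token logprobs recorded by the rollout engine against logprobs recomputed by the trainer on the same tokens; large gaps usually point to mismatched sampling parameters or tokenization differences between the two stacks.

```python
import torch

def check_rollout_consistency(trainer_logprobs: torch.Tensor,
                              rollout_logprobs: torch.Tensor,
                              atol: float = 0.05) -> None:
    # Both tensors hold per-token logprobs for the same rollout tokens.
    gap = (trainer_logprobs - rollout_logprobs).abs()
    bad = gap > atol
    if bad.any():
        idx = bad.nonzero().flatten().tolist()
        raise RuntimeError(
            f"logprob mismatch at token positions {idx[:10]} "
            f"(max gap {gap.max().item():.4f}); check temperature/top-p/top-k "
            f"settings and tokenization in the rollout engine vs. the trainer"
        )
```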
Common RL Tricks for Stabilization
Entropy collapse: whether countering it with an explicit entropy loss term helps is still debated.
CLIP‑based regularization: many works explore this, but practical gains are limited.
Token masking: treat high‑entropy and low‑entropy tokens differently (a sketch combining this with clipping and an entropy bonus appears after this list).
Reward shaping: keep the proportion of 0/1 rewards within a range.
Use pass@K instead of pass@1 as the optimization target.
Reward based on test‑case pass rate.
Length penalty.
Training‑inference consistency tricks (e.g., TIS, ICEPPop) that are more about infra than algorithmic novelty.
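A minimal PyTorch sketch (names, thresholds, and coefficients are assumed, not the article's recipe) combining three of the tricks above: the PPO‑style clipped surrogate, an entropy bonus against collapse, and an entropy‑based token mask that concentrates the update on high‑entropy "decision" tokens.

```python
import torch

def clipped_loss_with_masking(logp_new, logp_old, advantages, entropy,
                              clip_eps=0.2, ent_coef=0.01, ent_cutoff=0.7):
    # PPO-style clipped surrogate on per-token importance ratios.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    surrogate = -torch.minimum(unclipped, clipped)

    # Token masking: keep only tokens whose entropy exceeds a threshold,
    # i.e., treat high- and low-entropy tokens differently.
    mask = (entropy > ent_cutoff).float()
    policy_loss = (surrogate * mask).sum() / mask.sum().clamp(min=1)

    # Entropy bonus: push back against entropy collapse on the kept tokens.
    return policy_loss - ent_coef * (entropy * mask).mean()
```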
The author cautions against over‑using tricks whose effects are not well understood, such as entropy or KL losses, because they may mask the underlying data‑distribution problems.
Data Quality Challenges in RL
High‑quality RL data is scarce. Without thorough cleaning, it is hard for practitioners to distinguish genuinely hard examples from mislabeled ones. Examples include ambiguous numeric answers (e.g., "9 ¥ can buy 4.5 tickets" — should the graded answer be 4.5, or the 4 whole tickets one can actually buy?), equations with multiple valid solutions, and subtle formatting mismatches that cause reward models to assign zero scores.
Reward models must be strong enough to understand problem statements, recognize equivalent answer forms, and follow instructions; otherwise they penalize correct but differently formatted outputs.
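A rule‑based sketch of such equivalence checking (a hypothetical verifier using sympy, not the reward model the article has in mind): two answers count as equal if they parse to the same symbolic value, so "0.5" and "1/2" match even though they are formatted differently.

```python
import sympy

def answers_equivalent(pred: str, ref: str) -> bool:
    # Parse both answers symbolically and compare their difference to zero;
    # fall back to exact string comparison when parsing fails. (LaTeX
    # stripping, units, etc. are omitted from this sketch.)
    try:
        return sympy.simplify(sympy.sympify(pred) - sympy.sympify(ref)) == 0
    except (sympy.SympifyError, TypeError, SyntaxError):
        return pred.strip() == ref.strip()

assert answers_equivalent("0.5", "1/2")      # equivalent forms match
assert not answers_equivalent("4.5", "4")    # genuinely different answers
```

A verifier without this kind of normalization exhibits exactly the failure mode described above: correct but differently formatted outputs receive a reward of zero.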
In RL the advantage A_t(\pi) is a function of the policy \pi, not a fixed constant; it has no closed form to differentiate analytically and must be estimated by sampling rollouts, which introduces variance and instability. By contrast, SFT treats the advantage as the constant 1.
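One common way to estimate the advantage by sampling is group‑relative scoring in the GRPO style; the article names no particular estimator, so the sketch below is an assumption that simply makes the variance source concrete: with only K rollouts per prompt, the baseline and the resulting advantages are themselves noisy.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    # rewards: [num_prompts, K] scores for K sampled rollouts per prompt.
    # Each rollout is scored against its own group's mean and std, so the
    # advantage estimate depends on which K samples the policy happened
    # to produce - the sampling variance the text warns about.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0]])   # K=4 rollouts, 0/1 reward
print(group_relative_advantages(rewards))
```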
Overall, the piece argues that any training trick ultimately changes the data distribution seen by the model. Analyzing rollout distribution shifts should precede adding stabilizing tricks, and when a trick is useful it can reveal which distributions help or hurt learning.
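As one concrete (assumed, not the article's prescribed recipe) way to do that analysis first: track a per‑batch KL estimate between the current policy and the policy that generated the rollouts, and inspect the data whenever it spikes before reaching for entropy or KL losses.

```python
import torch

def rollout_kl(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    # Tokens were sampled from the old (rollout) policy, so the k3
    # estimator below approximates KL(pi_old || pi_new) per batch.
    log_ratio = logp_new - logp_old
    return (log_ratio.exp() - 1 - log_ratio).mean()
```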
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.