Why LLM RL Training Crashes While SFT Stays Stable: Insights & Tricks

The article examines the fundamental similarity between SFT and RL loss functions for large language models, explains why RL training is prone to instability, discusses infrastructure and data quality challenges, and reviews practical tricks and reward‑model considerations for more reliable RL fine‑tuning.

Baobao Algorithm Notes

SFT vs. RL Loss Functions

Both Supervised Fine‑Tuning (SFT) and Reinforcement Learning (RL) for large language models ultimately optimize a cross‑entropy‑like loss. In SFT the target distribution is a one‑hot vector derived from a reference answer, while RL replaces the constant advantage of 1 with a learned advantage A_t(\pi). When the advantage is positive, the gradient pushes the selected token's logit upward and all other logits downward; a negative advantage reverses this direction.
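This unification can be made concrete in a few lines. The sketch below (illustrative only; `token_loss` is a hypothetical helper, not from the article) computes the per-token loss -A_t * log pi(token_t) and checks that setting every advantage to 1 recovers plain SFT cross-entropy:

```python
import torch
import torch.nn.functional as F

def token_loss(logits, tokens, advantages):
    """Per-token loss: -A_t * log pi(token_t).
    SFT is the special case where advantages = 1 for every reference token;
    RL supplies a learned, possibly negative A_t instead."""
    logp = F.log_softmax(logits, dim=-1)                         # (T, V)
    chosen = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)   # (T,)
    return -(advantages * chosen).mean()

T, V = 4, 10
logits = torch.randn(T, V)
tokens = torch.randint(0, V, (T,))

# With A_t = 1 this is exactly the SFT cross-entropy loss.
sft = token_loss(logits, tokens, torch.ones(T))
assert torch.allclose(sft, F.cross_entropy(logits, tokens))
```

A negative advantage flips the sign of the gradient, pushing the selected token's logit down instead of up, which matches the direction argument above.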

Why RL Is Less Stable Than SFT

RL demands more sophisticated infrastructure: bugs in training and rollout frameworks (e.g., Megatron, vLLM, SGLang) related to temperature, top‑p/k handling, reward shaping, or rollout token processing can silently corrupt training or cause outright crashes. Moreover, RL data pipelines rarely receive the rigorous cleaning and human review that SFT pipelines enjoy, leading to noisy reward signals and unstable optimization.
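To see why temperature and top‑p handling is a common bug surface, here is a minimal reference implementation of the two transforms (a sketch in plain Python, not any specific engine's code). A rollout engine that applies them in a different order, or forgets to renormalize after filtering, produces samples from a distribution the trainer never sees:

```python
import math

def sample_dist(logits, temperature=1.0, top_p=1.0):
    """Reference temperature + nucleus (top-p) filtering:
    1. divide logits by temperature, softmax to probabilities;
    2. keep the smallest set of tokens whose mass reaches top_p;
    3. renormalize the kept probabilities."""
    exps = [math.exp(l / temperature) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

p = sample_dist([2.0, 1.0, 0.1], temperature=0.7, top_p=0.9)
assert abs(sum(p) - 1.0) < 1e-9
assert p[2] == 0.0   # lowest-probability token falls outside the nucleus
```

Diffing a reference like this against an engine's actual sampled-token log-probabilities is one cheap way to catch the mismatches described above.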

Common RL Tricks for Stabilization

Entropy collapse: whether to counteract it by adding an entropy loss term remains debated.

CLIP‑based regularization: many works explore this but practical gains are limited.

Token masking: treat high‑entropy or low‑entropy tokens differently.

Reward shaping: keep the proportion of 0/1 rewards within a range.

Use pass@K instead of pass@1 as the optimization target.

Reward based on test‑case pass rate.

Length penalty.

Training‑inference consistency tricks (e.g., TIS, IcePop) that are more about infrastructure than algorithmic novelty.
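The consistency tricks in the last item share one core idea: reweight or mask tokens whose probability under the training engine diverges from their probability under the rollout engine. A minimal sketch of a truncated importance weight in that spirit (illustrative; actual TIS/IcePop implementations differ in detail):

```python
import math

def tis_weight(train_logp: float, infer_logp: float, clip: float = 2.0) -> float:
    """Truncated importance weight w = min(pi_train / pi_infer, clip).
    Tokens the two engines agree on get weight ~1; tokens with a large
    train/inference probability mismatch contribute a bounded gradient
    instead of an exploding one."""
    return min(math.exp(train_logp - infer_logp), clip)

assert tis_weight(-1.0, -1.0) == 1.0   # engines agree -> weight 1
assert tis_weight(-2.0, -3.0) == 2.0   # ratio exp(1) = 2.72 truncated at 2.0
```

Masking variants simply zero out, rather than truncate, tokens beyond the threshold.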

The author cautions against over‑using tricks whose effects are not well understood, such as entropy or KL losses, because they may mask the underlying data‑distribution problems.

Data Quality Challenges in RL

High‑quality RL data is scarce. Without thorough cleaning, it is hard for practitioners to distinguish hard examples from mislabeled ones. Examples include ambiguous numeric answers (e.g., "9 ¥ can buy 4.5 tickets"), multiple valid solutions for equations, and subtle formatting mismatches that cause reward models to assign zero scores.

Reward models must be strong enough to understand problem statements, recognize equivalent answer forms, and follow instructions; otherwise they penalize correct but differently formatted outputs.
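Even rule-based rewards benefit from a normalization pass before comparison. The sketch below (a hypothetical helper, not the article's reward pipeline) canonicalizes numeric answers so that "1/2", "0.5", and " 0.5. " all score as equal, illustrating the formatting mismatches described above; real equivalence checking (units, symbolic forms) needs much more:

```python
from fractions import Fraction

def normalize_answer(ans: str):
    """Reduce superficially different answer strings to a canonical form
    before reward comparison, so a correct answer is not scored zero over
    whitespace, a trailing period, or decimal-vs-fraction formatting."""
    s = ans.strip().rstrip(".").replace(" ", "").lstrip("$")
    try:
        return Fraction(s)   # "0.5", "1/2", "2" all become exact rationals
    except ValueError:
        return s.lower()     # fall back to case-insensitive string match

assert normalize_answer("1/2") == normalize_answer("0.5")
assert normalize_answer(" 42. ") == normalize_answer("42")
```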

In RL, the advantage A_t is a function of the policy \pi rather than a fixed constant; it cannot be evaluated in closed form and must be estimated by sampling, which introduces variance and instability. By contrast, SFT fixes the advantage at a constant 1.
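One widely used sampled estimator, shown here as an illustrative sketch (GRPO-style group normalization; the article does not prescribe a specific estimator), draws K rollouts of the same prompt and normalizes their rewards within the group. With small K the estimate is noisy, which is exactly the variance problem noted above:

```python
import statistics

def group_advantages(rewards):
    """Monte Carlo advantage estimate from K sampled rollouts of one
    prompt: A_i = (r_i - mean) / std. Few samples -> high-variance
    estimates, one source of RL's instability relative to SFT."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # all-equal rewards: avoid /0
    return [(r - mu) / sigma for r in rewards]

adv = group_advantages([1.0, 0.0, 0.0, 1.0])
assert abs(sum(adv)) < 1e-9   # group-normalized advantages center at zero
```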

Overall, the piece argues that any training trick ultimately changes the data distribution seen by the model. Analyzing rollout distribution shifts should precede adding stabilizing tricks, and when a trick is useful it can reveal which distributions help or hurt learning.

Tags: AI, LLM, reinforcement learning, SFT, reward modeling, training stability
Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
