SFT Scores Don’t Predict RL Potential: Adaptive Early‑Stop Loss for LLMs
The authors show that high SFT accuracy does not guarantee strong RL performance, because over-fitting during SFT erodes output diversity. They propose Adaptive Early-Stop Loss (AESL), a diversity-aware early-stopping objective that dynamically weights token- and subsequence-level losses, yielding consistently better RL results across multiple LLMs and math benchmarks.
Background
Since 2025, reinforcement learning (RL) has become the default post‑training paradigm for large language models (LLMs). Researchers have demonstrated that RL can unlock complex reasoning and long chain‑of‑thought (Long‑CoT) abilities without massive human‑annotated data, even achieving super‑human performance on certain tasks.
Problem with Standard SFT Cold‑Start
In practice, running RL directly on a base model often leads to aimless exploration because the model lacks directional guidance. The common remedy is lightweight supervised fine-tuning (SFT) on a small, high-quality dataset as a "cold start" before RL. This raises a critical question: how long should that SFT phase run? The authors observe that the checkpoint with the highest SFT validation accuracy is not the checkpoint with the greatest RL potential.
They attribute this mismatch to a fundamental divergence between the goals of ordinary SFT (maximizing accuracy) and SFT intended as an RL cold‑start (preserving diversity). Over‑optimizing on a limited dataset causes over‑fitting, turning the model into a “memorizer” that loses the broad knowledge distribution and generative diversity needed for effective exploration during RL.
Diversity‑Based Early‑Stopping Insight
Tracking entropy and self‑BLEU during SFT reveals a “golden turning point”: early in training, the model retains high diversity while learning new reasoning formats; later, diversity collapses as the model overfits. This turning point aligns with the peak RL potential.
Therefore, accuracy alone should not dictate when to stop SFT. Instead, the authors propose monitoring output diversity and stopping when diversity begins to decline.
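A minimal sketch of such a monitor, assuming completions are periodically sampled from checkpoints on a held-out prompt set; the metric implementations and the stopping rule below are our illustration, not the paper's code:

```python
# Illustrative diversity monitor for the SFT "golden turning point".
# Hypothetical sketch: the sampling setup and stopping rule are assumptions.
import math
from collections import Counter

def mean_token_entropy(logprob_rows):
    """Average per-token entropy (nats) over sampled completions.

    logprob_rows: list of per-token distributions, each a list of
    log-probabilities over the vocabulary (or a top-k slice).
    """
    total = sum(-sum(math.exp(lp) * lp for lp in row) for row in logprob_rows)
    return total / max(len(logprob_rows), 1)

def self_bleu(samples, n=4):
    """Crude Self-BLEU: how much each sample's n-grams overlap with the
    other samples'. Higher means less diverse. samples: lists of tokens."""
    def ngrams(tokens, k):
        return Counter(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    scores = []
    for i, hyp in enumerate(samples):
        refs = [s for j, s in enumerate(samples) if j != i]
        precs = []
        for k in range(1, n + 1):
            hyp_ng = ngrams(hyp, k)
            if not hyp_ng:
                continue
            ref_ng = Counter()
            for r in refs:
                ref_ng |= ngrams(r, k)  # clipped counts against the union
            overlap = sum(min(c, ref_ng[g]) for g, c in hyp_ng.items())
            precs.append(overlap / sum(hyp_ng.values()))
        if precs:  # geometric mean of n-gram precisions
            scores.append(math.exp(sum(math.log(p + 1e-9) for p in precs) / len(precs)))
    return sum(scores) / max(len(scores), 1)

def diversity_declining(entropy_history, window=3):
    """Stop signal: entropy has dropped monotonically over `window` evals."""
    if len(entropy_history) < window + 1:
        return False
    recent = entropy_history[-(window + 1):]
    return all(a > b for a, b in zip(recent, recent[1:]))
```

Logged over training, a sustained drop in mean entropy together with a rise in Self-BLEU marks the turning point at which to stop.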
Adaptive Early‑Stop Loss (AESL)
To implement diversity‑aware early stopping, the authors introduce Adaptive Early‑Stop Loss (AESL), a lightweight training objective that replaces the standard cross‑entropy loss during the cold‑start phase.
AESL dynamically adjusts the loss weight for each token based on its current prediction confidence: if the model is already confident about a token, its loss weight is reduced, preventing over‑fitting on that token. At the subsequence level, AESL computes the average confidence of the generated prefix; when the prefix aligns well with the target distribution, the loss for subsequent tokens is relaxed, encouraging exploration.
Formally, AESL defines a per-token adaptive weight w_t that scales the cross-entropy term for token t, together with a subsequence-level weight g_t derived from the average confidence of the prefix; the exact form appears in the paper. One plausible instantiation is sketched below.
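This summary does not reproduce the paper's equation, so the following is an assumed instantiation rather than the authors' definition: a focal-loss-style token weight combined with a prefix-confidence gate, where γ (sharpness), τ (confidence threshold), and λ (relaxation factor) are hypothetical hyperparameters.

```latex
% Illustrative AESL form; \gamma, \tau, \lambda are assumed hyperparameters,
% not values taken from the paper.
\mathcal{L}_{\mathrm{AESL}}
  = -\sum_{t=1}^{T} w_t \, g_t \, \log p_\theta(y_t \mid y_{<t}, x),
\qquad
w_t = \bigl(1 - p_\theta(y_t \mid y_{<t}, x)\bigr)^{\gamma}

\bar{p}_{<t} = \frac{1}{t-1} \sum_{s=1}^{t-1} p_\theta(y_s \mid y_{<s}, x),
\qquad
g_t =
\begin{cases}
  1,       & \bar{p}_{<t} < \tau \\
  \lambda, & \bar{p}_{<t} \ge \tau
\end{cases}
\quad (0 < \lambda < 1)
```

Under this form, a confidently predicted token contributes little gradient, and once the running prefix confidence crosses τ, the remaining tokens are trained only lightly, matching the paper's description of "relaxing" later tokens.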
By modulating learning intensity at both token and subsequence granularity, AESL embodies a “personalized teaching” philosophy: the model is not forced to perfectly fit every demonstration, but is guided to retain its innate knowledge and exploratory capacity.
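As a concrete sketch, the illustrative form above fits in a few lines of PyTorch. The function below is our assumption-laden rendering, not the paper's released code; gamma, tau, and lam mirror the hypothetical hyperparameters γ, τ, and λ.

```python
# Illustrative AESL loss (PyTorch); follows the hypothetical form sketched
# above, not the paper's released implementation.
import torch
import torch.nn.functional as F

def aesl_loss(logits, targets, gamma=2.0, tau=0.9, lam=0.3, ignore_index=-100):
    """logits: (B, T, V) next-token logits; targets: (B, T) token ids."""
    log_probs = F.log_softmax(logits, dim=-1)
    mask = targets != ignore_index
    tok_logp = log_probs.gather(
        -1, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)  # (B, T)
    tok_p = tok_logp.exp()

    # Token-level weight: down-weight tokens the model already predicts
    # confidently (focal-loss-style), so it stops over-fitting them.
    w = (1.0 - tok_p).pow(gamma)

    # Subsequence-level weight: mean confidence over the strict prefix s < t;
    # once the prefix is well learned (>= tau), relax later tokens by lam.
    p_masked = tok_p * mask
    mask_l = mask.long()
    prefix_sum = p_masked.cumsum(dim=1) - p_masked
    prefix_cnt = (mask_l.cumsum(dim=1) - mask_l).clamp(min=1)
    prefix_conf = prefix_sum / prefix_cnt
    g = torch.where(prefix_conf >= tau,
                    torch.full_like(tok_p, lam),
                    torch.ones_like(tok_p))

    nll = -(w * g * tok_logp)
    return (nll * mask).sum() / mask.sum().clamp(min=1)
```

Masked positions (ignore_index) are excluded from both the loss and the prefix-confidence average, so padding and prompt tokens never trigger the relaxation.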
Experimental Validation
The team evaluated AESL on challenging mathematical reasoning benchmarks (AIME 24/25, AMC 23, MATH‑500) using three base models: Qwen2.5‑7B‑Instruct, Qwen2.5‑Math‑7B, and Llama‑3.1‑8B‑Instruct. Across all models, the AESL‑initialized cold‑start followed by RL consistently outperformed three baselines: direct RL without SFT, standard CE‑loss SFT, and other state‑of‑the‑art methods.
Results showed that AESL + RL achieved the highest average scores on every benchmark, demonstrating that preserving diversity during the cold start yields superior RL potential.
Further ablations examined varying data sizes and difficulty levels. AESL remained robust, delivering better RL potential regardless of the amount or hardness of the cold‑start data.
Conclusion
The study warns that loss of output diversity can occur before RL even begins, making diversity preservation essential throughout post‑training. AESL not only provides a new loss function but also reshapes our understanding of the SFT‑to‑RL pipeline, showing that “maintaining diversity” outweighs “perfect imitation” for long‑term RL success.
Future work is expected to further explore the fundamental differences between SFT and RL paradigms, with AESL offering a strong starting point for such investigations.