Why the Best SFT Checkpoint May Hurt RL Performance: Adaptive Early‑Stop Loss (AESL) for LLM Cold‑Start
The paper reveals that over‑optimizing supervised fine‑tuning (SFT) for large language models can diminish their reinforcement‑learning (RL) potential, proposes an Adaptive Early‑Stop Loss (AESL) that balances accuracy and output diversity during cold‑start, and demonstrates across multiple LLMs that AESL consistently yields superior RL results.
Background
Since 2025, reinforcement learning (RL) has become the default post‑training paradigm for large language models (LLMs). RL can unlock complex reasoning and long chain‑of‑thought (Long‑CoT) abilities without massive amounts of human‑annotated data.
Problem
Directly applying RL to a vanilla base model leads to aimless exploration because the model lacks directional guidance. The standard remedy is a lightweight supervised fine‑tuning (SFT) “cold‑start” with a small high‑quality dataset before RL.
Fatal Trap in SFT Cold‑Start
A study accepted at ICLR 2026 (HKUST, Alibaba, Xiamen University) discovered that the checkpoint with the highest validation accuracy after SFT does **not** correspond to the greatest RL potential. In many cases, the best‑performing SFT checkpoint yields poorer RL results, and sometimes even outright regressions.
Root Cause Analysis
Limited data size: Over‑optimizing on a small SFT dataset causes over‑fitting, turning the model into a memorizer rather than a generalizer.
Exploration‑exploitation imbalance: Excessive SFT reduces output diversity, shrinking the exploration space needed for successful RL.
Diversity as Early‑Stop Signal
Tracking entropy and self‑BLEU during SFT reveals a “golden turning point”: early in SFT the model retains high diversity while learning new reasoning patterns; later diversity collapses as over‑fitting intensifies. This diversity peak aligns with the highest RL potential.
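To make these signals concrete, here is a minimal sketch, assuming a Hugging Face‑style causal LM and a set of sampled completions per prompt, of how one might log both metrics per checkpoint. This is not the authors' code, and nltk's BLEU is just one of several ways to compute self‑BLEU.

```python
# Minimal sketch (not the paper's code) of the two diversity signals:
# mean next-token entropy and self-BLEU over sampled completions.
# Assumes torch and nltk are installed; `model` is a Hugging Face-style
# causal LM whose forward pass returns `.logits`.
import torch
import torch.nn.functional as F
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

@torch.no_grad()
def mean_token_entropy(model, input_ids, attention_mask):
    """Average next-token entropy (in nats) over non-padding positions."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq)
    return (entropy * attention_mask).sum() / attention_mask.sum()

def self_bleu(samples):
    """Mean BLEU of each completion against the others (needs >= 2 samples).
    Higher self-BLEU = more similar samples = lower diversity."""
    smooth = SmoothingFunction().method1
    tokenized = [s.split() for s in samples]
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]
        scores.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
    return sum(scores) / len(scores)
```

Logged per checkpoint, rising self‑BLEU together with falling entropy would flag the turning point the authors describe as the place to stop SFT.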
Adaptive Early‑Stop Loss (AESL)
To obtain a more nuanced cold‑start, the authors replace the standard cross‑entropy loss with a lightweight Adaptive Early‑Stop Loss (AESL). AESL dynamically adjusts the learning weight at two granular levels:
Token‑level control: When the model's predicted probability for a token is already high, AESL reduces that token's loss weight, preventing over‑fitting on easy tokens.
Subsequence‑level control: AESL computes the average confidence of the generated prefix; if the prefix already matches the target distribution, later tokens receive a relaxed loss, encouraging exploration beyond the memorized pattern.
The mathematical formulation of AESL and its adaptive weights are given in the paper (Figures 2 and 3).
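The paper has the exact loss; the sketch below is only one plausible instantiation of the two controls as described above. The linear token weight `1 - p`, the prefix‑confidence threshold `prefix_tau`, and the relaxed weight `0.1` are illustrative assumptions, not values from the paper.

```python
# One plausible instantiation (not the paper's exact formula) of AESL's
# token-level and subsequence-level controls, as a replacement for plain
# token-level cross-entropy. All thresholds and weight shapes here are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def aesl_loss(logits, targets, mask, prefix_tau=0.9, relaxed_w=0.1):
    """
    logits:  (batch, seq, vocab), already shifted to align with `targets`
    targets: (batch, seq) gold token ids
    mask:    (batch, seq) float, 1.0 on supervised positions, 0.0 elsewhere
    """
    log_probs = F.log_softmax(logits, dim=-1)
    tok_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    p = tok_logp.exp()  # model confidence on each gold token

    # Token-level control: down-weight tokens the model already predicts
    # confidently (here linearly, via 1 - p).
    token_w = (1.0 - p).detach()

    # Subsequence-level control: running mean confidence over the prefix;
    # once it exceeds prefix_tau, later tokens get a relaxed weight.
    counts = mask.cumsum(dim=1).clamp(min=1.0)
    prefix_conf = (p * mask).cumsum(dim=1) / counts
    seq_w = torch.where(prefix_conf > prefix_tau,
                        torch.full_like(p, relaxed_w),
                        torch.ones_like(p)).detach()

    weighted_nll = -(token_w * seq_w * tok_logp * mask)
    return weighted_nll.sum() / mask.sum()
```

Detaching the weights keeps gradients flowing only through the log‑likelihood term, so the loss still pushes in the cross‑entropy direction but with adaptively reduced pressure on tokens and prefixes the model has already mastered.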
Experimental Evaluation
Experiments were conducted on three base models: Qwen2.5‑7B‑Instruct, Qwen2.5‑Math‑7B, and Llama‑3.1‑8B‑Instruct. Benchmarks included AIME 24/25, AMC 23, and MATH‑500.
Across all models, the AESL‑augmented cold‑start followed by RL consistently outperformed:
Direct RL without SFT
Standard cross‑entropy SFT
Other state‑of‑the‑art cold‑start methods
Ablation studies varying data volume and difficulty splits showed that AESL maintained a stable advantage, delivering higher RL potential than traditional SFT in every setting.
Conclusion
The study demonstrates that output diversity can collapse during SFT, before RL even begins, undermining post‑training performance. AESL reshapes the LLM post‑training paradigm by emphasizing diversity over pure accuracy during the cold‑start phase, leading to superior RL outcomes across models, data sizes, and difficulty levels.
Code and paper are publicly available at https://github.com/LXXXXR/AESL and https://openreview.net/pdf?id=yezWGJmODg.