Why the Best SFT Checkpoint May Hurt RL Performance: Adaptive Early‑Stop Loss (AESL) for LLM Cold‑Start
The paper reveals that over‑optimizing supervised fine‑tuning (SFT) for large language models can diminish their reinforcement‑learning (RL) potential, proposes an Adaptive Early‑Stop Loss (AESL) that balances accuracy and output diversity during cold‑start, and demonstrates across multiple LLMs that AESL consistently yields superior RL results.
Background
Since 2025, reinforcement learning (RL) has become the default post‑training paradigm for large language models (LLMs). RL can unlock complex reasoning and long chain‑of‑thought (Long‑CoT) abilities without massive amounts of human‑annotated data.
Problem
Directly applying RL to a vanilla base model leads to aimless exploration because the model lacks directional guidance. The standard remedy is a lightweight supervised fine‑tuning (SFT) “cold‑start” with a small high‑quality dataset before RL.
Fatal Trap in SFT Cold‑Start
A study accepted at ICLR 2026 (HKUST, Alibaba, Xiamen University) discovered that the checkpoint with the highest validation accuracy after SFT does **not** correspond to the greatest RL potential. In many cases, the best‑performing SFT checkpoint yields poorer RL results, and sometimes even outright regressions.
Root Cause Analysis
Limited data size: Over‑optimizing on a small SFT dataset causes over‑fitting, turning the model into a memorizer rather than a generalizer.
Exploration‑exploitation imbalance: Excessive SFT reduces output diversity, shrinking the exploration space needed for successful RL.
Diversity as Early‑Stop Signal
Tracking entropy and self‑BLEU during SFT reveals a “golden turning point”: early in SFT the model retains high diversity while learning new reasoning patterns; later diversity collapses as over‑fitting intensifies. This diversity peak aligns with the highest RL potential.
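To make these signals concrete, here is a minimal sketch, assuming a Hugging Face‑style causal LM and a set of sampled completions per prompt, of how one might log both metrics per checkpoint. This is not the authors' code, and nltk's BLEU is just one of several ways to compute self‑BLEU.

```python
# Minimal sketch (not the paper's code) of the two diversity signals:
# mean next-token entropy and self-BLEU over sampled completions.
# Assumes torch and nltk are installed; `model` is a Hugging Face-style
# causal LM whose forward pass returns `.logits`.
import torch
import torch.nn.functional as F
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

@torch.no_grad()
def mean_token_entropy(model, input_ids, attention_mask):
    """Average next-token entropy (in nats) over non-padding positions."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq)
    return (entropy * attention_mask).sum() / attention_mask.sum()

def self_bleu(samples):
    """Mean BLEU of each completion against the others (needs >= 2 samples).
    Higher self-BLEU = more similar samples = lower diversity."""
    smooth = SmoothingFunction().method1
    tokenized = [s.split() for s in samples]
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]
        scores.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
    return sum(scores) / len(scores)
```

Logged per checkpoint, rising self‑BLEU together with falling entropy would flag the turning point the authors describe as the place to stop SFT.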
Adaptive Early‑Stop Loss (AESL)
To obtain a more nuanced cold‑start, the authors replace the standard cross‑entropy loss with a lightweight Adaptive Early‑Stop Loss (AESL). AESL dynamically adjusts the learning weight at two granular levels:
Token‑level control: When the model's predicted probability for a token is already high, AESL reduces that token's loss weight, preventing over‑fitting on easy tokens.
Subsequence‑level control: AESL computes the average confidence of the generated prefix; if the prefix already matches the target distribution, later tokens receive a relaxed loss, encouraging exploration beyond the memorized pattern.
The mathematical formulation of AESL and its adaptive weights are given in the paper (Figures 2 and 3).
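The paper has the exact loss; the sketch below is only one plausible instantiation of the two controls as described above. The linear token weight `1 - p`, the prefix‑confidence threshold `prefix_tau`, and the relaxed weight `0.1` are illustrative assumptions, not values from the paper.

```python
# One plausible instantiation (not the paper's exact formula) of AESL's
# token-level and subsequence-level controls, as a replacement for plain
# token-level cross-entropy. All thresholds and weight shapes here are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def aesl_loss(logits, targets, mask, prefix_tau=0.9, relaxed_w=0.1):
    """
    logits:  (batch, seq, vocab), already shifted to align with `targets`
    targets: (batch, seq) gold token ids
    mask:    (batch, seq) float, 1.0 on supervised positions, 0.0 elsewhere
    """
    log_probs = F.log_softmax(logits, dim=-1)
    tok_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    p = tok_logp.exp()  # model confidence on each gold token

    # Token-level control: down-weight tokens the model already predicts
    # confidently (here linearly, via 1 - p).
    token_w = (1.0 - p).detach()

    # Subsequence-level control: running mean confidence over the prefix;
    # once it exceeds prefix_tau, later tokens get a relaxed weight.
    counts = mask.cumsum(dim=1).clamp(min=1.0)
    prefix_conf = (p * mask).cumsum(dim=1) / counts
    seq_w = torch.where(prefix_conf > prefix_tau,
                        torch.full_like(p, relaxed_w),
                        torch.ones_like(p)).detach()

    weighted_nll = -(token_w * seq_w * tok_logp * mask)
    return weighted_nll.sum() / mask.sum()
```

Detaching the weights keeps gradients flowing only through the log‑likelihood term, so the loss still pushes in the cross‑entropy direction but with adaptively reduced pressure on tokens and prefixes the model has already mastered.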
Experimental Evaluation
Experiments were conducted on three base models: Qwen2.5‑7B‑Instruct, Qwen2.5‑Math‑7B, and Llama‑3.1‑8B‑Instruct. Benchmarks included AIME 24/25, AMC 23, and MATH‑500.
Across all models, the AESL‑augmented cold‑start followed by RL consistently outperformed:
Direct RL without SFT
Standard cross‑entropy SFT
Other state‑of‑the‑art cold‑start methods
Ablation studies varying data volume and difficulty splits showed that AESL maintained a stable advantage, delivering higher RL potential than traditional SFT in every setting.
Conclusion
The study demonstrates that output diversity can collapse during SFT, before RL even begins, undermining post‑training performance. AESL reshapes the LLM post‑training paradigm by emphasizing diversity over pure accuracy during the cold‑start phase, leading to superior RL outcomes across models, data sizes, and difficulty levels.
Code and paper are publicly available at https://github.com/LXXXXR/AESL and https://openreview.net/pdf?id=yezWGJmODg.