Why Merge SFT and RL? Exploring Unified Fine‑Tuning Strategies for LLMs
This article examines the necessity of integrating Supervised Fine‑Tuning (SFT) with Reinforcement Learning (RL) for large language models, surveys alternating, sample‑reuse, simultaneous, and hint‑guided fusion methods, presents the underlying loss functions, and discusses practical trade‑offs such as entropy collapse and importance‑sampling corrections.
1. Why Fuse SFT and RL?
Reinforcement learning can improve a model's reasoning ability, but only if the base model already possesses the relevant skills; otherwise exploration is severely limited. Consequently, most pipelines first apply Supervised Fine-Tuning (SFT) to endow the model with basic capabilities and then use RL to refine them. Recent studies, however, argue that this two-stage approach is sub-optimal and propose fusing the two into a single stage.
2. Background: Standard LLM Training Pipeline
Typical large-language-model training has three stages: pre-training (self-supervised, on massive data), followed by two post-training stages, SFT and RL. Both post-training stages rely on a diverse prompt collection.
2.1 Supervised Fine‑Tuning (SFT)
During SFT, a high-quality response is obtained for each prompt from expert-written data, synthetic augmentation, or distillation from a stronger model. Let y denote the response of a human expert or stronger model to prompt x; the SFT loss is:
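A common way to write this loss is the token-level negative log-likelihood. The symbols below are assumed notation: x is the prompt, y the reference response, D the SFT dataset, and π_θ the model being trained.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}}
    \left[ \sum_{t=1}^{|y|} \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right) \right]
```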
The gradient of this loss is shown below:
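For the standard negative log-likelihood form of the SFT loss, differentiation passes through the expectation because the data distribution does not depend on θ. In assumed notation, with (x, y) a prompt-response pair drawn from the SFT dataset D:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}}
    \left[ \sum_{t=1}^{|y|} \nabla_\theta \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right) \right]
```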
2.2 Reinforcement Learning (RL)
RL is applied after SFT. In an on‑policy setting, the current policy π samples responses for each prompt. The RL loss is:
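In its usual form the objective is the negative expected reward of responses sampled from the current policy. R(x, y) is an assumed name for the scalar reward of response y to prompt x, and D for the prompt set.

```latex
\mathcal{L}_{\mathrm{RL}}(\theta)
  = -\,\mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}
    \left[ R(x, y) \right]
```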
The corresponding policy gradient is:
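Via the log-derivative trick, and for a loss defined as the negative expected reward R(x, y) (assumed notation), the standard REINFORCE form is:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{RL}}(\theta)
  = -\,\mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}
    \left[ R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]
```

Compared with the SFT gradient, the only differences are where y comes from (the policy itself rather than a dataset) and the reward weighting each log-probability term.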
Using a baseline to reduce variance yields the baseline‑adjusted REINFORCE gradient:
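With b(x) as an assumed name for a baseline that depends only on the prompt, the textbook form is:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{RL}}(\theta)
  = -\,\mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}
    \left[ \bigl(R(x, y) - b(x)\bigr)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]
```

Subtracting b(x) leaves the expectation unchanged, because the expected value of \nabla_\theta \log \pi_\theta(y \mid x) under y ~ π_θ is zero for any term that does not depend on y.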
2.2.1 GRPO
GRPO (Group Relative Policy Optimization) is a critic-free variant of PPO that samples multiple responses per prompt, each receiving a scalar reward. It approximates advantages by standardising rewards within each prompt's group of responses:
Here, the advantage term refers to the i-th token of the j-th response; every token of a response shares that response's standardised reward.
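Written out, assuming G responses per prompt with scalar rewards r_1, ..., r_G (the symbol names are assumptions):

```latex
\hat{A}_{j,i}
  = \frac{r_j - \operatorname{mean}(r_1, \dots, r_G)}
         {\operatorname{std}(r_1, \dots, r_G)}
```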
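The group standardisation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the `eps` stabiliser, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Standardise the scalar rewards of one prompt's group of responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumed choice)
    # eps keeps the division finite when every response has the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

def token_advantages(rewards, response_lengths):
    """Give every token of response j the same advantage as response j."""
    advs = grpo_advantages(rewards)
    return [[a] * n for a, n in zip(advs, response_lengths)]
```

When all responses in a group receive the same reward, every advantage is zero and the prompt contributes no gradient, which is one reason hard prompts with no successful rollouts are a problem for this family of methods.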
3. Alternating SFT and RL
ReLIFT [1] proposes an alternating scheme: during RL, prompts whose rollouts are all wrong are stored in a buffer; once the buffer is full, those samples are used for an SFT phase.
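The control flow of such an alternating scheme might look like the sketch below. The class name, the `rl_step` and `sft_step` callables, and the rule that a reward at or below zero counts as a failure are all hypothetical; storing a reference answer alongside each failed prompt is also an assumption, made because an SFT step needs a target.

```python
class AlternatingTrainer:
    """Runs RL steps and triggers an SFT step whenever the failure buffer fills."""

    def __init__(self, rl_step, sft_step, buffer_size):
        self.rl_step = rl_step          # callable(prompt, rewards): one RL update
        self.sft_step = sft_step        # callable(list of (prompt, reference)): one SFT update
        self.buffer_size = buffer_size
        self.buffer = []

    def observe(self, prompt, rewards, reference):
        """rewards holds one scalar per rollout of `prompt`."""
        self.rl_step(prompt, rewards)
        # Buffer the prompt only when every rollout failed
        if all(r <= 0 for r in rewards):
            self.buffer.append((prompt, reference))
        # A full buffer triggers an SFT phase, after which the buffer is emptied
        if len(self.buffer) >= self.buffer_size:
            self.sft_step(list(self.buffer))
            self.buffer.clear()
```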
4. Using SFT as Off‑Policy Samples
LUFFY [2] treats SFT data as off-policy samples and incorporates them into RL via importance sampling. Notation: N_on denotes trajectories obtained by directly rolling out the current policy (on-policy); SFT denotes the SFT data, which serve as off-policy trajectories.
4.1 Mixing On‑Policy and Off‑Policy Samples
The simplest approach mixes off‑policy samples into the on‑policy batch, yielding a combined loss:
where α is a normalisation factor. The original importance‑sampling term is replaced by a corrected coefficient to account for the true off‑policy distribution:
4.2 Importance-Sampling Correction
Applying the corrected coefficient to the combined loss above produces the final mixed loss:
Training with this loss resolves gradient bias but can cause rapid entropy collapse, as shown in the left‑hand plot.
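To make the role of the importance weight concrete, the sketch below evaluates a generic importance-weighted loss over a mixed batch. It is not LUFFY's exact objective: clipping is omitted, normalisation is a plain token count, and the dictionary keys are hypothetical names.

```python
import math

def mixed_policy_loss(on_policy, off_policy):
    """
    Each sample is a dict with (hypothetical keys):
      'logp'       per-token log-probs under the current policy
      'logp_behav' per-token log-probs under the policy that produced the sample
      'adv'        scalar advantage shared by all tokens of the sample
    Returns the negative importance-weighted advantage, averaged over tokens.
    """
    total, n_tokens = 0.0, 0
    for sample in on_policy + off_policy:
        for lp, lp_b in zip(sample["logp"], sample["logp_behav"]):
            ratio = math.exp(lp - lp_b)   # importance weight pi_theta / pi_behaviour
            total += ratio * sample["adv"]
            n_tokens += 1
    return -total / n_tokens
```

For on-policy rollouts the two log-probabilities coincide and the ratio is 1. For SFT demonstrations the behaviour probabilities are usually unknown, so some stand-in is needed; treating them as 1 (`logp_behav = 0`) is one possible choice and is an assumption here.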
5. Simultaneous SFT and RL
SRFT [6] combines SFT, off-policy RL, and on-policy RL in a single stage. The weighted SFT loss (to down-weight high-entropy samples) is:
The off‑policy RL loss mirrors LUFFY’s formulation:
The standard on-policy RL loss for binary rewards {+1, -1} is:
SRFT adds an entropy‑based weight to the positive‑sample part to mitigate entropy collapse:
The final loss is the sum of the weighted SFT loss, the off‑policy RL loss, and the on‑policy RL loss:
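How the three terms might be combined is sketched below. The exponential form of both entropy weights, the function name, and the argument names are assumptions made for illustration. Only the qualitative behaviour comes from the text: the SFT weight shrinks as entropy grows, and the positive-sample term carries an entropy-dependent weight.

```python
import math

def _mean(values):
    return sum(values) / len(values) if values else 0.0

def single_stage_loss(sft_nll, off_policy_loss, pos_logps, neg_logps, policy_entropy):
    """
    sft_nll         negative log-likelihood on SFT demonstrations
    off_policy_loss off-policy RL loss, computed elsewhere
    pos_logps       sequence log-probs of rollouts with reward +1
    neg_logps       sequence log-probs of rollouts with reward -1
    policy_entropy  current policy entropy, treated as a constant (no gradient)
    """
    w_sft = math.exp(-policy_entropy)  # assumed form: higher entropy, smaller SFT weight
    w_pos = math.exp(policy_entropy)   # assumed form: lower entropy, weaker push on positives
    # Binary-reward REINFORCE: raise log-probs of positives, lower those of negatives
    on_policy_loss = -w_pos * _mean(pos_logps) + _mean(neg_logps)
    return w_sft * sft_nll + off_policy_loss + on_policy_loss
```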
6. Using SFT as Hint
A "hint" is a problem concatenated with part of a correct answer. Standard RL struggles to obtain positive rollouts for hard problems; SFT data provide natural positive samples whose prefixes can be attached as hints, guiding the policy during rollout.
6.1 Constructing Effective Hints
Dynamic hint‑length adjustment (used in [3] and [5]) gradually reduces hint size as training progresses, either via cosine‑annealed scaling or sampling from a binomial distribution based on success probability.
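Both mechanisms are easy to sketch. In the snippet below the cosine schedule sets the fraction of the solution to reveal, and that fraction then serves as the per-token success probability of a binomial draw. Chaining the two this way, as well as the function and parameter names, is an assumption made for illustration.

```python
import math
import random

def cosine_hint_fraction(step, total_steps, start=1.0, end=0.0):
    """Cosine-annealed fraction of the solution revealed as a hint."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))

def sample_hint_length(solution_len, fraction, rng=random):
    """Binomial(solution_len, fraction) draw: number of solution tokens to reveal."""
    return sum(1 for _ in range(solution_len) if rng.random() < fraction)
```

Sampling the length instead of using `round(fraction * solution_len)` keeps some variety in hint sizes at every stage of training, so the policy still sees short hints early on and long hints late.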
Rollout‑guided hint tuning (proposed in [4]) employs a binary search: if all rollouts fail, lengthen the hint; if all succeed, shorten it; otherwise, keep the current length.
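The search rule can be sketched as follows. `rollout_fn` is a hypothetical callable that runs several rollouts with a hint of the given length and returns one boolean per rollout; the iteration cap and the integer bounds are likewise assumptions.

```python
def search_hint_length(lo, hi, rollout_fn, max_iters=10):
    """Binary search for a hint length in [lo, hi] that yields mixed outcomes."""
    length = (lo + hi) // 2
    for _ in range(max_iters):
        results = rollout_fn(length)   # list of bools, True = correct rollout
        if not any(results):           # all rollouts failed: lengthen the hint
            lo = length + 1
        elif all(results):             # all rollouts succeeded: shorten the hint
            hi = length - 1
        else:                          # mixed outcomes: keep the current length
            return length
        if lo > hi:                    # search interval exhausted
            break
        length = (lo + hi) // 2
    return length
```

Stopping at a length with mixed outcomes matters for group-normalised methods: only groups containing both successes and failures produce non-zero advantages.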
6.2 Training Strategies with Hints
Standard RL can treat hint-augmented rollouts as ordinary rollouts. Some works (e.g., [5]) mix hint-based rollouts with regular rollouts; to avoid instability, they filter the SFT-derived tokens and keep only the top-k% with the highest entropy for gradient updates.
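A minimal sketch of such an entropy filter is shown below. It assumes per-token entropies have already been computed for the hint tokens; the function name and the rule of keeping at least one token are illustrative choices, not details taken from [5].

```python
import math

def top_entropy_mask(entropies, k_percent):
    """Return a boolean mask that keeps the k_percent highest-entropy tokens."""
    n_keep = max(1, math.ceil(len(entropies) * k_percent / 100))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(entropies))]
```

Tokens where the mask is `False` would then be excluded from the gradient update, for example by zeroing their loss terms.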
Other approaches assign the SFT loss to the hint portion and the RL loss to the rollout portion.
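A minimal sketch of this split for a single sequence, assuming per-token log-probabilities under the current policy and one scalar advantage for the rollout part (the function name and the `sft_weight` parameter are hypothetical):

```python
def hint_rollout_loss(logps, hint_len, advantage, sft_weight=1.0):
    """
    logps     per-token log-probs of the full sequence (hint followed by rollout)
    hint_len  number of leading tokens that came from the SFT solution
    advantage scalar advantage assigned to the model-generated rollout
    """
    hint_logps, rollout_logps = logps[:hint_len], logps[hint_len:]
    sft_loss = -sum(hint_logps)                # negative log-likelihood on the hint
    rl_loss = -advantage * sum(rollout_logps)  # REINFORCE-style term on the rollout
    return sft_weight * sft_loss + rl_loss
```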
References
[1] Learning What Reinforcement Learning Can't: Interleaved Online Fine‑Tuning for Hardest Questions – https://arxiv.org/pdf/2506.07527
[2] Learning to Reason under Off‑Policy Guidance – https://arxiv.org/pdf/2504.14945
[3] UFT: Unifying Supervised and Reinforcement Fine‑Tuning – https://arxiv.org/pdf/2505.16984
[4] BREAD: Branched Rollouts from Expert Anchors Bridge SFT & RL for Reasoning – https://arxiv.org/pdf/2506.17211
[5] Blending Supervised and Reinforcement Fine‑Tuning with Prefix Sampling – https://arxiv.org/pdf/2507.01679
[6] SRFT: A Single‑Stage Method with Supervised and Reinforcement Fine‑Tuning for Reasoning – https://arxiv.org/pdf/2506.19767
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
