Why Merge SFT and RL? Exploring Unified Fine‑Tuning Strategies for LLMs

This article examines the necessity of integrating Supervised Fine‑Tuning (SFT) with Reinforcement Learning (RL) for large language models, surveys alternating, sample‑reuse, simultaneous, and hint‑guided fusion methods, presents the underlying loss functions, and discusses practical trade‑offs such as entropy collapse and importance‑sampling corrections.

Data Party THU

1. Why Fuse SFT and RL?

Reinforcement Learning (RL) can improve a model's reasoning ability, but only if the base model already has the relevant skills; otherwise exploration rarely produces a rewarded response and RL has little signal to learn from. Consequently, most pipelines first apply Supervised Fine-Tuning (SFT) to give the model basic capabilities and then use RL to refine them. Recent studies, however, argue that this two-stage approach is sub-optimal and propose fusing the two into a single stage.

2. Background: Standard LLM Training Pipeline

Typical large-language-model training consists of three stages: pre-training (self-supervised learning on massive data), followed by two post-training stages, SFT and RL. Both post-training stages rely on a diverse prompt collection.

2.1 Supervised Fine‑Tuning (SFT)

During SFT, high‑quality responses are generated for each prompt using expert‑written data, synthetic augmentation, or strong model distillation. Assuming y denotes a human expert or stronger model output, the SFT loss is:

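$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{SFT}}}\left[\log \pi_\theta(y \mid x)\right]$$

Here x is a prompt, \mathcal{D}_{\mathrm{SFT}} is the set of prompt-response pairs, and \pi_\theta is the model being trained.

The gradient of this loss is shown below:

$$\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{SFT}}}\left[\nabla_\theta \log \pi_\theta(y \mid x)\right]$$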
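As a rough illustration, the SFT loss can be read as the negative log-likelihood of the reference response, summed over its tokens. The sketch below uses plain Python; the function names and the toy probabilities are assumptions made for this example, not part of any cited method.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_loss(token_probs):
    """Negative log-likelihood of one reference response.

    token_probs[t] is the probability the model assigns to the t-th
    reference token given the prompt and the earlier reference tokens.
    """
    return -sum(math.log(p) for p in token_probs)

def sft_grad_wrt_logits(logits, target):
    """Gradient of -log softmax(logits)[target] with respect to the logits."""
    probs = softmax(logits)
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]

# Hypothetical per-token probabilities for a 3-token expert response.
probs = [0.5, 0.25, 0.8]
loss = sft_loss(probs)

# Gradient for a single token position with a 3-word toy vocabulary.
grad = sft_grad_wrt_logits([0.0, 0.0, 0.0], target=1)
```

A gradient-descent step moves against `grad`, so the logit of the reference token rises and the others fall; this is how SFT pulls probability mass towards the expert response.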

2.2 Reinforcement Learning (RL)

RL is applied after SFT. In an on‑policy setting, the current policy π samples responses for each prompt. The RL loss is:

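$$\mathcal{L}_{\mathrm{RL}}(\theta) = -\,\mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot \mid x)}\left[R(x,y)\right]$$

where R(x,y) is the scalar reward of response y to prompt x.

The corresponding policy gradient is:

$$\nabla_\theta \mathcal{L}_{\mathrm{RL}}(\theta) = -\,\mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot \mid x)}\left[R(x,y)\,\nabla_\theta \log \pi_\theta(y \mid x)\right]$$

Using a baseline b(x) to reduce variance yields the baseline-adjusted REINFORCE gradient:

$$\nabla_\theta \mathcal{L}_{\mathrm{RL}}(\theta) = -\,\mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot \mid x)}\left[\bigl(R(x,y) - b(x)\bigr)\,\nabla_\theta \log \pi_\theta(y \mid x)\right]$$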
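To make the baseline-adjusted estimator concrete, here is a minimal sketch on a toy softmax policy over three actions, standing in for whole responses. Everything here is an assumption for illustration: the function names, the reward function, and the use of the batch-mean reward as the baseline (a common choice that introduces a small bias because the baseline is computed from the same samples). The function returns the gradient of the expected reward, i.e. the negative of the loss gradient.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_grad(logits, reward_fn, num_samples, rng, use_baseline=True):
    """Monte Carlo estimate of d E[R] / d logits for a softmax policy."""
    probs = softmax(logits)
    actions = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    rewards = [reward_fn(a) for a in actions]
    baseline = sum(rewards) / len(rewards) if use_baseline else 0.0
    grad = [0.0] * len(logits)
    for a, r in zip(actions, rewards):
        for k in range(len(logits)):
            # d log pi(a) / d logit_k = 1[a == k] - pi(k)
            score = (1.0 if a == k else 0.0) - probs[k]
            grad[k] += (r - baseline) * score / num_samples
    return grad

# Toy setting: only action 2 is rewarded.
rng = random.Random(0)
grad = reinforce_grad([0.0, 0.0, 0.0], lambda a: 1.0 if a == 2 else 0.0, 2000, rng)
```

Following `grad` uphill raises the logit of the rewarded action and lowers the others. Note that if the policy never samples a rewarded action, every reward equals the baseline and the estimate is zero, which is the exploration problem that motivates bringing SFT data into RL.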

2.2.1 GRPO

GRPO (Group Relative Policy Optimization) is a critic-free variant of PPO that samples multiple responses per prompt, each receiving a scalar reward. It approximates advantages by standardising the rewards within each prompt's group of responses:

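$$\hat{A}_{j,i} = \frac{R_j - \operatorname{mean}\left(\{R_1, \dots, R_G\}\right)}{\operatorname{std}\left(\{R_1, \dots, R_G\}\right)}$$

Here, R_j is the reward of the j-th of the G responses sampled for a prompt, and \hat{A}_{j,i} is the advantage of the i-th token of the j-th response; every token of a response shares the same value.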
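The group-wise standardisation can be sketched in a few lines. The function names and the small `eps` added to the denominator are assumptions for this example, and the population standard deviation is used here although implementations may prefer the sample version.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Standardise the rewards of one prompt's group of responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def broadcast_to_tokens(advantages, response_lengths):
    """Give every token of a response the advantage of that response."""
    return [[a] * n for a, n in zip(advantages, response_lengths)]

# Four responses to one prompt with binary rewards.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
token_adv = broadcast_to_tokens(adv, [3, 2, 4, 1])

# A group where every response fails carries no learning signal.
flat = grpo_advantages([0.0, 0.0, 0.0, 0.0])
```

The all-zero result for `flat` shows why hard prompts stall under GRPO: when no rollout succeeds, all advantages vanish and the prompt contributes no gradient.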

3. Alternating SFT and RL

ReLIFT [1] proposes an alternating scheme: during RL, the hardest questions, those on which every rollout is wrong, are stored in a buffer together with a high-quality solution; once the buffer is full, those samples are used for an SFT update.

[Image: the ReLIFT alternating RL and SFT scheme]
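The alternating idea can be sketched as a single training step with a buffer. This is a loose sketch, not ReLIFT's implementation: all callables (`rollout_fn`, `is_correct`, `reference_fn`, `rl_update`, `sft_update`) are hypothetical placeholders, and the sketch assumes the buffer holds each failed prompt paired with a reference answer.

```python
def alternating_step(prompts, rollout_fn, is_correct, reference_fn,
                     rl_update, sft_update, buffer, buffer_size):
    """Run RL on a batch, buffer fully failed prompts, flush to SFT when full.

    Returns True if an SFT update was triggered in this step.
    """
    for prompt in prompts:
        rollouts = rollout_fn(prompt)
        rl_update(prompt, rollouts)
        if not any(is_correct(prompt, r) for r in rollouts):
            # Every rollout failed: keep the prompt with a reference answer.
            buffer.append((prompt, reference_fn(prompt)))
    if len(buffer) >= buffer_size:
        sft_update(list(buffer))
        buffer.clear()
        return True
    return False

# Toy usage with stand-in callables.
rl_calls, sft_calls, buffer = [], [], []
flushed = alternating_step(
    prompts=["easy", "hard1", "hard2"],
    rollout_fn=lambda p: ["wrong", "wrong"] if p.startswith("hard") else ["right", "wrong"],
    is_correct=lambda p, r: r == "right",
    reference_fn=lambda p: "ref:" + p,
    rl_update=lambda p, rs: rl_calls.append(p),
    sft_update=lambda batch: sft_calls.append(batch),
    buffer=buffer,
    buffer_size=2,
)
```

Keeping the buffer outside the function lets it persist across steps, so SFT only fires once enough hard prompts have accumulated.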

4. Using SFT as Off‑Policy Samples

LUFFY [2] treats SFT data as off-policy samples and incorporates them into RL via importance sampling. Notation: N_on denotes trajectories obtained by directly rolling out the current policy; SFT denotes SFT data.

[Image]
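For readers unfamiliar with importance sampling in this setting, the sketch below shows the generic PPO-style ratio and clipped surrogate for a single token. It is the textbook form, not LUFFY's exact coefficient; the function names, the behaviour-policy probability, and `clip_eps=0.2` are assumptions made for illustration.

```python
import math

def importance_ratio(logp_current, logp_behaviour):
    """pi_theta(token) / pi_behaviour(token), computed from log-probabilities."""
    return math.exp(logp_current - logp_behaviour)

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one token (to be maximised)."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

# The current policy likes this token more than the behaviour policy did.
ratio = importance_ratio(math.log(0.6), math.log(0.4))
positive = clipped_surrogate(ratio, advantage=1.0)

# A token that became less likely, with a negative advantage.
negative = clipped_surrogate(0.5, advantage=-1.0)
```

For SFT data the behaviour policy is the expert or stronger model that wrote the response, and its token probabilities are usually unavailable, which is why the ratio needs special treatment when such samples are mixed into RL.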

4.1 Mixing On‑Policy and Off‑Policy Samples

The simplest approach mixes off‑policy samples into the on‑policy batch, yielding a combined loss:

[Equation 8 image: combined on-policy and off-policy loss]

where α is a normalisation factor. The original importance‑sampling term is replaced by a corrected coefficient to account for the true off‑policy distribution:

[Equation 9 image: corrected importance-sampling coefficient]

4.3 Importance‑Sampling Correction

Applying the corrected coefficient (Equation 9) to Equation 8 produces the final mixed loss:

[Equation image: final mixed loss]

Training with this loss resolves gradient bias but can cause rapid entropy collapse, as shown in the left‑hand plot.

5. Simultaneous SFT and RL

SRFT [6] combines SFT, off-policy RL, and on-policy RL in a single stage. The weighted SFT loss (to down-weight high-entropy samples) is:

[Equation image: entropy-weighted SFT loss]
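One way to down-weight high-entropy samples is to scale the SFT loss by a decreasing function of the policy's entropy. The sketch below uses exp(-H) with H the mean token entropy; this particular choice, like the function names, is an assumption for illustration, and the weighting SRFT actually uses may differ. In training, such a weight would be treated as a constant (no gradient flows through it).

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weight(token_distributions):
    """exp(-mean entropy): 1 when fully confident, smaller when uncertain."""
    mean_h = sum(token_entropy(d) for d in token_distributions) / len(token_distributions)
    return math.exp(-mean_h)

def weighted_sft_loss(nll, token_distributions):
    """Scale a sample's negative log-likelihood by its entropy weight."""
    return entropy_weight(token_distributions) * nll

confident = entropy_weight([[1.0, 0.0, 0.0, 0.0]])
uncertain = entropy_weight([[0.25, 0.25, 0.25, 0.25]])
loss = weighted_sft_loss(2.0, [[0.25, 0.25, 0.25, 0.25]])
```

With a uniform distribution over four tokens the entropy is ln 4, so the weight is 1/4 and the sample's contribution to the SFT loss shrinks accordingly.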

The off‑policy RL loss mirrors LUFFY’s formulation:

[Equation image: off-policy RL loss]

The standard on‑policy RL loss for binary rewards {+1, ‑1} is:

[Equation image: on-policy RL loss with binary rewards]

SRFT adds an entropy‑based weight to the positive‑sample part to mitigate entropy collapse:

[Equation image: on-policy RL loss with entropy-weighted positive samples]

The final loss is the sum of the weighted SFT loss, the off‑policy RL loss, and the on‑policy RL loss:

[Equation image: final SRFT loss]

6. Using SFT as Hint

A "hint" concatenates a problem with a partial correct answer, that is, the first part of a known-good solution. Standard RL struggles to obtain positive rollouts for hard problems; SFT data provides natural positive samples whose prefixes can be attached as hints, guiding the policy during rollout.

6.1 Constructing Effective Hints

Dynamic hint‑length adjustment (used in [3] and [5]) gradually reduces hint size as training progresses, either via cosine‑annealed scaling or sampling from a binomial distribution based on success probability.
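A minimal sketch of such a schedule, assuming the hint is measured as a fraction of the reference solution's tokens: the fraction decays along a cosine curve, and the actual hint length is drawn from a binomial distribution with that fraction as its success probability. Combining the two in this way, and all names and defaults below, are assumptions for illustration rather than the exact recipe of [3] or [5].

```python
import math
import random

def cosine_hint_fraction(step, total_steps, start=1.0, end=0.0):
    """Cosine-annealed hint fraction, from `start` at step 0 to `end` at the end."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))

def sample_hint_length(solution_len, fraction, rng):
    """Draw from Binomial(solution_len, fraction); the mean is fraction * solution_len."""
    return sum(1 for _ in range(solution_len) if rng.random() < fraction)

rng = random.Random(0)
schedule = [cosine_hint_fraction(s, 100) for s in (0, 50, 100)]
hint_len = sample_hint_length(40, schedule[1], rng)
```

Sampling the length instead of using the deterministic fraction exposes the policy to a spread of hint sizes at every stage of training.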

Rollout‑guided hint tuning (proposed in [4]) employs a binary search: if all rollouts fail, lengthen the hint; if all succeed, shorten it; otherwise, keep the current length.
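The rule above can be sketched as one binary-search step over the hint length, keeping a lower and an upper bound per problem. The function name, the bound bookkeeping, and the midpoint rounding are assumptions for this sketch and are not taken from [4].

```python
def adjust_hint(length, lo, hi, num_correct, num_rollouts):
    """One binary-search step over the hint length within [lo, hi].

    Returns the new (length, lo, hi).
    """
    if num_correct == 0:
        # All rollouts failed: search the longer half.
        lo = min(length + 1, hi)
        length = (lo + hi + 1) // 2
    elif num_correct == num_rollouts:
        # All rollouts succeeded: search the shorter half.
        hi = max(length - 1, lo)
        length = (lo + hi) // 2
    # Mixed outcomes: the hint is informative, keep it unchanged.
    return length, lo, hi

longer = adjust_hint(4, 0, 8, num_correct=0, num_rollouts=8)
shorter = adjust_hint(4, 0, 8, num_correct=8, num_rollouts=8)
same = adjust_hint(4, 0, 8, num_correct=3, num_rollouts=8)
```

A mixed outcome is the desirable state: some rollouts succeed and some fail, so group-relative advantages are non-zero and the prompt yields a learning signal.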

6.2 Training Strategies with Hints

Standard RL can treat hint-augmented rollouts as ordinary rollouts. Some works (e.g., [5]) mix hint-based rollouts with regular rollouts; to avoid instability, they filter the SFT-derived tokens and keep only the top k% with the highest entropy for gradient updates.
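The entropy filter can be sketched as a boolean mask over the hint tokens. The function name and the rounding rule (ceiling, with at least one token kept) are assumptions for this example.

```python
import math

def top_entropy_mask(entropies, top_percent):
    """Mark the top `top_percent` % highest-entropy tokens for gradient updates."""
    n_keep = max(1, math.ceil(len(entropies) * top_percent / 100))
    order = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(order[:n_keep])
    return [i in keep for i in range(len(entropies))]

# Hypothetical per-token entropies of the policy on four hint tokens.
mask = top_entropy_mask([0.1, 2.0, 0.5, 1.5], top_percent=50)
```

Tokens where the mask is `False` would contribute nothing to the gradient, so updates concentrate on positions where the policy is still uncertain.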

Other approaches assign the SFT loss to the hint portion and the RL loss to the rollout portion, as illustrated below:

[Image: SFT loss on the hint tokens, RL loss on the rollout tokens]
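A minimal sketch of this split, assuming a single sequence made of hint tokens followed by rollout tokens: the hint part gets a negative log-likelihood term, and the rollout part gets a REINFORCE-style surrogate weighted by the sequence's advantage. The function name, the `sft_weight` parameter, and the simple unclipped surrogate are assumptions for illustration; the cited methods may define the objective differently.

```python
def hybrid_loss(token_logps, hint_len, advantage, sft_weight=1.0):
    """SFT loss on the hint tokens plus a policy-gradient surrogate on the rest.

    token_logps[t] is the policy's log-probability of token t; the first
    `hint_len` tokens come from the reference solution, the rest were sampled.
    """
    hint_part = -sum(token_logps[:hint_len])
    rollout_part = -advantage * sum(token_logps[hint_len:])
    return sft_weight * hint_part + rollout_part

# Two hint tokens followed by two sampled tokens.
loss = hybrid_loss([-1.0, -2.0, -3.0, -4.0], hint_len=2, advantage=0.5)
```

With `hint_len=0` the expression reduces to the plain RL surrogate, and with the hint covering the whole sequence it reduces to the SFT loss, so the hint length interpolates between the two regimes.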

References

[1] Learning What Reinforcement Learning Can't: Interleaved Online Fine‑Tuning for Hardest Questions – https://arxiv.org/pdf/2506.07527

[2] Learning to Reason under Off‑Policy Guidance – https://arxiv.org/pdf/2504.14945

[3] UFT: Unifying Supervised and Reinforcement Fine‑Tuning – https://arxiv.org/pdf/2505.16984

[4] BREAD: Branched Rollouts from Expert Anchors Bridge SFT & RL for Reasoning – https://arxiv.org/pdf/2506.17211

[5] Blending Supervised and Reinforcement Fine‑Tuning with Prefix Sampling – https://arxiv.org/pdf/2507.01679

[6] SRFT: A Single‑Stage Method with Supervised and Reinforcement Fine‑Tuning for Reasoning – https://arxiv.org/pdf/2506.19767
