Why Reinforcement Learning Preserves LLM Generality Better Than Supervised Fine‑Tuning
The article analyzes why reinforcement learning (RL) fine‑tuning retains a large language model's general abilities better than supervised fine‑tuning (SFT), explaining the off‑policy distribution shift of SFT and the on‑policy data consistency, KL penalty, and trust‑region mechanisms that give RL its anti‑forgetting properties.
Paper: Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
URL: arXiv:2510.18874
Problem Definition: SFT vs. RL
SFT (Supervised Fine‑Tuning): Implements behavior cloning. Given expert trajectories (prompt + gold response), it maximizes the likelihood of reproducing those trajectories. The training data usually contain only the downstream task and omit the original pre‑training distribution.
RL (Reinforcement Learning): Refers to policy‑optimization methods such as PPO or GRPO. The objective is to maximize expected reward, typically while keeping the current policy close to a reference (often the initial model).
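To make the contrast concrete, here is a minimal sketch (not the paper's code) of the two objectives on toy tensors; `policy_logits`, `expert_tokens`, `sampled_tokens`, and `reward` are hypothetical stand-ins for the quantities described above.

```python
# Minimal sketch contrasting the two training objectives; all tensors are toy stand-ins.
import torch
import torch.nn.functional as F

def sft_loss(policy_logits, expert_tokens):
    """Behavior cloning: negative log-likelihood of externally supplied (off-policy) tokens."""
    return F.cross_entropy(policy_logits.reshape(-1, policy_logits.size(-1)),
                           expert_tokens.reshape(-1))

def rl_objective(policy_logits, sampled_tokens, reward):
    """REINFORCE-style surrogate: raise the likelihood of the model's own samples
    in proportion to the reward they earned (the KL penalty and PPO clipping that
    constrain this update are sketched further below)."""
    logp = F.log_softmax(policy_logits, dim=-1)
    token_logp = logp.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    return -(reward * token_logp.sum(-1)).mean()

# Toy shapes: batch=2, seq=8, vocab=50
logits = torch.randn(2, 8, 50)
print(sft_loss(logits, torch.randint(0, 50, (2, 8))))
print(rl_objective(logits, torch.randint(0, 50, (2, 8)), reward=torch.tensor([1.0, 0.0])))
```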
Why SFT Causes Forgetting
SFT is fundamentally off‑policy. The data distribution is supplied externally by experts or stronger models and is highly specialized, creating a large domain shift from the model’s original, general distribution.
The optimization goal is to force the model to fit the provided data (cross‑entropy loss). This is analogous to requiring a driver not only to learn to drive a car (the downstream task) but also to mimic every minute movement of an instructor. If the optimal actions for the downstream task do not overlap with the model’s pre‑trained abilities, gradient descent pushes parameters far from their original region, erasing previously learned features.
Logical conclusion: SFT induces a distribution shift. Without any constraint anchoring it to the original distribution, the model sacrifices global knowledge to accommodate the local optimum of the SFT data, leading to catastrophic forgetting.
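As a hedged illustration of this point, here is a bare-bones SFT update, with a toy linear head standing in for the language model and every tensor made up: the loss depends only on the expert tokens, so nothing in the gradient pulls the parameters back toward the pre‑training distribution unless extra regularization is added.

```python
# Minimal sketch of one SFT step on toy data (not the paper's code).
import torch
import torch.nn.functional as F

vocab, hidden = 100, 32
model = torch.nn.Linear(hidden, vocab)             # stand-in for the LM head
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

hidden_states = torch.randn(4, 16, hidden)         # [batch, seq, hidden], hypothetical
expert_tokens = torch.randint(0, vocab, (4, 16))   # gold response tokens

logits = model(hidden_states)
loss = F.cross_entropy(logits.reshape(-1, vocab), expert_tokens.reshape(-1))
loss.backward()   # the gradient direction is set entirely by the expert data
opt.step()        # no term here references the base model or its distribution
```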
How RL Mitigates Forgetting
RL exhibits anti‑forgetting properties because its training data are generated on‑policy: the model samples its own trajectories, so the training data remain inside the distribution the current policy already represents.
Self‑Consistency of On‑Policy Data
Since the responses are produced by the model itself, they naturally respect the model’s existing language style and logical flow. RL therefore explores solutions inside the model’s existing ability boundary rather than forcing it to imitate an external expert.
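Below is a minimal sketch of what on‑policy data collection looks like, using a toy embedding and linear head as hypothetical stand‑ins for the real model: the trajectories used for training are sampled from the current policy itself.

```python
# Minimal sketch of on-policy data collection (toy components, not the paper's code).
import torch

vocab, hidden, seq_len = 100, 32, 16
embed = torch.nn.Embedding(vocab, hidden)     # hypothetical stand-ins for the LM
policy_head = torch.nn.Linear(hidden, vocab)

def sample_response(prompt_state):
    """Autoregressively sample a response from the *current* policy."""
    tokens, state = [], prompt_state
    for _ in range(seq_len):
        probs = torch.softmax(policy_head(state), dim=-1)
        token = torch.multinomial(probs, num_samples=1)
        tokens.append(token)
        state = embed(token).squeeze(0)        # next state depends on the model's own sample
    return torch.cat(tokens)

# A GRPO-style group of self-generated responses to one prompt; a reward model
# (not shown) would score them, and only then does any gradient update happen.
prompt_state = torch.randn(hidden)
group = [sample_response(prompt_state) for _ in range(4)]
```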
Constraining the Optimization Objective
KL‑divergence penalty: The RL loss includes a regularization term that penalizes deviation from the initial policy. Mathematically this adds a term β · KL(π_current || π_initial) to the loss, acting like a spring that pulls the parameters back whenever they drift too far in pursuit of higher reward. Both constraints are sketched in code after the next item.
Trust‑region / clipping (e.g., PPO): PPO limits the size of each update by clipping the probability ratio r(θ) = π_θ(a|s) / π_old(a|s) to the interval [1 − ε, 1 + ε]. Once an update would push the ratio outside this band, the clipped objective stops providing gradient from that sample, preventing abrupt jumps that could erase pre‑trained knowledge.
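Both constraints are simple to state in code. The sketch below is a generic, minimal form (PyTorch-style, not this paper's implementation); `policy_logits`, `ref_logits`, `logp_new`, `logp_old`, and `advantage` are assumed inputs.

```python
import torch
import torch.nn.functional as F

# Sketch 1: per-token KL(pi_theta || pi_ref) penalty, assuming logits from the
# current policy and from a frozen copy of the initial (reference) model.
def kl_penalty(policy_logits, ref_logits):
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    return (logp.exp() * (logp - ref_logp)).sum(-1).mean()   # KL(pi_theta || pi_ref)

# The total loss would then look like: task_loss + beta * kl_penalty(...),
# with beta a tuning knob controlling how hard the "spring" pulls.

# Sketch 2: PPO's clipped surrogate. Once the probability ratio leaves
# [1 - eps, 1 + eps], the clipped branch takes over and that sample stops
# contributing gradient, so no single update can move the policy very far.
def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```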
Logical conclusion: RL produces a distribution sharpening effect. Instead of shifting the distribution to an unknown region, RL suppresses low‑reward paths and amplifies high‑reward ones, sculpting the existing distribution while preserving the model’s general capabilities.
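One way to see why this is sharpening rather than shifting: for the KL‑regularized objective max_π E_π[R] − β · KL(π || π_ref), the optimum is known in closed form as π*(y) ∝ π_ref(y) · exp(R(y)/β), i.e. a reward‑weighted reweighting of the reference distribution, so responses the base model never produces keep essentially zero probability. A small numeric sketch with toy numbers (not from the paper):

```python
# Toy illustration of the sharpening claim: the KL-regularized optimum only
# reweights the reference distribution; it cannot move mass to unsupported outputs.
import torch

pi_ref = torch.tensor([0.50, 0.30, 0.15, 0.05])   # hypothetical base distribution over 4 responses
reward = torch.tensor([0.0, 1.0, 0.0, 0.0])       # only the second response is rewarded
beta = 0.5

pi_star = pi_ref * torch.exp(reward / beta)
pi_star = pi_star / pi_star.sum()
print(pi_star)   # mass concentrates on the rewarded response; the others shrink proportionally
```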
Summary and Implications
SFT as “rote memorization”: Forces the model to memorize a fixed answer; if that answer conflicts with prior knowledge, the original representation is overwritten, causing forgetting.
RL as “self‑correction”: The model explores using its own knowledge, receives reward feedback, and, guided by the KL constraint and clipping, reinforces the correct parts of its internal representation while retaining its base capabilities.
Consequently, in the post‑training phase of large language models, reinforcement learning (e.g., PPO) is more effective at improving downstream performance without sacrificing the model’s generalization ability, compared with supervised fine‑tuning, which tends to cause distribution shift and catastrophic forgetting.
