Can 1K Fine‑Tuning Replace 100K RL Steps? Insights from Re‑distillation Research
A detailed analysis shows that a 1 K‑sample fine‑tuning stage can replicate the generalization gains of over 100 K reinforcement‑learning sampling steps, explains why the RL process is compressible, introduces a sample‑effect theory, and demonstrates that re‑distillation and small‑scale SFT dramatically improve LLM performance.
Paper and Reproduction Code
Paper: Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning
Abs: https://arxiv.org/abs/2505.17988
GitHub: https://github.com/on1262/deep-reasoning
Main Conclusions
RL Process Is Compressible
Re‑distillation (re‑distilled SFT) can match the generalization performance of R1‑style reinforcement learning (RL) using only 1 K SFT samples, whereas the original RL requires >100 K sampling steps. This shows that the RL advantage is not inherent: the same performance can be obtained with a much smaller data budget.
Small‑Scale SFT as a Crucial Cold‑Start
Performing SFT with 1 K high‑quality examples before RL dramatically improves the RL convergence curve. Neither the base model nor an instruct‑tuned model provides an optimal starting point. Using an appropriate cold‑start dataset, a 1.5 B Qwen2.5 model reaches test accuracy >0.8 on the K&K benchmark, surpassing DeepSeek‑V3‑0324 without any curriculum‑learning tricks.
Sample‑Effect Theory
Under a linearized kernel‑method assumption, each training example contributes a computable sample effect , defined as the inner product between the expected gradient of that example (under cross‑entropy loss) and the overall gradient expectation. Experiments reveal that the most effective SFT samples are those with the highest sample effect, not necessarily those exhibiting the deepest reasoning patterns.
Theoretical Rationale for Re‑distillation Efficiency
Re‑distillation can be viewed as SFT on data generated by an RL‑trained policy. Since RL increases the sample effect of its output distribution, the distilled data are intrinsically more efficient for SFT, explaining why fewer than 1 K examples suffice to match RL‑level performance.
RL Exploration Relies on Early‑Token Shaping
During RL the output pattern shifts from the tail forward: later tokens change first, while early tokens—having a large impact on the final answer—remain hard to modify. Small‑scale SFT reshapes the early‑token distribution, providing a far more effective exploration signal than random RL exploration.
Preliminary Exploration: Impact of Small‑Scale SFT on RL
We define small‑scale SFT as fine‑tuning with ≤2 K examples to isolate the effect of the model’s intrinsic ability. Experiments on the K&K dataset using Qwen2.5‑1.5B and the GRPO RL algorithm show:
Only the policy distilled from DeepSeek‑R1 with 1 K correct samples (long‑CoT) reaches 0.8 test accuracy after SFT.
Short‑CoT (program‑generated) performs worse, and the advantage of long‑CoT does not transfer to the MATH dataset, indicating dataset‑specific behavior.
SFT performance does not predict RL performance: all initializations achieve <10% accuracy after SFT on K&K, yet their RL convergence speeds differ markedly.
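The GRPO algorithm used in these experiments scores each sampled response relative to the other responses for the same prompt. A minimal sketch of that group‑relative advantage computation (the full algorithm also includes clipping and a KL penalty, omitted here):

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each reward by the mean and
    std of its group (all sampled responses for one prompt)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Binary 0-1 rewards for four sampled responses to one prompt:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

With binary rewards, correct responses get positive advantage and incorrect ones negative, and the advantages sum to zero within the group.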
Theoretical Analysis
Stochastic Differential Equation (SDE) Model of RL
We model the simplest REINFORCE update with a binary 0‑1 reward as
∇_θ L = (1/N) Σ_i ∇_θ log π_θ(a_i|s_i) · r_i

where N is the batch size, a_i the generated response, s_i the prompt, and r_i the reward. For large N, the central limit theorem yields Gaussian gradient noise, allowing us to write the parameter dynamics as an SDE:

dθ_t = A(θ_t) dt + B(θ_t) dW_t

The drift term A(θ_t) captures the expected test‑accuracy growth rate. Decomposing the drift gives a non‑negative positive effect (growth as N→∞) and a typically negative noise effect (a variance‑induced slowdown). Reducing the learning rate or increasing the buffer size weakens the noise effect, which explains why large learning rates and small batch sizes often cause RL failure.
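The 1/√N shrinkage of gradient noise behind the SDE's diffusion term can be seen on a toy problem. The sketch below is illustrative, not the paper's setup: a two‑action bandit with logits (θ, 0) and a binary reward (action 0 earns r = 1, action 1 earns r = 0).

```python
import math
import random

def reinforce_grad(theta, n, rng):
    """Monte-Carlo REINFORCE estimate (1/N) * sum_i grad log pi(a_i) * r_i
    for a toy two-action bandit with logits (theta, 0)."""
    p0 = 1.0 / (1.0 + math.exp(-theta))  # pi(action 0)
    total = 0.0
    for _ in range(n):
        if rng.random() < p0:  # sampled action 0: grad log pi = 1 - p0, r = 1
            total += 1.0 - p0
        # sampled action 1 has r = 0, so it contributes nothing
    return total / n

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
# Gradient noise shrinks roughly as 1/N -- the CLT step in the SDE model:
var_small_n = variance([reinforce_grad(0.0, 10, rng) for _ in range(300)])
var_large_n = variance([reinforce_grad(0.0, 1000, rng) for _ in range(300)])
```

The estimate at N = 1000 has far lower variance than at N = 10, matching the intuition that a larger buffer weakens the noise effect.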
Sample Effect and Test‑Accuracy Growth
Each training sample s contributes to the drift via an expectation‑weighted inner product, which we define as the sample effect :
e(s) = ⟨E[∇_θ ℓ_ce(s)], E[∇_θ ℓ_ce]⟩

where ℓ_ce is the cross‑entropy loss. The overall test‑accuracy growth rate is approximately the sum of sample effects over the dataset.
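Computationally, the sample effect is just a dot product: each sample's expected gradient is scored against the dataset‑average gradient. A minimal sketch on toy gradient vectors (the values are illustrative, not from the paper):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sample_effects(grads):
    """e(s) = <E[grad l_ce(s)], E[grad l_ce]>: dot product of each sample's
    (expected) gradient with the dataset-average gradient."""
    n, d = len(grads), len(grads[0])
    avg = [sum(g[i] for g in grads) / n for i in range(d)]
    return [dot(g, avg) for g in grads]

# Toy per-sample gradient vectors (illustrative values):
grads = [[1.0, 0.0], [0.8, 0.2], [-0.5, 1.0]]
effects = sample_effects(grads)
best = max(range(len(grads)), key=lambda i: effects[i])
```

Samples whose gradients align with the average direction score highest; a sample pointing away from it (like the third vector here) scores low even if its gradient is large.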
Optimal Target Distribution for SFT Distillation
We consider correctness‑filtered distillation: the target distribution p(a|s) samples from the model's output and discards incorrect answers. Adding a KL‑regularization term, the optimal distribution (derived via DPO‑style reasoning) is

p*(a|s) ∝ exp(β·Z(a,s))

where Z(a,s) depends on sample correctness and β controls the trade‑off. If sample a₁ has a larger sample effect than a₂, a sufficiently small β makes p*(a₁|s) > p*(a₂|s). Thus, training on samples with higher sample effect maximizes SFT efficiency. The KL constraint also implies that, after correctness filtering, the overall accuracy of the target policy does not affect the optimal distribution.
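The exponential form above is a softmax over per‑sample scores. A minimal sketch (the Z values and β are illustrative; the paper's Z additionally encodes correctness):

```python
import math

def optimal_target(zs, beta):
    """p*(a|s) proportional to exp(beta * Z(a, s)): a softmax over
    per-sample scores Z, with beta controlling the KL trade-off."""
    weights = [math.exp(beta * z) for z in zs]
    total = sum(weights)
    return [w / total for w in weights]

# Two correct candidate answers; a1 has the larger score Z:
p = optimal_target([1.5, 0.5], beta=0.1)
# As beta -> 0 the KL term dominates and p* flattens toward uniform:
p_tiny = optimal_target([1.5, 0.5], beta=1e-6)
```

For any β > 0 the higher‑scoring sample receives more mass, while β controls how sharply the target distribution concentrates on it.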
RL Increases Sample Effect
We define the dataset effect as the expected sample effect of a policy over a filtered dataset. Using the same SDE framework, we prove that after the first RL iteration the dataset effect grows at least quadratically with the increase in reward:

Δ(dataset effect) ≥ (Δ reward)²

This growth stems from correctness filtering and applies whenever the reward increases; under the drift‑only approximation, the reward cannot decrease.
Re‑distillation Procedure
The method consists of three steps:
Perform small‑scale SFT (< 2 K examples) on the base model to obtain an SFT‑ed model.
Run RL (GRPO) on the SFT‑ed model to produce a target policy.
Sample a small set of ≤1 K outputs from the RL‑trained policy, then fine‑tune the original base model on this distilled data (identical to ordinary SFT). The resulting model is the re‑distilled model.
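The three steps above can be sketched as a pipeline. The stand‑in functions below are toy placeholders (names are illustrative, not from the paper's code); in practice `sft`, `grpo_train`, and `sample_correct` would be real SFT training, GRPO training, and correctness‑filtered sampling.

```python
# Toy stand-ins so the pipeline runs end to end; in practice these are
# real SFT / GRPO training loops and filtered sampling.
def sft(model, data):
    return {"base": model, "tuned_on": len(data)}

def grpo_train(model, prompts):
    return {"policy_from": model, "rl_prompts": len(prompts)}

def sample_correct(policy, prompts, k):
    return [f"distilled_{i}" for i in range(min(k, len(prompts)))]

def re_distill(base_model, sft_data, prompts, k=1000):
    """Three-step re-distillation sketch."""
    sft_model = sft(base_model, sft_data)              # 1: small-scale SFT
    rl_policy = grpo_train(sft_model, prompts)         # 2: RL (GRPO)
    distilled = sample_correct(rl_policy, prompts, k)  # 3: <=1K filtered outputs
    return sft(base_model, distilled)                  # ordinary SFT on distilled data

model = re_distill("qwen2.5-1.5b", ["ex"] * 1000, ["p"] * 500)
```

The key design point is the final step: the distilled data are used to fine‑tune the original base model, not the RL‑trained one, so the entire RL stage is compressed into one small SFT run.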
Empirically, re‑distillation reaches RL‑level test accuracy on K&K with only 1 K examples, achieving a 5× efficiency boost (80% accuracy within 25 RL steps). On MATH, using 496 distilled examples yields accuracy comparable to instruct‑tuned models and identical reward curves thereafter.
Validation of the Linearization Assumption
To test whether the single‑step linear approximation predicts multi‑step behavior, we compute sample‑effect‑based growth rates for both SFT and RL and compare them with actual training curves. Adjustments include:
Computing sample effect under both SGD and Adam (Adam simulated with batch size 1 for SFT, batch size 20 for RL to match the real 64:1024 ratio).
Incorporating gradient clipping and the GRPO loss.
Using a fixed test‑set gradient direction: the parameter difference between the base model and its checkpoint after 25 RL steps on Qwen2.5‑1.5B.
Evaluating instruct, long‑CoT, and short‑CoT under this fixed direction to avoid self‑bias.
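One way to read the fixed‑direction adjustment: every initialization's per‑sample gradients are projected onto the same direction Δθ = θ_after − θ_before and averaged. A minimal sketch with toy vectors (illustrative; in the paper Δθ comes from 25 RL steps on Qwen2.5‑1.5B):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def growth_estimate(sample_grads, theta_before, theta_after):
    """Estimated growth under a fixed gradient direction: project each
    sample's gradient onto delta = theta_after - theta_before and average,
    so all initializations are scored against the same direction."""
    delta = [a - b for a, b in zip(theta_after, theta_before)]
    return sum(dot(g, delta) for g in sample_grads) / len(sample_grads)

# Toy per-sample gradients and parameter checkpoints:
est = growth_estimate([[0.2, 0.1], [0.4, -0.1]], [0.0, 0.0], [1.0, 1.0])
```

Fixing Δθ once and reusing it across all candidate initializations is what avoids the self‑bias mentioned above.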
The left panel (SFT) shows that re‑distillation achieves the highest estimated growth, matching empirical SFT results. The right panel (RL) shows the ordering long‑CoT > short‑CoT > re‑distill, consistent with observed reward curves. Overall, the linearized sample‑effect estimates align with real training trends, supporting the theoretical framework.
Exploration Dilemma: Why SFT Improves Long‑Term Exploration
We analyze token‑level log‑probability quantiles in the RL replay buffer of long‑CoT‑math. For each token position we compute the 1 % quantile of log‑probabilities under the initial policy. Over RL steps the quantile curve shifts forward (earlier positions improve), indicating that RL first modifies later tokens and only later affects early tokens.
In contrast, SFT directly boosts early‑token probabilities. This complementary behavior demonstrates that, when prior knowledge about desirable exploration patterns exists, modifying early‑token distributions via small‑scale SFT is far more effective than relying on RL’s stochastic exploration.
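The quantile diagnostic above can be sketched directly. The toy buffer below is illustrative; a real analysis would use token log‑probabilities collected from the RL replay buffer under the initial policy.

```python
def position_quantile(logprobs_by_seq, q=0.01):
    """For each token position, the q-quantile of log-probs across all
    replay-buffer sequences long enough to reach that position."""
    max_len = max(len(s) for s in logprobs_by_seq)
    out = []
    for pos in range(max_len):
        vals = sorted(s[pos] for s in logprobs_by_seq if len(s) > pos)
        idx = min(int(q * len(vals)), len(vals) - 1)
        out.append(vals[idx])
    return out

# Toy buffer: early tokens confident, later tokens spread out.
buf = [[-0.1, -0.5, -3.0], [-0.2, -0.4, -5.0], [-0.1, -0.6]]
curve = position_quantile(buf, q=0.01)
```

Tracking how this curve shifts across RL steps reveals the tail‑first pattern: late positions tighten up long before early ones do.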
Discussion
The study confirms three key points: (1) RL is highly compressible, allowing small‑scale SFT to replace extensive RL sampling; (2) a sample‑effect metric predicts which SFT examples are most beneficial; (3) re‑distillation can reproduce RL‑level performance with <1 K examples, yielding a 5× efficiency gain on the K&K benchmark. Remaining open questions include refining the linearization hypothesis under distribution shift, improving early‑checkpoint filtering to reduce the initial RL cost, and exploring the impact of larger‑scale SFT.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
