Can 1K Fine‑Tuning Replace 100K RL Steps? Insights from Re‑distillation Research

An extensive analysis shows that a 1K‑sample fine‑tuning stage can replicate the generalization gains of thousands of reinforcement‑learning steps, explains the compressibility of RL, introduces a sample‑effect theory, and demonstrates that re‑distillation and small‑scale SFT dramatically improve LLM performance.


Paper and Reproduction Code

Paper: Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning
Abs: https://arxiv.org/abs/2505.17988
GitHub: https://github.com/on1262/deep-reasoning

Main Conclusions

RL Process Is Compressible

Re‑distillation (re‑distilled SFT) can achieve the same generalization performance as R1‑style reinforcement learning (RL) using only 1 K SFT samples, whereas the original RL requires >100 K sampling steps. This shows that the RL advantage is not inherent; the same performance can be obtained with a much smaller data budget.

Small‑Scale SFT as a Crucial Cold‑Start

Performing SFT with 1 K high‑quality examples before RL dramatically improves the RL convergence curve. Neither the base model nor an instruct‑tuned model provides an optimal starting point. Using an appropriate cold‑start dataset, a 1.5 B Qwen2.5 model reaches test accuracy >0.8 on the K&K benchmark, surpassing DeepSeek‑V3‑0324 without any curriculum‑learning tricks.

Sample‑Effect Theory

Under a linearized kernel‑method assumption, each training example contributes a computable sample effect, defined as the inner product between the expected gradient of that example (under cross‑entropy loss) and the overall gradient expectation. Experiments reveal that the most effective SFT samples are those with the highest sample effect, not necessarily those exhibiting the deepest reasoning patterns.

Theoretical Rationale for Re‑distillation Efficiency

Re‑distillation can be viewed as SFT on data generated by an RL‑trained policy. Since RL increases the sample effect of its output distribution, the distilled data are intrinsically more efficient for SFT, explaining why fewer than 1 K examples suffice to match RL‑level performance.

RL Exploration Relies on Early‑Token Shaping

During RL the output pattern shifts from the tail forward: later tokens change first, while early tokens—having a large impact on the final answer—remain hard to modify. Small‑scale SFT reshapes the early‑token distribution, providing a far more effective exploration signal than random RL exploration.

[Figure: experimental overview]

Preliminary Exploration: Impact of Small‑Scale SFT on RL

We define small‑scale SFT as fine‑tuning with ≤2 K examples, a budget small enough to isolate the contribution of the model’s intrinsic ability. Experiments on the K&K dataset using Qwen2.5‑1.5B and the GRPO RL algorithm show:

Only the policy distilled from DeepSeek‑R1 with 1 K correct samples (long‑CoT) reaches 0.8 test accuracy after the subsequent RL stage.

Short‑CoT (program‑generated) performs worse, and the advantage of long‑CoT does not transfer to the MATH dataset, indicating dataset‑specific behavior.

SFT performance does not predict RL performance: all initializations achieve <10% accuracy after SFT on K&K, yet their RL convergence speeds differ markedly.

Theoretical Analysis

Stochastic Differential Equation (SDE) Model of RL

We model the simplest REINFORCE update with a binary 0‑1 reward as

∇θ J ≈ (1/N) Σ_i ∇θ log πθ(a_i|s_i)·r_i

where N is the batch size, a_i the generated response, s_i the prompt, and r_i the reward. For large N, the central limit theorem gives approximately Gaussian gradient noise, allowing us to write the parameter dynamics as an SDE:

dθ_t = A(θ_t) dt + B(θ_t) dW_t

The drift term A(θ_t) captures the expected test‑accuracy growth rate. Decomposing the drift gives a non‑negative positive effect (growth as N→∞) and a typically negative noise effect (variance‑induced slowdown). Reducing the learning rate or increasing the buffer size weakens the noise effect, which explains why large learning rates and small batch sizes often cause RL failure.
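To make the drift/noise split concrete, here is a minimal toy sketch (a hypothetical one‑step bandit policy, not the paper's code) of the REINFORCE estimate above; repeating the estimate shows the mean (drift) staying fixed while the spread (noise) shrinks roughly as 1/√N.

```python
# Toy illustration of the REINFORCE gradient estimate and its batch-size-dependent noise.
# Hypothetical one-step bandit policy; not the paper's implementation.
import torch

torch.manual_seed(0)
logits = torch.zeros(4, requires_grad=True)   # policy parameters theta
correct_action = 2                            # binary reward: 1 iff this action is chosen

def reinforce_grad(batch_size: int) -> torch.Tensor:
    """One stochastic estimate of (1/N) * sum_i grad log pi(a_i|s) * r_i."""
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample((batch_size,))
    rewards = (actions == correct_action).float()
    surrogate = (dist.log_prob(actions) * rewards).mean()
    (grad,) = torch.autograd.grad(surrogate, logits)
    return grad

# The mean over repeated estimates approximates the drift A(theta); the spread is the
# gradient noise feeding the B(theta) dW_t term and shrinks roughly as 1/sqrt(N).
for n in (8, 64, 512):
    grads = torch.stack([reinforce_grad(n) for _ in range(200)])
    print(f"N={n:4d}  drift norm ≈ {grads.mean(0).norm().item():.4f}  "
          f"noise std ≈ {grads.std(0).norm().item():.4f}")
```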

Sample Effect and Test‑Accuracy Growth

Each training sample s contributes to the drift via an expectation‑weighted inner product, which we define as the sample effect:

e(s) = ⟨E[∇θ ℓ_ce(s)], E[∇θ ℓ_ce]⟩

where ℓ_ce is the cross‑entropy loss. The overall test‑accuracy growth rate is approximately the sum of sample effects over the dataset.
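As a concrete, hedged sketch of how this could be computed in practice, the snippet below estimates e(s) for each example as the inner product between its cross‑entropy gradient and the mean gradient over the dataset. It assumes a Hugging Face‑style causal LM whose forward pass returns a .loss; for a 1.5 B‑parameter model one would in practice project gradients or restrict to a parameter subset.

```python
# Sketch: estimating e(s) = <E[grad l_ce(s)], E[grad l_ce]> per training example.
# Assumes `model` is a Hugging Face-style causal LM and each `batch` is one tokenized example;
# for real 1.5B-parameter models, project gradients or use a parameter subset to save memory.
import torch

def flat_grad(model, batch) -> torch.Tensor:
    """Flattened cross-entropy gradient for one example."""
    model.zero_grad(set_to_none=True)
    loss = model(**batch, labels=batch["input_ids"]).loss
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
    return torch.cat([g.reshape(-1) for g in grads])

def sample_effects(model, examples) -> list[float]:
    per_example = [flat_grad(model, ex) for ex in examples]       # per-sample gradients
    mean_grad = torch.stack(per_example).mean(dim=0)              # dataset-level gradient expectation
    return [torch.dot(g, mean_grad).item() for g in per_example]  # inner products e(s)

# Ranking examples by sample_effects() and keeping the top ones is the selection rule the
# theory predicts to maximize test-accuracy growth under the linearized approximation.
```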

Optimal Target Distribution for SFT Distillation

We consider correctness‑filtered distillation: the target distribution p(a|s) samples from the model’s output and discards incorrect answers. Adding a KL‑regularization term, the optimal distribution (derived via DPO‑style reasoning) is:

p*(a|s) ∝ exp(β·Z(a,s))

where Z(a,s) depends on sample correctness and β controls the trade‑off. If sample a₁ has a larger sample effect than a₂, a sufficiently small β makes p*(a₁|s) > p*(a₂|s). Thus, training on samples with higher sample effect maximizes SFT efficiency. The KL constraint also implies that, after correctness filtering, the overall accuracy of the target policy does not affect the optimal distribution.
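A tiny numeric sketch of this reweighting (the Z scores and correctness flags below are made up purely for illustration): incorrect completions are dropped and the remaining ones are weighted by exp(β·Z).

```python
# Toy sketch of correctness-filtered, KL-regularized distillation targets:
# p*(a|s) ∝ exp(beta * Z(a, s)) over the correct completions only.
# The Z scores and correctness flags are made-up values for illustration.
import numpy as np

correct = np.array([True, True, False, True])
Z       = np.array([0.9, 0.4, 0.7, 0.1])      # hypothetical per-completion scores

def target_distribution(beta: float) -> np.ndarray:
    weights = np.where(correct, np.exp(beta * Z), 0.0)   # incorrect answers get zero mass
    return weights / weights.sum()

for beta in (0.1, 1.0, 10.0):
    print(beta, target_distribution(beta).round(3))
# Filtering removes incorrect completions entirely, so the base policy's overall accuracy
# does not change which of the remaining completions are preferred.
```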

RL Increases Sample Effect

We define the dataset effect as the expected sample effect of a policy over a correctness‑filtered dataset. Using the same SDE framework, we prove that after the first RL iteration the dataset effect grows at least quadratically with the increase in reward:

Δ(dataset effect) ≥ (Δ reward)²

This growth stems from correctness filtering, and the bound holds even if the reward were to decrease (which cannot happen under the drift‑only approximation).
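In code terms (a minimal sketch, reusing the per‑sample effects from the earlier snippet and a binary answer check), the dataset effect is simply the mean sample effect over the completions that pass the correctness filter.

```python
# Sketch of the dataset effect: mean sample effect over correctness-filtered completions.
# `effects` could come from sample_effects() above; `rewards` from a binary answer check.
def dataset_effect(effects: list[float], rewards: list[int]) -> float:
    kept = [e for e, r in zip(effects, rewards) if r == 1]   # keep only correct completions
    return sum(kept) / len(kept) if kept else 0.0

# The bound above says one RL iteration raises this quantity by at least (delta reward)^2,
# which is why data distilled from the RL policy is unusually efficient for SFT.
```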

Re‑distillation Procedure

The method consists of three steps (a schematic sketch follows the list):

Perform small‑scale SFT (< 2 K examples) on the base model to obtain an SFT‑ed model.

Run RL (GRPO) on the SFT‑ed model to produce a target policy.

Sample a small set of ≤1 K outputs from the RL‑trained policy, then fine‑tune the original base model on this distilled data (identical to ordinary SFT). The resulting model is the re‑distilled model.
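The schematic below strings the three steps together. Every function body is a placeholder standing in for the corresponding stage of the paper's released code, and the names are illustrative only.

```python
# Schematic of the re-distillation pipeline. All bodies are placeholders; the actual
# training code is in the paper's repository (github.com/on1262/deep-reasoning).

def small_scale_sft(base_model, examples):
    """Steps 1 and 3b: supervised fine-tuning on a small (<=2K) set of examples."""
    raise NotImplementedError  # standard SFT loop

def run_grpo(sft_model, prompts, reward_fn):
    """Step 2: R1-style GRPO training starting from the SFT-ed model."""
    raise NotImplementedError  # RL loop with a binary correctness reward

def distill(rl_policy, prompts, k=1000):
    """Step 3a: sample <=1K completions from the RL policy and keep only correct ones."""
    raise NotImplementedError  # generation + answer checking

def redistill(base_model, cold_start_data, prompts, reward_fn):
    sft_model = small_scale_sft(base_model, cold_start_data)   # step 1
    rl_policy = run_grpo(sft_model, prompts, reward_fn)        # step 2
    distilled = distill(rl_policy, prompts, k=1000)            # step 3a
    # Step 3b: fine-tune the ORIGINAL base model (not the RL checkpoint) on the
    # distilled data; the result is the re-distilled model.
    return small_scale_sft(base_model, distilled)
```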

Empirically, re‑distillation reaches RL‑level test accuracy on K&K with only 1 K examples, achieving a 5× efficiency boost (80% accuracy within 25 RL steps). On MATH, using 496 distilled examples yields accuracy comparable to instruct‑tuned models and identical reward curves thereafter.

[Figure: re‑distillation results]

Validation of the Linearization Assumption

To test whether the single‑step linear approximation predicts multi‑step behavior, we compute sample‑effect‑based growth rates for both SFT and RL and compare them with actual training curves. Adjustments include:

Computing sample effect under both SGD and Adam (Adam simulated with batch size 1 for SFT, batch size 20 for RL to match the real 64:1024 ratio).

Incorporating gradient clipping and the GRPO loss.

Using a fixed test‑set gradient direction: the parameter difference between the base model and its checkpoint after 25 RL steps on Qwen2.5‑1.5B (see the sketch after this list).

Evaluating instruct, long‑CoT, and short‑CoT under this fixed direction to avoid self‑bias.
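A hedged sketch of that fixed‑direction estimate: each candidate dataset's per‑sample gradients are projected onto the normalized parameter difference between the base model and its 25‑step RL checkpoint, and the mean projection serves as the estimated growth rate. The tensors and shapes below are toy stand‑ins so the sketch runs.

```python
# Sketch of the fixed-direction sample-effect estimate used in the linearization check.
# `theta_base` / `theta_rl25` stand for flattened parameter vectors of the base model and
# its 25-step RL checkpoint; `per_sample_grads` for flattened per-example CE gradients.
import torch

def fixed_direction_effects(per_sample_grads: torch.Tensor,
                            theta_base: torch.Tensor,
                            theta_rl25: torch.Tensor) -> torch.Tensor:
    """Project each sample's gradient onto the fixed parameter-difference direction."""
    direction = theta_rl25 - theta_base
    direction = direction / direction.norm()
    return per_sample_grads @ direction          # one scalar effect per sample

# Toy shapes purely so the sketch runs:
g = torch.randn(100, 512)                        # 100 samples, 512 "parameters"
effects = fixed_direction_effects(g, torch.zeros(512), torch.randn(512))
print(effects.mean().item())                     # estimated growth rate for this dataset
```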

[Figure: sample‑effect estimates vs. actual growth curves]

The left panel (SFT) shows that re‑distillation achieves the highest estimated growth, matching empirical SFT results. The right panel (RL) shows the ordering long‑CoT > short‑CoT > re‑distill, consistent with observed reward curves. Overall, the linearized sample‑effect estimates align with real training trends, supporting the theoretical framework.

Exploration Dilemma: Why SFT Improves Long‑Term Exploration

We analyze token‑level log‑probability quantiles in the RL replay buffer of long‑CoT‑math. For each token position we compute the 1 % quantile of log‑probabilities under the initial policy. Over RL steps the quantile curve shifts forward (earlier positions improve), indicating that RL first modifies later tokens and only later affects early tokens.
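A small sketch of this measurement (the log‑probability array below is a random stand‑in for real replay‑buffer statistics): for every token position, take the 1 % quantile of log‑probabilities under the initial policy, then overlay the curves obtained from successive RL steps.

```python
# Sketch of the token-position quantile analysis. `logprobs` stands for the initial
# policy's log-probabilities of replay-buffer tokens, shape (num_sequences, seq_len);
# random values are used here only so the sketch runs.
import numpy as np

rng = np.random.default_rng(0)
logprobs = rng.normal(loc=-1.0, scale=1.0, size=(2048, 256))

# 1% quantile of log-probability at each token position: low values mark positions where
# the buffer contains tokens the initial policy considered very unlikely.
q01 = np.quantile(logprobs, 0.01, axis=0)
print(q01[:8])

# Overlaying these curves for buffers from successive RL steps shows the change starting
# at late positions and reaching early positions only much later, i.e. RL reshapes the
# tail of the response first while early tokens stay close to the initial policy.
```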

[Figure: token‑position log‑probability quantile shift]

In contrast, SFT directly boosts early‑token probabilities. This complementary behavior demonstrates that, when prior knowledge about desirable exploration patterns exists, modifying early‑token distributions via small‑scale SFT is far more effective than relying on RL’s stochastic exploration.

Discussion

The study confirms three key points: (1) RL is highly compressible, allowing small‑scale SFT to replace extensive RL sampling; (2) a sample‑effect metric predicts which SFT examples are most beneficial; (3) re‑distillation can reproduce RL‑level performance with <1 K examples, yielding a 5× efficiency gain on the K&K benchmark. Remaining open questions include refining the linearization hypothesis under distribution shift, improving early‑checkpoint filtering to reduce the initial RL cost, and exploring the impact of larger‑scale SFT.

Tags: large language models, reinforcement learning, re‑distillation, sample effect, theoretical analysis
Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
