Why Base‑Model RL Beats Traditional SFT‑RL: Theory, Practice, and Zero‑RL Insights

The article analyzes why applying reinforcement learning directly to base LLMs can outperform conventional cold‑start SFT‑RL pipelines, laying out the theoretical background, practical guidance, and experimental evidence, and also explores zero‑RL approaches, KL constraints, and scaling considerations.

Theoretical Background

Combining a policy‑gradient objective with a KL constraint can be reformulated as a residual energy‑based model. This reformulation turns the problem into one of efficiently sampling from an optimal target distribution.
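As a sketch of this reformulation (standard KL‑constrained RL algebra; the symbols below are generic and not taken from the original article): maximizing reward under a KL constraint to the base model has a closed‑form optimum that is exactly a residual energy‑based model, the base distribution reweighted by an exponential reward term.

\max_{\pi}\; \mathbb{E}_{y\sim\pi(\cdot\mid x)}\!\left[r(x,y)\right] \;-\; \beta\,\mathrm{KL}\!\left(\pi(\cdot\mid x)\,\|\,\pi_{\text{base}}(\cdot\mid x)\right)
\;\;\Longrightarrow\;\;
\pi^{*}(y\mid x) \;\propto\; \pi_{\text{base}}(y\mid x)\,\exp\!\left(r(x,y)/\beta\right)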

Two practical formulations arise:

Treat the problem as reinforcement learning (RL): use a parameterized policy to approximate the optimal distribution.

Treat the problem as a sampling task: employ advanced MCMC methods to draw directly from the optimal distribution.
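As an illustration of the sampling view, here is a minimal, hypothetical Python sketch (not the article's method): self‑normalized importance sampling that draws candidates from the base model and reweights them by exp(r/beta), which approximates the optimal residual‑EBM distribution as the number of candidates grows.

import math
import random

def sample_from_optimal(prompt, base_sample, reward_fn, beta=1.0, n=16):
    # base_sample and reward_fn are hypothetical callables: base_sample draws one
    # response from the base model; reward_fn scores a (prompt, response) pair.
    candidates = [base_sample(prompt) for _ in range(n)]
    # Importance weights w_i proportional to exp(r(x, y_i) / beta); the base-model
    # density cancels because the proposal distribution is the base model itself.
    logits = [reward_fn(prompt, y) / beta for y in candidates]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]  # subtract max for numerical stability
    total = sum(weights)
    probs = [w / total for w in weights]
    # Resample one candidate according to the self-normalized weights.
    return random.choices(candidates, weights=probs, k=1)[0]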

Base‑Model RL (Base‑RL)

Base‑RL fits a parametric distribution to the optimal distribution while keeping the base model’s response characteristics unchanged. The optimal distribution’s samples reveal patterns that can be used to:

Identify coverage gaps in the base model’s data distribution.

Correct stubborn behaviours (e.g., limited reasoning styles).

After Base‑RL, sampling from the learned optimal distribution should produce responses that are tightly aligned with the base model, providing a better starting point for subsequent instruction‑tuning.

Standard RL algorithms (PPO, SAC, REINFORCE, etc.) can be applied, but LLM‑RL behaves more like a bandit problem because each generation yields a single reward without multi‑step environment interaction.
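A minimal sketch of the bandit view (hypothetical PyTorch‑style code; policy, sample_response, and reward_fn are assumed helpers, not from the article): each full generated response is a single action, and its summed token log‑probability is reinforced by one scalar reward.

import torch

def reinforce_step(policy, optimizer, prompts, sample_response, reward_fn, baseline=0.0):
    losses = []
    for prompt in prompts:
        # One "action" = one complete response; there are no intermediate environment steps.
        response, logprob = sample_response(policy, prompt)  # logprob: sum of token log-probs (tensor)
        reward = reward_fn(prompt, response)                 # single scalar reward per generation
        advantage = reward - baseline                        # simple baseline-subtracted advantage
        losses.append(-advantage * logprob)                  # REINFORCE: ascend E[advantage * log pi]
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()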

Zero‑RL on the Base Model

Open‑source projects such as simple‑reason‑rl and tiny‑zero demonstrate that zero‑RL is feasible, yet many classic RL tricks must be re‑evaluated for LLMs with verified reward models.

Key reproducibility criteria:

Training must remain stable for thousands of steps (toy examples with only a few steps are insufficient).

The reward should increase monotonically along with response length; otherwise the run collapses into short‑CoT‑style RL.

Performance should match the DeepSeek‑R1 report’s Qwen‑2.5‑32B‑zero baseline when using a 32B model.

Experimental Findings (7B‑32B Models)

Different RL algorithms (PPO, SAC, REINFORCE, etc.) show only minor performance differences; the learning rate and warm‑up schedule have limited impact.

Reward and response‑length growth are highly sensitive to the prompt template.

Adding a KL‑constraint often causes early saturation and limits response‑length growth.

The simplest methods (REINFORCE without KL, PPO without KL) tend to be the most effective.

Empirical curves show that REINFORCE without KL yields stable, continuous reward and length growth, whereas KL‑constrained variants saturate quickly.

Stability Enhancements

When reward declines on the training set, the following techniques can improve stability:

Policy EMA (exponential moving average of the policy parameters; see the sketch after this list).

Reference EMA (EMA of a reference policy used for advantage calculation).

Alternative advantage estimators.
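A minimal sketch of the EMA idea (hypothetical; the decay value is an assumption): the same update can maintain either a smoothed copy of the policy (policy EMA) or a slowly moving reference model used in advantage or KL computations (reference EMA).

import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    # Exponential moving average: ema <- decay * ema + (1 - decay) * current weights.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage sketch: keep a frozen copy and refresh it after every optimizer step.
# ema_policy = copy.deepcopy(policy).eval()
# ema_update(ema_policy, policy, decay=0.999)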

Training Configuration Details

Dataset: math prompts with difficulty 6‑9.

Reward schema:

correct answer                   ->  +1.0
incorrect format                 ->  -1.0
answer wrong & format correct    ->   0 or -0.5
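A minimal sketch of the reward rule above (check_format and check_answer are hypothetical helpers; whether a well‑formatted wrong answer gets 0 or -0.5 is a configuration choice):

def compute_reward(response, reference_answer, check_format, check_answer,
                   wrong_but_formatted=-0.5):
    # Formatting failures receive the hardest penalty.
    if not check_format(response):
        return -1.0
    # Correct answer in the correct format gets the full reward.
    if check_answer(response, reference_answer):
        return 1.0
    # Well-formatted but wrong answers receive 0 or a mild penalty.
    return wrong_but_formatted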

When the KL coefficient is set to zero (e.g., init_kl_coef=0 in OpenRLHF), the training curve exhibits stable, simultaneous growth of reward and response length.

Conversely, enabling a KL term leads to rapid saturation and limited length increase.
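A minimal sketch of how the KL coefficient enters the shaped reward (generic RLHF‑style bookkeeping, not OpenRLHF’s internal code): with the coefficient at zero the penalty vanishes and the policy optimizes the raw verified reward alone, while a positive coefficient pulls the policy back toward the reference model, consistent with the early saturation described above.

def shaped_reward(reward, logprob_policy, logprob_ref, kl_coef=0.0):
    # Per-sample KL approximation: log pi(y|x) - log pi_ref(y|x).
    kl = logprob_policy - logprob_ref
    # kl_coef = 0 recovers the pure verified reward; kl_coef > 0 penalizes
    # drifting away from the reference model.
    return reward - kl_coef * kl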

Practical Recommendations

Base‑RL should be viewed as fitting a shaped distribution; analysing the sampled patterns can guide data‑level improvements for the base model.

After Base‑RL, fine‑tuning on the same dataset with the parameterized optimal distribution is expected to yield a stronger instruction‑tuned model.

Prefer simple RL objectives (e.g., REINFORCE or PPO without KL) over more complex KL‑regularized variants.

Carefully design prompt templates; inappropriate templates can force the model into an undesired “instruct‑style” regime (an example template is sketched after this list).

Monitor reward and response‑length jointly; if reward drops, consider EMA‑based smoothing or revised advantage calculations.
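As an illustration of template sensitivity, here is a paraphrased zero‑RL template in the DeepSeek‑R1‑Zero style (the exact wording is an assumption, not taken from the article); templates that instead mimic instruct‑model chat formats tend to push the policy into the “instruct‑style” regime mentioned above.

A conversation between User and Assistant. The Assistant first thinks through the
reasoning process in its mind and then provides the final answer. The reasoning and
the answer are enclosed in <think> ... </think> and <answer> ... </answer> tags.
User: {question}
Assistant: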

Broader Outlook

Beyond RL, energy‑based model optimization and advanced sampling techniques can be explored on platforms such as OpenRLHF or VERL to develop novel LLM‑driven agents.
