Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive

This article analyzes why optimizing sequence‑level rewards for LLMs with token‑level surrogate objectives can improve reinforcement‑learning stability, explains the theoretical conditions required, introduces Routing Replay for MoE models, and presents extensive experiments validating the approach.

AI Frontier Lectures

Background

Reinforcement learning (RL) is increasingly used to improve the reasoning and problem‑solving abilities of large language models (LLMs). Because language is highly contextual, RL for LLMs typically relies on a sequence‑level reward that assigns a scalar score to an entire generated response.

Problem

Most mainstream RL algorithms (e.g., REINFORCE, GRPO) optimize a token‑level objective while the reward is defined at the sequence level. This mismatch raises concerns about theoretical soundness and training stability, especially for Mixture‑of‑Experts (MoE) models where dynamic expert routing further complicates token‑level importance sampling.

Key Insight

The Alibaba Qwen team proposes that the expectation of a sequence‑level reward can be approximated by a surrogate token‑level objective, provided two sources of bias are sufficiently small:

Numerical discrepancy between training and inference engines.

Distribution shift between the rollout policy used for sampling and the target policy being optimized.

Under these conditions, importance‑sampling (IS) weights appear naturally in the surrogate objective, clipping bounds how far the policy can move in a single update, and, for MoE models, a technique called Routing Replay holds expert routing fixed during optimization.
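The interplay between IS weights and clipping can be made concrete with a short sketch. This is a generic PPO-style clipped token-level loss, not the team's code; the name clipped_is_loss and the eps default are illustrative:

```python
import numpy as np

def clipped_is_loss(logp_new, logp_old, reward, eps=0.2):
    """PPO-style clipped surrogate over token-level IS weights.

    logp_new: per-token log-probs of the sampled tokens under pi_theta.
    logp_old: log-probs of the same tokens under the rollout policy mu.
    reward:   one scalar sequence-level reward, shared by every token.
    """
    w = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # token-level IS weights
    unclipped = w * reward
    clipped = np.clip(w, 1.0 - eps, 1.0 + eps) * reward
    # Pessimistic minimum, as in PPO: tokens whose weight drifted outside
    # [1 - eps, 1 + eps] earn no extra credit for moving further from mu.
    return -np.minimum(unclipped, clipped).sum()
```

When the two policies agree (w = 1 at every token), the loss reduces to plain REINFORCE on the sequence reward; clipping only engages once the policies diverge.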

Mathematical Formulation

Let the autoregressive LLM be a policy π_θ. For a prompt x and a response y = (y_1, …, y_T), the likelihood factorizes token by token:

π_θ(y | x) = ∏_{t=1}^{T} π_θ(y_t | x, y_{<t})
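In code, the sequence log-likelihood is just a sum of per-token log-probabilities. A minimal sketch with made-up values (sequence_logprob is an illustrative name, not an API from the paper):

```python
import numpy as np

def sequence_logprob(token_probs):
    """log pi_theta(y | x) = sum_t log pi_theta(y_t | x, y_<t)."""
    return float(np.sum(np.log(token_probs)))

# A 3-token response whose chosen tokens had probabilities 0.5, 0.4, 0.25:
# the sequence probability is 0.5 * 0.4 * 0.25 = 0.05.
logp = sequence_logprob([0.5, 0.4, 0.25])
```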

The training objective is the expected sequence‑level reward R(x, y):

J(θ) = E_{x∼D, y∼π_θ(·|x)}[ R(x, y) ]

In practice, responses are sampled by a rollout policy μ_{θ_old} running in the inference engine, while gradients are computed under π_θ in the training engine. Using importance sampling to bridge the two, the surrogate token‑level objective becomes

J̃(θ) = E_{x∼D, y∼μ_{θ_old}(·|x)}[ Σ_{t=1}^{T} w_t · R(x, y) ],  where  w_t = π_θ(y_t | x, y_{<t}) / μ_{θ_old}(y_t | x, y_{<t})

Its gradient is

∇_θ J̃(θ) = E_{x∼D, y∼μ_{θ_old}(·|x)}[ Σ_{t=1}^{T} (π_θ(y_t | x, y_{<t}) / μ_{θ_old}(y_t | x, y_{<t})) · ∇_θ log π_θ(y_t | x, y_{<t}) · R(x, y) ]

The gradient matches the basic REINFORCE update with token‑level IS weights, confirming that the surrogate is a first‑order approximation of the true sequence‑level objective.
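This relationship can be sanity-checked numerically. The toy below is my own construction, not the paper's setup: each step is modeled as an independent categorical over V tokens, the analytic IS-weighted REINFORCE gradient is computed in closed form, and it is compared against finite differences of the surrogate objective itself:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy problem: one response of T tokens; each step is an independent
# categorical over V tokens parameterized by its own row of logits.
rng = np.random.default_rng(0)
T, V = 4, 3
logits = rng.normal(size=(T, V))                                # theta (trainable)
mu = np.array([softmax(rng.normal(size=V)) for _ in range(T)])  # rollout policy
y = rng.integers(0, V, size=T)                                  # sampled token ids
R = 1.0                                                         # sequence-level reward

def surrogate(lg):
    """J~(theta) = sum_t [pi_theta(y_t) / mu(y_t)] * R for one sample."""
    pi = np.array([softmax(lg[t]) for t in range(T)])
    w = pi[np.arange(T), y] / mu[np.arange(T), y]
    return float(w.sum() * R)

# Analytic gradient: sum_t w_t * grad log pi_theta(y_t) * R, using the
# identity d/dz log softmax(z)[y] = onehot(y) - softmax(z).
pi = np.array([softmax(logits[t]) for t in range(T)])
w = pi[np.arange(T), y] / mu[np.arange(T), y]
grad = np.stack([w[t] * (np.eye(V)[y[t]] - pi[t]) * R for t in range(T)])

# Central finite differences of the surrogate objective.
eps = 1e-6
num = np.zeros_like(logits)
for t in range(T):
    for v in range(V):
        lp, lm = logits.copy(), logits.copy()
        lp[t, v] += eps
        lm[t, v] -= eps
        num[t, v] = (surrogate(lp) - surrogate(lm)) / (2 * eps)
```

The two gradients agree to numerical precision, which is exactly the claim: differentiating the IS-weighted surrogate reproduces the IS-weighted REINFORCE update.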

Conditions for the Approximation

The first‑order approximation holds when the target policy π_θ and the rollout policy μ_{θ_old} are sufficiently close. The token‑level IS weight factors into one ratio per bias source:

w_t = [ π_θ(y_t | x, y_{<t}) / π_{θ_old}(y_t | x, y_{<t}) ] · [ π_{θ_old}(y_t | x, y_{<t}) / μ_{θ_old}(y_t | x, y_{<t}) ]

The first factor captures the distribution shift between the current and rollout‑time parameters (the part clipping controls); the second captures the numerical discrepancy between the training and inference engines evaluating the same parameters.
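Numerically, the factorization is a telescoping product: the intermediate probability cancels, so the two factors always multiply back to the direct ratio. The probabilities below are made-up illustrative values:

```python
# Made-up per-token probabilities for one sampled token y_t:
p_theta     = 0.42  # pi_theta(y_t | ...):     current target policy (training engine)
p_theta_old = 0.40  # pi_theta_old(y_t | ...): stale policy at rollout time (training engine)
p_mu        = 0.39  # mu_theta_old(y_t | ...): rollout policy in the inference engine

w_direct    = p_theta / p_mu            # full token-level IS weight
staleness   = p_theta / p_theta_old     # policy shift, bounded by clipping
discrepancy = p_theta_old / p_mu        # training-inference numerical gap
# staleness * discrepancy telescopes back to w_direct exactly.
```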

Challenges in MoE Models and Routing Replay

In MoE models, expert routing changes dynamically for each token, breaking the closeness assumption. The token‑level IS weight for MoE becomes

w_t = π_θ(y_t | x, y_{<t}; E_t) / μ_{θ_old}(y_t | x, y_{<t}; E′_t)

where E_t and E′_t denote the sets of experts activated for token t by the training and inference engines, respectively. Even with nearly identical parameters, a different top‑k selection (E_t ≠ E′_t) can make the two per‑token distributions, and hence w_t, diverge sharply.

To restore the approximation, the authors introduce Routing Replay, which fixes the expert routing during policy updates. Two variants are described:

Vanilla Routing Replay (R2): Replays the exact routing decisions made by the rollout policy during gradient computation.

Rollout Routing Replay (R3): Aligns the routing used in the training engine with that of the inference engine, reducing both the training‑inference discrepancy and policy staleness.

Figure: Vanilla Routing Replay (R2) architecture.
Figure: Rollout Routing Replay (R3) architecture.
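The mechanism can be sketched with a toy top-k router. Everything here, from the function topk_route to the logit values, is an illustrative assumption rather than the paper's implementation; the point is only that replaying the rollout-time expert indices keeps training and rollout on the same experts even after the router's logits have moved:

```python
import numpy as np

def topk_route(router_logits, k=2, replay_idx=None):
    """Top-k MoE routing for one token, with optional replay.

    router_logits: (num_experts,) router scores. When replay_idx is given
    (the expert indices recorded at rollout time), those experts are reused
    and only their mixture weights are recomputed from the current logits.
    """
    if replay_idx is None:
        idx = np.argsort(router_logits)[-k:]  # fresh top-k selection
    else:
        idx = np.asarray(replay_idx)          # replayed selection
    z = router_logits[idx]
    gates = np.exp(z - z.max())
    gates /= gates.sum()                      # softmax over the chosen experts
    return idx, gates

# Rollout: record which experts fired for this token.
rollout_logits = np.array([0.1, 2.0, -0.5, 1.5])
idx_rollout, _ = topk_route(rollout_logits)   # picks experts {1, 3}

# Training step: the router's parameters have moved, so a fresh top-k
# would now pick a different expert set...
train_logits = np.array([1.7, 1.4, -0.5, 1.6])
fresh_idx, _ = topk_route(train_logits)       # picks experts {0, 3}

# ...but replaying the recorded indices keeps both engines on the
# same experts, preserving the closeness assumption behind the IS weight.
idx_replay, gates = topk_route(train_logits, replay_idx=idx_rollout)
```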

Experimental Setup

The team fine‑tuned a 30‑billion‑parameter MoE model (derived from Qwen3‑30B‑A3B‑Base) using the proposed RL formulation. Training used BF16 precision, while inference employed FP8, creating a severe training‑inference gap. A dataset of 4,096 verified math problems served as prompts, and rewards were binary (correct/incorrect). Evaluation was performed on HMMT25, AIME25, and AIME24 benchmarks, sampling 32 responses per problem. The authors also monitored token‑level entropy of the target policy and KL divergence between rollout policies in training and inference.
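The two monitored diagnostics are straightforward to compute from per-token next-token distributions. A minimal numpy sketch with illustrative function names (not the team's tooling):

```python
import numpy as np

def token_entropy(probs):
    """Mean per-token entropy (in nats) over next-token distributions.

    probs: (num_tokens, vocab) array of probabilities. A collapse in this
    value is the entropy-drop failure mode discussed in the findings.
    """
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

def token_kl(p, q):
    """Mean per-token KL(p || q), e.g. the same rollout policy evaluated
    by the BF16 training engine (p) vs the FP8 inference engine (q)."""
    p = np.clip(p, 1e-12, 1.0)
    q = np.clip(q, 1e-12, 1.0)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())
```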

Key Findings

On‑policy training with importance‑sampling‑corrected REINFORCE (MiniRL) achieved the highest stability and performance.

Adding length normalization degraded performance, confirming that it breaks the first‑order approximation.

Removing IS correction caused rapid collapse and entropy drop.

In off‑policy settings, both Routing Replay and clipping were essential; omitting either led to early failure.

Different cold‑start initializations converged to similar final performance, indicating that the RL methodology matters more than initialization details.

Conclusion

The study demonstrates that a token‑level surrogate objective, when paired with importance sampling and careful control of policy divergence, provides a theoretically sound and empirically effective way to stabilize RL for LLMs. In MoE architectures, Routing Replay mitigates routing‑induced instability, enabling stable and efficient reinforcement learning.

Tags: large language models, Mixture of Experts, reinforcement learning, importance sampling, training stability, routing replay, sequence-level reward, token-level surrogate