Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive
This article analyzes why optimizing sequence‑level rewards for LLMs with token‑level surrogate objectives can improve reinforcement‑learning stability, explains the theoretical conditions required, introduces Routing Replay for MoE models, and presents extensive experiments validating the approach.
Background
Reinforcement learning (RL) is increasingly used to improve the reasoning and problem‑solving abilities of large language models (LLMs). Because language is highly contextual, RL for LLMs typically relies on a sequence‑level reward that assigns a scalar score to an entire generated response.
Problem
Most mainstream RL algorithms (e.g., REINFORCE, GRPO) optimize a token‑level objective while the reward is defined at the sequence level. This mismatch raises concerns about theoretical soundness and training stability, especially for Mixture‑of‑Experts (MoE) models where dynamic expert routing further complicates token‑level importance sampling.
Key Insight
The Alibaba Qwen team proposes that the expectation of a sequence‑level reward can be approximated by a surrogate token‑level objective, provided two sources of bias are kept sufficiently small:
Numerical discrepancy between training and inference engines.
Distribution shift between the rollout policy used for sampling and the target policy being optimized.
Under these conditions, importance‑sampling (IS) weights appear naturally in the surrogate objective, clipping bounds how far the policy can move in a single update, and, for MoE models, a technique called Routing Replay fixes the expert routing during optimization.
Mathematical Formulation
Let the autoregressive LLM be a policy π_θ. For a prompt x and response y = (y_1, …, y_T), the likelihood factorizes over tokens:
π_θ(y | x) = ∏_{t=1}^{T} π_θ(y_t | x, y_{<t}).
The training objective is the expected sequence‑level reward R(x, y):
J(θ) = E_{x∼D, y∼π_θ(·|x)} [ R(x, y) ].
Using importance sampling to bridge the training engine (which evaluates π_θ) and the inference engine (whose rollout policy is denoted μ), the surrogate token‑level objective becomes
J_surr(θ) = E_{x∼D, y∼μ(·|x)} [ Σ_{t=1}^{T} w_t(θ) · R(x, y) ], where w_t(θ) = π_θ(y_t | x, y_{<t}) / μ(y_t | x, y_{<t}).
Its gradient is
∇_θ J_surr(θ) = E_{x∼D, y∼μ(·|x)} [ R(x, y) Σ_{t=1}^{T} w_t(θ) ∇_θ log π_θ(y_t | x, y_{<t}) ].
The gradient matches the basic REINFORCE update with token‑level IS weights, confirming that the surrogate is a first‑order approximation of the true sequence‑level objective.
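To make the surrogate concrete, here is a minimal PyTorch-style sketch of the token‑level IS‑weighted loss implied by the formulation above. It is an illustration under assumed tensor names (`logprobs_new` from the training engine, `logprobs_rollout` recorded at rollout time, a scalar `reward` per sequence), not the authors' implementation.

```python
import torch

def token_level_is_surrogate(logprobs_new: torch.Tensor,
                             logprobs_rollout: torch.Tensor,
                             reward: torch.Tensor,
                             mask: torch.Tensor) -> torch.Tensor:
    """Negative token-level surrogate objective with per-token IS weights.

    logprobs_new:     log pi_theta(y_t | x, y_<t) from the training engine, [B, T]
    logprobs_rollout: log mu(y_t | x, y_<t) recorded by the inference engine, [B, T]
    reward:           sequence-level reward R(x, y), [B]
    mask:             1 for response tokens, 0 for padding, [B, T]
    """
    # Per-token IS ratio pi_theta / mu, computed in log space for stability.
    ratio = torch.exp(logprobs_new - logprobs_rollout.detach())
    # The sequence-level reward is broadcast to every token of its response.
    per_token = ratio * reward.unsqueeze(-1)
    # Sum over tokens (no length normalization), then average over the batch.
    surrogate = (per_token * mask).sum(dim=-1).mean()
    return -surrogate  # minimize the negative of the objective
```

Differentiating this loss reproduces the REINFORCE‑with‑IS‑weights gradient above; tokens are summed rather than averaged, since length normalization would change the objective (see the findings below).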
Conditions for the Approximation
The first‑order approximation holds when the target policy π_θ and the rollout policy μ_{θ_{old}} are sufficiently close. The IS weight can be expressed as a ratio of the two policies:
w_t(θ) = π_θ(y_t | x, y_{<t}) / μ_{θ_{old}}(y_t | x, y_{<t}),
and the approximation remains accurate only while this ratio stays close to 1 for every token.
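In practice, this closeness is enforced rather than assumed: the per‑token ratio is clipped so that tokens drifting outside a trust region stop contributing gradient in the direction of further drift. Below is a minimal sketch in the spirit of PPO‑style clipping (the paper's exact clipping rule may differ; `eps` and the tensor names are assumptions for illustration):

```python
import torch

def clipped_per_token_objective(ratio: torch.Tensor,
                                reward: torch.Tensor,
                                eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipping of the token-level IS ratio.

    ratio:  pi_theta(y_t | x, y_<t) / mu(y_t | x, y_<t), [B, T]
    reward: sequence-level reward broadcast per token,   [B, T]
    """
    unclipped = ratio * reward
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * reward
    # Taking the element-wise minimum removes any incentive to push the
    # ratio further outside the region where the first-order approximation
    # of the sequence-level objective is trustworthy.
    return torch.minimum(unclipped, clipped)
```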
Challenges in MoE Models and Routing Replay
In MoE models, expert routing changes dynamically for each token, breaking the closeness assumption. The token‑level IS weight for MoE then takes the form
w_t(θ) = π_θ(y_t | x, y_{<t}, r_θ(x, y_{<t})) / μ_{θ_{old}}(y_t | x, y_{<t}, r_{θ_{old}}(x, y_{<t})),
where r(·) denotes the set of experts selected by the router. Because the numerator and denominator can be evaluated under different expert assignments, the ratio can drift far from 1 even when the parameters themselves have barely changed.
To restore the approximation, the authors introduce Routing Replay, which fixes the expert routing during policy updates (see the sketch after this list). Two variants are described:
Vanilla Routing Replay (R2): Replays the exact routing decisions made by the rollout policy during gradient computation.
Rollout Routing Replay (R3): Aligns the routing used in the training engine with that of the inference engine, reducing both training‑inference discrepancy and policy staleness.
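A minimal sketch of what replaying routing could look like for a single MoE layer is given below; the module and tensor names (`router`, `experts`, `replay_indices`) are assumptions for illustration, not the authors' code. The essential point is that the top‑k expert indices are read from a cache recorded at rollout time rather than recomputed from the current router logits.

```python
import torch
import torch.nn.functional as F

def moe_forward_with_replay(hidden, router, experts, replay_indices=None, top_k=2):
    """Forward one MoE layer, optionally replaying cached expert routing.

    hidden:         token representations, [N, d]
    router:         linear layer mapping hidden states to expert logits [N, num_experts]
    experts:        list of expert feed-forward modules
    replay_indices: cached top-k expert ids (long tensor, [N, top_k]); None = normal routing
    """
    logits = router(hidden)                               # [N, num_experts]
    if replay_indices is None:
        # Standard routing: pick the top-k experts from the current router.
        topk_vals, topk_idx = logits.topk(top_k, dim=-1)
    else:
        # Routing Replay: reuse the experts chosen at rollout time, while the
        # gate weights stay differentiable w.r.t. the current router logits.
        topk_idx = replay_indices
        topk_vals = logits.gather(-1, topk_idx)
    gates = F.softmax(topk_vals, dim=-1)                  # renormalize over the k chosen experts

    out = torch.zeros_like(hidden)
    for k in range(top_k):
        for e, expert in enumerate(experts):
            sel = topk_idx[:, k] == e                     # tokens routed to expert e in slot k
            if sel.any():
                out[sel] += gates[sel, k].unsqueeze(-1) * expert(hidden[sel])
    return out, topk_idx                                  # caller caches topk_idx during rollout
```

In this sketch, vanilla Routing Replay (R2) corresponds to passing the indices recorded by the rollout policy during the policy‑update forward pass, while Rollout Routing Replay (R3) additionally sources the cached indices from the inference engine's routing.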
Experimental Setup
The team fine‑tuned a 30‑billion‑parameter MoE model (derived from Qwen3‑30B‑A3B‑Base) using the proposed RL formulation. Training used BF16 precision, while inference employed FP8, creating a severe training‑inference gap. A dataset of 4,096 verified math problems served as prompts, and rewards were binary (correct/incorrect). Evaluation was performed on HMMT25, AIME25, and AIME24 benchmarks, sampling 32 responses per problem. The authors also monitored token‑level entropy of the target policy and KL divergence between rollout policies in training and inference.
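Both monitoring quantities can be computed directly from per‑token distributions. The sketch below shows one straightforward way to track them (tensor names and signatures are assumed, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean token-level entropy of the target policy over response tokens."""
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)            # [B, T]
    return (entropy * mask).sum() / mask.sum()

def train_infer_kl(train_logits: torch.Tensor,
                   infer_logits: torch.Tensor,
                   mask: torch.Tensor) -> torch.Tensor:
    """Per-token KL(p_infer || p_train), averaged over response tokens.

    A growing value signals that the BF16 training engine and the FP8
    inference engine no longer agree on the rollout distribution.
    """
    logp_train = F.log_softmax(train_logits, dim=-1)
    logp_infer = F.log_softmax(infer_logits, dim=-1)
    kl = (logp_infer.exp() * (logp_infer - logp_train)).sum(dim=-1)   # [B, T]
    return (kl * mask).sum() / mask.sum()
```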
Key Findings
On‑policy training with importance‑sampling‑corrected REINFORCE (MiniRL) achieved the highest stability and performance.
Adding length normalization degraded performance, confirming that it breaks the first‑order approximation.
Removing IS correction caused rapid collapse and entropy drop.
In off‑policy settings, both Routing Replay and clipping were essential; omitting either led to early failure.
Different cold‑start initializations converged to similar final performance, indicating that the RL methodology matters more than initialization details.
Conclusion
The study demonstrates that a token‑level surrogate objective, when paired with importance sampling and careful control of policy divergence, provides a theoretically sound and empirically effective way to stabilize RL for LLMs. In MoE architectures, Routing Replay mitigates routing‑induced instability, enabling stable and efficient reinforcement learning.