Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive
This article analyzes why optimizing sequence‑level rewards for LLMs with token‑level surrogate objectives can improve reinforcement‑learning stability, explains the theoretical conditions required, introduces Routing Replay for MoE models, and presents extensive experiments validating the approach.
Background
Reinforcement learning (RL) is increasingly used to improve the reasoning and problem‑solving abilities of large language models (LLMs). Because language is highly contextual, RL for LLMs typically relies on a sequence‑level reward that assigns a scalar score to an entire generated response.
Problem
Most mainstream RL algorithms (e.g., REINFORCE, GRPO) optimize a token‑level objective while the reward is defined at the sequence level. This mismatch raises concerns about theoretical soundness and training stability, especially for Mixture‑of‑Experts (MoE) models where dynamic expert routing further complicates token‑level importance sampling.
Key Insight
The Alibaba Qwen team proposes that the expectation of a sequence‑level reward can be approximated by a surrogate token‑level objective, provided two sources of bias are kept sufficiently small:
Numerical discrepancy between training and inference engines.
Distribution shift between the rollout policy used for sampling and the target policy being optimized.
Under these conditions, importance‑sampling (IS) weights appear naturally in the surrogate objective, clipping bounds how far the policy can move in a single update, and, for MoE models, a technique called Routing Replay fixes the expert routing during optimization.
Mathematical Formulation
Let the autoregressive LLM be a policy π_θ. For a prompt x and response y = (y_1, …, y_T), the likelihood factorizes over tokens:
π_θ(y | x) = ∏_{t=1}^{T} π_θ(y_t | x, y_{<t}).
The training objective is the expected sequence‑level reward R(x, y):
J(θ) = E_{x∼D, y∼π_θ(·|x)} [ R(x, y) ].
Using importance sampling to bridge the training engine (which evaluates π_θ) and the inference engine (whose rollout policy is denoted μ), the surrogate token‑level objective becomes
J_surr(θ) = E_{x∼D, y∼μ(·|x)} [ Σ_{t=1}^{T} w_t(θ) · R(x, y) ], where w_t(θ) = π_θ(y_t | x, y_{<t}) / μ(y_t | x, y_{<t}).
Its gradient is
∇_θ J_surr(θ) = E_{x∼D, y∼μ(·|x)} [ R(x, y) Σ_{t=1}^{T} w_t(θ) ∇_θ log π_θ(y_t | x, y_{<t}) ].
The gradient matches the basic REINFORCE update with token‑level IS weights, confirming that the surrogate is a first‑order approximation of the true sequence‑level objective.
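To make the surrogate concrete, here is a minimal PyTorch-style sketch of the token‑level IS‑weighted loss implied by the formulation above. It is an illustration under assumed tensor names (`logprobs_new` from the training engine, `logprobs_rollout` recorded at rollout time, a scalar `reward` per sequence), not the authors' implementation.

```python
import torch

def token_level_is_surrogate(logprobs_new: torch.Tensor,
                             logprobs_rollout: torch.Tensor,
                             reward: torch.Tensor,
                             mask: torch.Tensor) -> torch.Tensor:
    """Negative token-level surrogate objective with per-token IS weights.

    logprobs_new:     log pi_theta(y_t | x, y_<t) from the training engine, [B, T]
    logprobs_rollout: log mu(y_t | x, y_<t) recorded by the inference engine, [B, T]
    reward:           sequence-level reward R(x, y), [B]
    mask:             1 for response tokens, 0 for padding, [B, T]
    """
    # Per-token IS ratio pi_theta / mu, computed in log space for stability.
    ratio = torch.exp(logprobs_new - logprobs_rollout.detach())
    # The sequence-level reward is broadcast to every token of its response.
    per_token = ratio * reward.unsqueeze(-1)
    # Sum over tokens (no length normalization), then average over the batch.
    surrogate = (per_token * mask).sum(dim=-1).mean()
    return -surrogate  # minimize the negative of the objective
```

Differentiating this loss reproduces the REINFORCE‑with‑IS‑weights gradient above; tokens are summed rather than averaged, since length normalization would change the objective (see the findings below).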
Conditions for the Approximation
The first‑order approximation holds when the target policy π_θ and the rollout policy μ_{θ_{old}} are sufficiently close. The IS weight can be expressed as a ratio of the two policies:
w_t(θ) = π_θ(y_t | x, y_{<t}) / μ_{θ_{old}}(y_t | x, y_{<t}),
and the approximation remains accurate only while this ratio stays close to 1 for every token.
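In practice, this closeness is enforced rather than assumed: the per‑token ratio is clipped so that tokens drifting outside a trust region stop contributing gradient in the direction of further drift. Below is a minimal sketch in the spirit of PPO‑style clipping (the paper's exact clipping rule may differ; `eps` and the tensor names are assumptions for illustration):

```python
import torch

def clipped_per_token_objective(ratio: torch.Tensor,
                                reward: torch.Tensor,
                                eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipping of the token-level IS ratio.

    ratio:  pi_theta(y_t | x, y_<t) / mu(y_t | x, y_<t), [B, T]
    reward: sequence-level reward broadcast per token,   [B, T]
    """
    unclipped = ratio * reward
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * reward
    # Taking the element-wise minimum removes any incentive to push the
    # ratio further outside the region where the first-order approximation
    # of the sequence-level objective is trustworthy.
    return torch.minimum(unclipped, clipped)
```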
Challenges in MoE Models and Routing Replay
In MoE models, expert routing changes dynamically for each token, breaking the closeness assumption. The token‑level IS weight for MoE then takes the form
w_t(θ) = π_θ(y_t | x, y_{<t}, r_θ(x, y_{<t})) / μ_{θ_{old}}(y_t | x, y_{<t}, r_{θ_{old}}(x, y_{<t})),
where r(·) denotes the set of experts selected by the router. Because the numerator and denominator can be evaluated under different expert assignments, the ratio can drift far from 1 even when the parameters themselves have barely changed.
To restore the approximation, the authors introduce Routing Replay, which fixes the expert routing during policy updates (see the sketch after this list). Two variants are described:
Vanilla Routing Replay (R2): Replays the exact routing decisions made by the rollout policy during gradient computation.
Rollout Routing Replay (R3): Aligns the routing used in the training engine with that of the inference engine, reducing both training‑inference discrepancy and policy staleness.
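A minimal sketch of what replaying routing could look like for a single MoE layer is given below; the module and tensor names (`router`, `experts`, `replay_indices`) are assumptions for illustration, not the authors' code. The essential point is that the top‑k expert indices are read from a cache recorded at rollout time rather than recomputed from the current router logits.

```python
import torch
import torch.nn.functional as F

def moe_forward_with_replay(hidden, router, experts, replay_indices=None, top_k=2):
    """Forward one MoE layer, optionally replaying cached expert routing.

    hidden:         token representations, [N, d]
    router:         linear layer mapping hidden states to expert logits [N, num_experts]
    experts:        list of expert feed-forward modules
    replay_indices: cached top-k expert ids (long tensor, [N, top_k]); None = normal routing
    """
    logits = router(hidden)                               # [N, num_experts]
    if replay_indices is None:
        # Standard routing: pick the top-k experts from the current router.
        topk_vals, topk_idx = logits.topk(top_k, dim=-1)
    else:
        # Routing Replay: reuse the experts chosen at rollout time, while the
        # gate weights stay differentiable w.r.t. the current router logits.
        topk_idx = replay_indices
        topk_vals = logits.gather(-1, topk_idx)
    gates = F.softmax(topk_vals, dim=-1)                  # renormalize over the k chosen experts

    out = torch.zeros_like(hidden)
    for k in range(top_k):
        for e, expert in enumerate(experts):
            sel = topk_idx[:, k] == e                     # tokens routed to expert e in slot k
            if sel.any():
                out[sel] += gates[sel, k].unsqueeze(-1) * expert(hidden[sel])
    return out, topk_idx                                  # caller caches topk_idx during rollout
```

In this sketch, vanilla Routing Replay (R2) corresponds to passing the indices recorded by the rollout policy during the policy‑update forward pass, while Rollout Routing Replay (R3) additionally sources the cached indices from the inference engine's routing.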
Experimental Setup
The team fine‑tuned a 30‑billion‑parameter MoE model (derived from Qwen3‑30B‑A3B‑Base) using the proposed RL formulation. Training used BF16 precision, while inference employed FP8, creating a severe training‑inference gap. A dataset of 4,096 verified math problems served as prompts, and rewards were binary (correct/incorrect). Evaluation was performed on HMMT25, AIME25, and AIME24 benchmarks, sampling 32 responses per problem. The authors also monitored token‑level entropy of the target policy and KL divergence between rollout policies in training and inference.
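Both monitoring quantities can be computed directly from per‑token distributions. The sketch below shows one straightforward way to track them (tensor names and signatures are assumed, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean token-level entropy of the target policy over response tokens."""
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)            # [B, T]
    return (entropy * mask).sum() / mask.sum()

def train_infer_kl(train_logits: torch.Tensor,
                   infer_logits: torch.Tensor,
                   mask: torch.Tensor) -> torch.Tensor:
    """Per-token KL(p_infer || p_train), averaged over response tokens.

    A growing value signals that the BF16 training engine and the FP8
    inference engine no longer agree on the rollout distribution.
    """
    logp_train = F.log_softmax(train_logits, dim=-1)
    logp_infer = F.log_softmax(infer_logits, dim=-1)
    kl = (logp_infer.exp() * (logp_infer - logp_train)).sum(dim=-1)   # [B, T]
    return (kl * mask).sum() / mask.sum()
```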
Key Findings
On‑policy training with importance‑sampling‑corrected REINFORCE (MiniRL) achieved the highest stability and performance.
Adding length normalization degraded performance, confirming that it breaks the first‑order approximation.
Removing IS correction caused rapid collapse and entropy drop.
In off‑policy settings, both Routing Replay and clipping were essential; omitting either led to early failure.
Different cold‑start initializations converged to similar final performance, indicating that the RL methodology matters more than initialization details.
Conclusion
The study demonstrates that a token‑level surrogate objective, when paired with importance sampling and careful control of policy divergence, provides a theoretically sound and empirically effective way to stabilize RL for LLMs. In MoE architectures, Routing Replay mitigates routing‑induced instability, enabling stable and efficient reinforcement learning.