From Traditional RL to LLM RL: Theory Derivation and Practical Engineering Improvements
This article walks through the fundamental derivation of policy‑based reinforcement learning, explains how traditional RL concepts extend to large‑language‑model RL, and details engineering enhancements such as GRPO memory reduction, asynchronous rollout, importance‑sampling corrections, and token‑flow management for stable industrial‑scale training.
Traditional Policy‑Based RL Derivation
REINFORCE maximizes the expected reward of trajectories generated by a policy. The objective is \(J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]\), and gradient ascent follows the policy gradient:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\!\left[R(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\right]
\]
REINFORCE is the ancestor of Actor-Critic, TRPO, and PPO; the clipping term of PPO is described in the original PPO paper: https://arxiv.org/abs/1707.06347. The derivation expands the expectation as an integral, applies the log-gradient trick, and rewrites the result as an expectation again.
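Written out in full, a reconstruction of that derivation reads:

\[
\begin{aligned}
\nabla_\theta J(\theta)
&= \nabla_\theta\,\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]
 = \nabla_\theta \int \pi_\theta(\tau)\,R(\tau)\,d\tau \\
&= \int \nabla_\theta \pi_\theta(\tau)\,R(\tau)\,d\tau
 = \int \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\,R(\tau)\,d\tau \\
&= \mathbb{E}_{\tau\sim\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(\tau)\,R(\tau)\right]
\end{aligned}
\]

The middle step uses the identity \(\nabla_\theta \pi_\theta = \pi_\theta\,\nabla_\theta \log \pi_\theta\); the final step folds the integral back into an expectation, which is what allows Monte-Carlo estimation of the gradient from sampled trajectories.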
Mapping Traditional RL to LLM RL
State: the partially generated token sequence (the prompt plus all tokens emitted so far).
Action: selecting the next token from a vocabulary of tens of thousands of entries.
Reward: a scalar provided by a reward model that evaluates the generated sequence.
These definitions replace the abstract environment of classic RL with the token-generation process of LLMs, as the sketch below illustrates.
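A toy illustration of the mapping, where `model` stands for any causal LM that maps a batch of token ids to next-token logits (all names here are placeholders, not a specific framework's API):

```python
import torch

def take_action(model, state_ids: list[int]) -> list[int]:
    """One RL step in LLM terms: state = token prefix, action = next token."""
    logits = model(torch.tensor([state_ids]))[0, -1]   # action scores over the vocabulary
    action = torch.distributions.Categorical(logits=logits).sample().item()
    return state_ids + [action]                        # the state transition is deterministic

# The reward typically arrives only at the end of the episode:
# R = reward_model(state_ids)   # one scalar for the whole generated sequence
```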
GRPO Algorithm and Memory Savings
GRPO computes advantages from group-wise rewards, removing PPO's separate critic (value model); with verifiable rule-based rewards, the learned reward model is dropped as well. This cuts GPU memory consumption dramatically and helped the RLVR (Reinforcement Learning with Verifiable Rewards) paradigm rise to prominence in 2025. For reference, the group-relative advantage at the core of GRPO is reproduced below.
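Following the DeepSeekMath paper that introduced GRPO: sample \(G\) completions per prompt, score each with a reward \(r_i\), and normalize within the group:

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}
\]

Because the baseline comes from the group statistics rather than a learned value network, no critic needs to be held in memory, which is exactly where the saving comes from.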
Industrial‑Scale LLM RL Challenges
When memory and runtime are not constrained, LLM RL consists of three steps per iteration: (1) sampling trajectories and rewards, (2) a forward pass to compute log-probabilities, (3) a backward pass to update the model. Sampling dominates the wall-clock cost, so production-grade frameworks separate the training engine (e.g., FSDP or Megatron) from the inference/sampling engine (e.g., vLLM or SGLang); a schematic of the loop follows.
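A minimal sketch of one iteration, assuming hypothetical `sampler`/`trainer` handles for the two engines and a REINFORCE-style surrogate loss (real frameworks use PPO/GRPO objectives):

```python
import torch

def rl_iteration(sampler, trainer, reward_fn, prompts, optimizer):
    """One pass of the unconstrained three-step loop; all handles are placeholders."""
    # (1) sample trajectories and score them -- the dominant wall-clock cost
    responses = sampler.generate(prompts)
    rewards = torch.tensor([reward_fn(p, r) for p, r in zip(prompts, responses)])

    # (2) forward pass on the training engine to get log-probabilities
    logprobs = trainer.forward_logprobs(prompts, responses)

    # (3) backward pass and parameter update
    loss = -(logprobs * rewards).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()

    sampler.load_weights(trainer.state_dict())  # resync sampler with the new policy
```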
Asynchronous Rollout Acceleration
Pure on‑policy rollout executes sampling and forward‑backward passes sequentially, leading to low GPU utilization. Most frameworks introduce asynchronous rollout using Python asyncio to maintain a buffer queue of completed trajectories. Actors pull batches from the buffer for forward and backward computation, achieving pipeline‑style full‑GPU usage. This introduces off‑policy behavior, which can affect convergence.
To mitigate drift, frameworks expose a “staleness” parameter or limit buffer size, and may apply importance‑sampling corrections.
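A minimal asyncio sketch of the buffer-plus-staleness pattern (hook names such as `generate_async` and `train_step` are placeholders, not any framework's API):

```python
import asyncio

# A bounded queue itself caps off-policy drift: rollout blocks once it is full.
buffer: asyncio.Queue = asyncio.Queue(maxsize=256)

async def rollout_worker(prompts, generate_async, current_version):
    """Keep sampling trajectories, tagging each with the policy version that produced it."""
    for prompt in prompts:
        traj = await generate_async(prompt)
        await buffer.put((traj, current_version()))  # blocks when the buffer is full

async def trainer(train_step, current_version, batch_size=32, max_staleness=2):
    """Consume trajectories, discarding any generated by a policy that is too stale."""
    while True:
        batch = []
        while len(batch) < batch_size:
            traj, version = await buffer.get()
            if current_version() - version <= max_staleness:  # staleness control
                batch.append(traj)
        train_step(batch)  # forward + backward on the training engine
```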
Importance‑Sampling (TIS) Corrections
vLLM/SGLang sampling engines may produce log-probabilities that differ from those the training engine computes for the same tokens (different kernels and numeric precision), creating a train/rollout mismatch that slows convergence. A truncated importance-sampling (TIS) factor multiplies the objective by a correction ratio, restoring consistency. Detailed derivations are available in the TIS blog: https://fengyao.notion.site/off-policy-rl.
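A sketch of the per-token correction, assuming `train_logprobs`/`rollout_logprobs` are log-probs of the same sampled tokens under the two engines (the truncation threshold is illustrative):

```python
import torch

def tis_weight(train_logprobs: torch.Tensor,
               rollout_logprobs: torch.Tensor,
               clip_c: float = 2.0) -> torch.Tensor:
    """Per-token ratio pi_train / pi_rollout, truncated at clip_c and detached
    so it rescales the objective but receives no gradient itself."""
    ratio = torch.exp(train_logprobs - rollout_logprobs)
    return ratio.clamp(max=clip_c).detach()

# Applied multiplicatively to a per-token policy-gradient objective:
# loss = -(tis_weight(lp_train, lp_rollout) * lp_train * advantages).mean()
```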
For Mixture‑of‑Experts (MoE) models, the mismatch can also involve the expert router. Extensions such as IcePop https://www.emergentmind.com/topics/icepop and MiniRL https://arxiv.org/pdf/2512.01374v1 add a per‑token mask to clip the TIS factor, further stabilizing MoE‑based RL.
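One plausible reading of that per-token masking, sketched on top of the TIS weight above (the band thresholds are illustrative, not the values used in the cited papers):

```python
import torch

def masked_tis_weight(train_logprobs: torch.Tensor,
                      rollout_logprobs: torch.Tensor,
                      low: float = 0.5, high: float = 2.0) -> torch.Tensor:
    """Tokens whose train/rollout probability ratio leaves [low, high] are
    dropped from the loss entirely; the rest keep the truncated TIS weight."""
    ratio = torch.exp(train_logprobs - rollout_logprobs)
    keep = ((ratio >= low) & (ratio <= high)).to(ratio.dtype)  # per-token mask
    return (ratio.clamp(max=high) * keep).detach()
```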
Token‑Flow Maintenance in Multi‑Turn RL
In agentic RL, a loop repeatedly calls the LLM to generate assistant replies and then invokes env.step() for feedback. Using raw text as the communication medium can cause tokenizer merges to differ across turns, leading to anomalous token probabilities and eventual training collapse.
The solution is to maintain a persistent token-id list on the training side, extending it incrementally with the generated ids and the newly encoded environment feedback rather than re-tokenizing the whole conversation. This keeps token probabilities consistent across many dialogue turns.
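A minimal sketch of the incremental bookkeeping, with `llm`, `tokenizer`, and `env` as placeholder components:

```python
# The invariant: token ids are only ever appended, never re-derived from text.
token_ids: list[int] = tokenizer.encode(system_prompt)

for _ in range(max_turns):
    reply_ids = llm.generate(token_ids)        # generate from ids, keep the raw ids
    token_ids += reply_ids                     # never re-encode the decoded reply

    feedback = env.step(tokenizer.decode(reply_ids))   # the environment sees text
    # Encode ONLY the new feedback; re-tokenizing the full transcript could
    # merge tokens across turn boundaries and shift earlier token probabilities.
    token_ids += tokenizer.encode(feedback, add_special_tokens=False)
```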
References
[1] Zhao Shiyu, “Mathematical Principles of Reinforcement Learning” (course video) – https://www.bilibili.com/video/BV1sd4y167NS/
[2] Blog covering Reinforce, Actor‑Critic, PPO – https://medium.com/data-science/understand-reinforce-actor-critic-and-ppo-in-one-go-2569f520c066
[3] PPO paper – https://arxiv.org/abs/1707.06347
[4] AgentRL – https://arxiv.org/abs/2510.04206
[5] AReal – https://arxiv.org/abs/2505.24298
[6] TIS blog – https://fengyao.notion.site/off-policy-rl
[7] IcePop – https://www.emergentmind.com/topics/icepop
[8] MiniRL – https://arxiv.org/pdf/2512.01374v1