From Traditional RL to LLM RL: Theory Derivation and Practical Engineering Improvements
This article walks through the fundamental derivation of policy‑based reinforcement learning, explains how traditional RL concepts extend to large‑language‑model RL, and details engineering enhancements such as GRPO memory reduction, asynchronous rollout, importance‑sampling corrections, and token‑flow management for stable industrial‑scale training.
Traditional Policy‑Based RL Derivation
REINFORCE maximizes the expected reward of trajectories generated by a policy. The objective is \(J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]\), and gradient ascent follows the policy gradient:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\!\left[R(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\right]
\]
REINFORCE is the ancestor of Actor-Critic, TRPO, and PPO; the clipping term of PPO is described in the original PPO paper: https://arxiv.org/abs/1707.06347. The derivation expands the expectation as an integral, applies the log-gradient trick, and rewrites the result as an expectation again.
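Written out in full, a reconstruction of that derivation reads:

\[
\begin{aligned}
\nabla_\theta J(\theta)
&= \nabla_\theta\,\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]
 = \nabla_\theta \int \pi_\theta(\tau)\,R(\tau)\,d\tau \\
&= \int \nabla_\theta \pi_\theta(\tau)\,R(\tau)\,d\tau
 = \int \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\,R(\tau)\,d\tau \\
&= \mathbb{E}_{\tau\sim\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(\tau)\,R(\tau)\right]
\end{aligned}
\]

The middle step uses the identity \(\nabla_\theta \pi_\theta = \pi_\theta\,\nabla_\theta \log \pi_\theta\); the final step folds the integral back into an expectation, which is what allows Monte-Carlo estimation of the gradient from sampled trajectories.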
Mapping Traditional RL to LLM RL
State: the partially generated token sequence (the prompt plus all tokens emitted so far).
Action: selecting the next token from a vocabulary of tens of thousands of entries.
Reward: a scalar provided by a reward model that evaluates the generated sequence.
These definitions replace the abstract environment of classic RL with the token-generation process of LLMs, as the sketch below illustrates.
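A toy illustration of the mapping, where `model` stands for any causal LM that maps a batch of token ids to next-token logits (all names here are placeholders, not a specific framework's API):

```python
import torch

def take_action(model, state_ids: list[int]) -> list[int]:
    """One RL step in LLM terms: state = token prefix, action = next token."""
    logits = model(torch.tensor([state_ids]))[0, -1]   # action scores over the vocabulary
    action = torch.distributions.Categorical(logits=logits).sample().item()
    return state_ids + [action]                        # the state transition is deterministic

# The reward typically arrives only at the end of the episode:
# R = reward_model(state_ids)   # one scalar for the whole generated sequence
```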
GRPO Algorithm and Memory Savings
GRPO computes advantages from group-wise rewards, removing PPO's separate critic (value model); with verifiable rule-based rewards, the learned reward model is dropped as well. This cuts GPU memory consumption dramatically and helped the RLVR (Reinforcement Learning with Verifiable Rewards) paradigm rise to prominence in 2025. For reference, the group-relative advantage at the core of GRPO is reproduced below.
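Following the DeepSeekMath paper that introduced GRPO: sample \(G\) completions per prompt, score each with a reward \(r_i\), and normalize within the group:

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}
\]

Because the baseline comes from the group statistics rather than a learned value network, no critic needs to be held in memory, which is exactly where the saving comes from.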
Industrial‑Scale LLM RL Challenges
When memory and runtime are not constrained, LLM RL consists of three steps per iteration: (1) sampling trajectories and rewards, (2) a forward pass to compute log-probabilities, (3) a backward pass to update the model. Sampling dominates the wall-clock cost, so production-grade frameworks separate the training engine (e.g., FSDP or Megatron) from the inference/sampling engine (e.g., vLLM or SGLang); a schematic of the loop follows.
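A minimal sketch of one iteration, assuming hypothetical `sampler`/`trainer` handles for the two engines and a REINFORCE-style surrogate loss (real frameworks use PPO/GRPO objectives):

```python
import torch

def rl_iteration(sampler, trainer, reward_fn, prompts, optimizer):
    """One pass of the unconstrained three-step loop; all handles are placeholders."""
    # (1) sample trajectories and score them -- the dominant wall-clock cost
    responses = sampler.generate(prompts)
    rewards = torch.tensor([reward_fn(p, r) for p, r in zip(prompts, responses)])

    # (2) forward pass on the training engine to get log-probabilities
    logprobs = trainer.forward_logprobs(prompts, responses)

    # (3) backward pass and parameter update
    loss = -(logprobs * rewards).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()

    sampler.load_weights(trainer.state_dict())  # resync sampler with the new policy
```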
Asynchronous Rollout Acceleration
Pure on‑policy rollout executes sampling and forward‑backward passes sequentially, leading to low GPU utilization. Most frameworks introduce asynchronous rollout using Python asyncio to maintain a buffer queue of completed trajectories. Actors pull batches from the buffer for forward and backward computation, achieving pipeline‑style full‑GPU usage. This introduces off‑policy behavior, which can affect convergence.
To mitigate drift, frameworks expose a “staleness” parameter or limit buffer size, and may apply importance‑sampling corrections.
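A minimal asyncio sketch of the buffer-plus-staleness pattern (hook names such as `generate_async` and `train_step` are placeholders, not any framework's API):

```python
import asyncio

# A bounded queue itself caps off-policy drift: rollout blocks once it is full.
buffer: asyncio.Queue = asyncio.Queue(maxsize=256)

async def rollout_worker(prompts, generate_async, current_version):
    """Keep sampling trajectories, tagging each with the policy version that produced it."""
    for prompt in prompts:
        traj = await generate_async(prompt)
        await buffer.put((traj, current_version()))  # blocks when the buffer is full

async def trainer(train_step, current_version, batch_size=32, max_staleness=2):
    """Consume trajectories, discarding any generated by a policy that is too stale."""
    while True:
        batch = []
        while len(batch) < batch_size:
            traj, version = await buffer.get()
            if current_version() - version <= max_staleness:  # staleness control
                batch.append(traj)
        train_step(batch)  # forward + backward on the training engine
```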
Importance‑Sampling (TIS) Corrections
vLLM/SGLang sampling engines may produce log-probabilities that differ from those the training engine computes for the same tokens (different kernels and numeric precision), creating a train/rollout mismatch that slows convergence. A truncated importance-sampling (TIS) factor multiplies the objective by a correction ratio, restoring consistency. Detailed derivations are available in the TIS blog: https://fengyao.notion.site/off-policy-rl.
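A sketch of the per-token correction, assuming `train_logprobs`/`rollout_logprobs` are log-probs of the same sampled tokens under the two engines (the truncation threshold is illustrative):

```python
import torch

def tis_weight(train_logprobs: torch.Tensor,
               rollout_logprobs: torch.Tensor,
               clip_c: float = 2.0) -> torch.Tensor:
    """Per-token ratio pi_train / pi_rollout, truncated at clip_c and detached
    so it rescales the objective but receives no gradient itself."""
    ratio = torch.exp(train_logprobs - rollout_logprobs)
    return ratio.clamp(max=clip_c).detach()

# Applied multiplicatively to a per-token policy-gradient objective:
# loss = -(tis_weight(lp_train, lp_rollout) * lp_train * advantages).mean()
```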
For Mixture‑of‑Experts (MoE) models, the mismatch can also involve the expert router. Extensions such as IcePop https://www.emergentmind.com/topics/icepop and MiniRL https://arxiv.org/pdf/2512.01374v1 add a per‑token mask to clip the TIS factor, further stabilizing MoE‑based RL.
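One plausible reading of that per-token masking, sketched on top of the TIS weight above (the band thresholds are illustrative, not the values used in the cited papers):

```python
import torch

def masked_tis_weight(train_logprobs: torch.Tensor,
                      rollout_logprobs: torch.Tensor,
                      low: float = 0.5, high: float = 2.0) -> torch.Tensor:
    """Tokens whose train/rollout probability ratio leaves [low, high] are
    dropped from the loss entirely; the rest keep the truncated TIS weight."""
    ratio = torch.exp(train_logprobs - rollout_logprobs)
    keep = ((ratio >= low) & (ratio <= high)).to(ratio.dtype)  # per-token mask
    return (ratio.clamp(max=high) * keep).detach()
```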
Token‑Flow Maintenance in Multi‑Turn RL
In agentic RL, a loop repeatedly calls the LLM to generate assistant replies and then invokes env.step() for feedback. Using raw text as the communication medium can cause tokenizer merges to differ across turns, leading to anomalous token probabilities and eventual training collapse.
The solution is to maintain a persistent token-id list on the training side, extending it incrementally with the generated ids and the newly encoded environment feedback rather than re-tokenizing the whole conversation. This keeps token probabilities consistent across many dialogue turns.
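A minimal sketch of the incremental bookkeeping, with `llm`, `tokenizer`, and `env` as placeholder components:

```python
# The invariant: token ids are only ever appended, never re-derived from text.
token_ids: list[int] = tokenizer.encode(system_prompt)

for _ in range(max_turns):
    reply_ids = llm.generate(token_ids)        # generate from ids, keep the raw ids
    token_ids += reply_ids                     # never re-encode the decoded reply

    feedback = env.step(tokenizer.decode(reply_ids))   # the environment sees text
    # Encode ONLY the new feedback; re-tokenizing the full transcript could
    # merge tokens across turn boundaries and shift earlier token probabilities.
    token_ids += tokenizer.encode(feedback, add_special_tokens=False)
```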
References
[1] Zhao Shiyu, “Mathematical Principles of Reinforcement Learning” (course video) – https://www.bilibili.com/video/BV1sd4y167NS/
[2] Blog covering Reinforce, Actor‑Critic, PPO – https://medium.com/data-science/understand-reinforce-actor-critic-and-ppo-in-one-go-2569f520c066
[3] PPO paper – https://arxiv.org/abs/1707.06347
[4] AgentRL – https://arxiv.org/abs/2510.04206
[5] AReal – https://arxiv.org/abs/2505.24298
[6] TIS blog – https://fengyao.notion.site/off-policy-rl
[7] IcePop – https://www.emergentmind.com/topics/icepop
[8] MiniRL – https://arxiv.org/pdf/2512.01374v1