From Traditional RL to LLM‑RL: Theory Derivation and Engineering Improvements

The article walks through the fundamentals of traditional policy‑gradient reinforcement learning, derives the Reinforce objective, maps its concepts to large‑language‑model RL, and then discusses practical engineering solutions such as GRPO, async rollout, importance‑sampling corrections, and token‑flow management for industrial‑scale training.

The author, a researcher who started working on LLM‑RL in early 2025, first reviews the basic principles of reinforcement learning, emphasizing that, unlike supervised learning, RL does not learn from labeled data: an agent iteratively interacts with an environment and updates its policy based on reward feedback.

Focusing on policy‑based methods, the Reinforce algorithm is presented as the canonical example. Its objective, maximizing the expected return over trajectories, is written out, and its gradient is derived step by step: expand the expectation into a sum over trajectories, apply the log‑derivative trick, and rewrite the result as an expectation again. The author notes that Reinforce is the starting point for later algorithms such as Actor‑Critic, TRPO, and PPO, and points to a blog post [2] for further details.
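For concreteness, the derivation follows the standard pattern below (notation assumed here: $\tau$ is a trajectory, $R(\tau)$ its return, $\pi_\theta$ the policy; the article's own notation may differ):

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]
          = \sum_{\tau} P(\tau;\theta)\, R(\tau)

\nabla_\theta J(\theta)
  = \sum_{\tau} \nabla_\theta P(\tau;\theta)\, R(\tau)
  = \sum_{\tau} P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]
```

The second equality is the log‑derivative trick ($\nabla_\theta P = P\,\nabla_\theta \log P$), and the last step drops the environment's transition terms because they do not depend on $\theta$.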

The transition from traditional RL to LLM‑RL is then described. In the LLM setting, the state is the concatenation of the initial prompt and the tokens generated so far, the action space is the vocabulary, the transition is deterministic (the chosen token is appended to the state), and the reward is a scalar produced by a reward model once the sequence is complete. The GRPO algorithm, introduced in the DeepSeekMath work and popularized by DeepSeek‑R1, computes advantages from group‑wise rewards: for each prompt it samples a group of responses and normalizes each response's reward against the group's mean and standard deviation. This eliminates the separate critic (value) model and dramatically reduces memory consumption. The author links to the GRPO paper (https://arxiv.org/pdf/2402.03300).
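A minimal sketch of the group‑wise advantage computation (a hypothetical helper for illustration, not code from the GRPO paper):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Compute GRPO-style advantages for one prompt.

    rewards: shape (G,), one scalar reward per sampled response in the group.
    Each response's advantage is its reward normalized by the group mean and
    standard deviation, so no learned critic is required.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of 4 responses sampled for the same prompt
advantages = grpo_advantages(torch.tensor([1.0, 0.0, 1.0, 0.5]))
```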

Industrial‑scale LLM‑RL faces two major challenges: the high computational cost of rollout and the need for fast inference. Consequently, most production frameworks separate training (using FSDP or Megatron) from inference (using VLLM or SGLang). To mitigate rollout bottlenecks, many frameworks adopt asynchronous rollout via Python asyncio, maintaining a buffer of sampled trajectories that actors consume for forward and backward passes. This pipeline‑like approach improves GPU utilization but introduces off‑policy drift; frameworks control this drift with a “staleness” parameter or buffer‑size limits and apply importance‑sampling corrections.
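A minimal sketch of this producer‑consumer pattern, assuming hypothetical `generate` and `train_step` callables and illustrative buffer and staleness settings:

```python
import asyncio

MAX_STALENESS = 2                                    # assumed: max policy-version lag allowed
buffer: asyncio.Queue = asyncio.Queue(maxsize=64)    # assumed size; bounds off-policy drift
policy_version = 0

async def rollout_worker(generate):
    """Keep sampling trajectories on the inference engine and enqueue them."""
    while True:
        traj = await generate()                      # e.g. an async VLLM/SGLang generation call
        await buffer.put((policy_version, traj))     # tag with the policy version that produced it

async def trainer(train_step):
    """Consume trajectories, discard overly stale ones, and update the policy."""
    global policy_version
    while True:
        version, traj = await buffer.get()
        if policy_version - version > MAX_STALENESS:
            continue                                 # too off-policy: drop (or down-weight) it
        train_step(traj)                             # forward/backward on the training engine
        policy_version += 1                          # new weights would be synced to inference here
```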

Importance‑sampling correction addresses the mismatch between log‑probabilities computed by the training engine and those computed by the inference engine, which can arise from numerical‑precision differences in VLLM/SGLang. The community's TIS (Truncated Importance Sampling) technique multiplies the objective by a truncated probability‑ratio correction factor, and for Mixture‑of‑Experts (MoE) models a per‑token mask is additionally added to clip the correction, as described in IcePop and MiniRL.
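A rough sketch of what such a correction can look like; the function name and threshold values are illustrative assumptions, not taken from the TIS or IcePop write‑ups:

```python
import torch

def corrected_pg_loss(logp_train, logp_rollout, advantages,
                      clip_c=2.0, mask_low=0.5, mask_high=2.0):
    """Per-token importance-sampling correction between training and inference engines.

    logp_train:   log-probs of the sampled tokens under the training engine
    logp_rollout: log-probs of the same tokens recorded by the inference engine
    """
    ratio = torch.exp(logp_train.detach() - logp_rollout)   # pi_train / pi_rollout, no gradient
    tis_weight = torch.clamp(ratio, max=clip_c)              # truncated importance weight
    # Per-token mask in the spirit of IcePop: drop tokens where the engines disagree too much.
    mask = ((ratio > mask_low) & (ratio < mask_high)).float()
    per_token_loss = -tis_weight * mask * advantages * logp_train
    return per_token_loss.sum() / mask.sum().clamp(min=1.0)
```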

When training agentic RL systems that involve multi‑turn interactions, the author highlights a stability issue: if the conversation is re‑tokenized from its full text at every turn, tokenizer merge behavior can produce token ids that differ across turns from the ids the model actually generated, leading to abnormal token probabilities and eventual training collapse. The recommended solution is to maintain an incremental token‑id stream, appending only the new tokens after each environment step instead of re‑tokenizing from scratch.
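A compact illustration of the incremental approach (a hypothetical helper class; the loss mask is an added detail to distinguish model tokens from environment tokens):

```python
from typing import List

class TokenStream:
    """Incremental token-id stream for multi-turn agentic rollouts (illustrative sketch)."""

    def __init__(self, prompt_ids: List[int]):
        self.token_ids: List[int] = list(prompt_ids)
        self.loss_mask: List[int] = [0] * len(prompt_ids)    # no loss on prompt tokens

    def append_model_tokens(self, new_ids: List[int]) -> None:
        # Append exactly the ids the model generated; never decode and re-tokenize.
        self.token_ids.extend(new_ids)
        self.loss_mask.extend([1] * len(new_ids))            # train on model-generated tokens

    def append_env_tokens(self, new_ids: List[int]) -> None:
        # Tokenize only the new environment observation and append its ids.
        self.token_ids.extend(new_ids)
        self.loss_mask.extend([0] * len(new_ids))            # no loss on environment feedback
```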

References:

[1] Reinforcement Learning Mathematics (Bilibili video)

[2] Blog on Reinforce, Actor‑Critic, PPO

[3] PPO paper (arXiv:1707.06347)

[4] AgentRL (arXiv:2510.04206)

[5] AReal (arXiv:2505.24298)

[6] TIS blog (https://fengyao.notion.site/off-policy-rl)

[7] IcePop (https://www.emergentmind.com/topics/icepop)

[8] MiniRL (arXiv:2512.01374)

Tags: LLM, policy gradient, Importance Sampling, GRPO, Token Flow, Async Rollout