From Traditional RL to LLM‑RL: Theory Derivation and Engineering Improvements

The article walks through the fundamentals of traditional policy‑gradient reinforcement learning, derives the Reinforce objective, maps its concepts to large‑language‑model RL, and then discusses practical engineering solutions such as GRPO, async rollout, importance‑sampling corrections, and token‑flow management for industrial‑scale training.

The author, a researcher who started working on LLM‑RL in early 2025, first reviews the basic principles of reinforcement learning, emphasizing that, unlike supervised learning, RL does not learn from labeled data: an agent iteratively interacts with an environment and updates its policy based on reward feedback.

Focusing on policy‑based methods, the Reinforce algorithm is presented as the canonical example. Its objective, maximizing the expected return over trajectories, is written out, and its gradient is derived step by step: expand the expectation into a sum over trajectories, apply the log‑derivative trick, and rewrite the result as an expectation again. The author notes that Reinforce is the starting point for later algorithms such as Actor‑Critic, TRPO, and PPO, and points to a blog post [2] for further details.
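For concreteness, the derivation follows the standard pattern below (notation assumed here: $\tau$ is a trajectory, $R(\tau)$ its return, $\pi_\theta$ the policy; the article's own notation may differ):

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]
          = \sum_{\tau} P(\tau;\theta)\, R(\tau)

\nabla_\theta J(\theta)
  = \sum_{\tau} \nabla_\theta P(\tau;\theta)\, R(\tau)
  = \sum_{\tau} P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]
```

The second equality is the log‑derivative trick ($\nabla_\theta P = P\,\nabla_\theta \log P$), and the last step drops the environment's transition terms because they do not depend on $\theta$.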

The transition from traditional RL to LLM‑RL is then described. In the LLM setting, the state is the concatenation of the initial prompt and the tokens generated so far, the action space is the vocabulary, the transition is deterministic (the chosen token is appended to the state), and the reward is a scalar produced by a reward model once the sequence is complete. The GRPO algorithm, introduced in the DeepSeekMath work and popularized by DeepSeek‑R1, computes advantages from group‑wise rewards: for each prompt it samples a group of responses and normalizes each response's reward against the group's mean and standard deviation. This eliminates the separate critic (value) model and dramatically reduces memory consumption. The author links to the GRPO paper (https://arxiv.org/pdf/2402.03300).
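A minimal sketch of the group‑wise advantage computation (a hypothetical helper for illustration, not code from the GRPO paper):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Compute GRPO-style advantages for one prompt.

    rewards: shape (G,), one scalar reward per sampled response in the group.
    Each response's advantage is its reward normalized by the group mean and
    standard deviation, so no learned critic is required.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of 4 responses sampled for the same prompt
advantages = grpo_advantages(torch.tensor([1.0, 0.0, 1.0, 0.5]))
```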

Industrial‑scale LLM‑RL faces two major challenges: the high computational cost of rollout and the need for fast inference. Consequently, most production frameworks separate training (using FSDP or Megatron) from inference (using VLLM or SGLang). To mitigate rollout bottlenecks, many frameworks adopt asynchronous rollout via Python asyncio, maintaining a buffer of sampled trajectories that actors consume for forward and backward passes. This pipeline‑like approach improves GPU utilization but introduces off‑policy drift; frameworks control this drift with a “staleness” parameter or buffer‑size limits and apply importance‑sampling corrections.
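A minimal sketch of this producer‑consumer pattern, assuming hypothetical `generate` and `train_step` callables and illustrative buffer and staleness settings:

```python
import asyncio

MAX_STALENESS = 2                                    # assumed: max policy-version lag allowed
buffer: asyncio.Queue = asyncio.Queue(maxsize=64)    # assumed size; bounds off-policy drift
policy_version = 0

async def rollout_worker(generate):
    """Keep sampling trajectories on the inference engine and enqueue them."""
    while True:
        traj = await generate()                      # e.g. an async VLLM/SGLang generation call
        await buffer.put((policy_version, traj))     # tag with the policy version that produced it

async def trainer(train_step):
    """Consume trajectories, discard overly stale ones, and update the policy."""
    global policy_version
    while True:
        version, traj = await buffer.get()
        if policy_version - version > MAX_STALENESS:
            continue                                 # too off-policy: drop (or down-weight) it
        train_step(traj)                             # forward/backward on the training engine
        policy_version += 1                          # new weights would be synced to inference here
```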

Importance‑sampling correction addresses the mismatch between log‑probabilities computed by the training engine and those computed by the inference engine, which can arise from numerical‑precision differences in VLLM/SGLang. The community's TIS (Truncated Importance Sampling) technique multiplies the objective by a truncated probability‑ratio correction factor, and for Mixture‑of‑Experts (MoE) models a per‑token mask is additionally added to clip the correction, as described in IcePop and MiniRL.
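A rough sketch of what such a correction can look like; the function name and threshold values are illustrative assumptions, not taken from the TIS or IcePop write‑ups:

```python
import torch

def corrected_pg_loss(logp_train, logp_rollout, advantages,
                      clip_c=2.0, mask_low=0.5, mask_high=2.0):
    """Per-token importance-sampling correction between training and inference engines.

    logp_train:   log-probs of the sampled tokens under the training engine
    logp_rollout: log-probs of the same tokens recorded by the inference engine
    """
    ratio = torch.exp(logp_train.detach() - logp_rollout)   # pi_train / pi_rollout, no gradient
    tis_weight = torch.clamp(ratio, max=clip_c)              # truncated importance weight
    # Per-token mask in the spirit of IcePop: drop tokens where the engines disagree too much.
    mask = ((ratio > mask_low) & (ratio < mask_high)).float()
    per_token_loss = -tis_weight * mask * advantages * logp_train
    return per_token_loss.sum() / mask.sum().clamp(min=1.0)
```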

When training agentic RL systems that involve multi‑turn interactions, the author highlights a stability issue: if the conversation is re‑tokenized from its full text at every turn, tokenizer merge behavior can produce token ids that differ across turns from the ids the model actually generated, leading to abnormal token probabilities and eventual training collapse. The recommended solution is to maintain an incremental token‑id stream, appending only the new tokens after each environment step instead of re‑tokenizing from scratch.
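A compact illustration of the incremental approach (a hypothetical helper class; the loss mask is an added detail to distinguish model tokens from environment tokens):

```python
from typing import List

class TokenStream:
    """Incremental token-id stream for multi-turn agentic rollouts (illustrative sketch)."""

    def __init__(self, prompt_ids: List[int]):
        self.token_ids: List[int] = list(prompt_ids)
        self.loss_mask: List[int] = [0] * len(prompt_ids)    # no loss on prompt tokens

    def append_model_tokens(self, new_ids: List[int]) -> None:
        # Append exactly the ids the model generated; never decode and re-tokenize.
        self.token_ids.extend(new_ids)
        self.loss_mask.extend([1] * len(new_ids))            # train on model-generated tokens

    def append_env_tokens(self, new_ids: List[int]) -> None:
        # Tokenize only the new environment observation and append its ids.
        self.token_ids.extend(new_ids)
        self.loss_mask.extend([0] * len(new_ids))            # no loss on environment feedback
```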

References:

[1] Reinforcement Learning Mathematics (Bilibili video)

[2] Blog on Reinforce, Actor‑Critic, PPO

[3] PPO paper (arXiv:1707.06347)

[4] AgentRL (arXiv:2510.04206)

[5] AReal (arXiv:2505.24298)

[6] TIS blog (https://fengyao.notion.site/off-policy-rl)

[7] IcePop (https://www.emergentmind.com/topics/icepop)

[8] MiniRL (arXiv:2512.01374)

Tags: LLM, policy gradient, Importance Sampling, GRPO, Token Flow, Async Rollout