Boosting LLM Post-Training with RL: Tips for Efficiency and Stability
This article shares practical insights and pitfalls from six months of applying reinforcement learning to fine‑tune large language models, covering exploration efficiency, training stability, model selection, and special considerations for thinking‑oriented agents.
Exploration Efficiency
Reinforcement learning for LLMs is heavyweight because it typically requires loading several models simultaneously: the current policy (for log-probabilities), a reference model, a cached copy for the old log-probabilities, a critic model, and often a reward model. The more models involved, the more pipeline bottlenecks appear and the lower the overall training throughput.
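To see why this is heavyweight, a back-of-envelope memory sketch helps (all model sizes and per-parameter byte counts below are illustrative assumptions, not figures from the article):

```python
def model_gb(params_billion, bytes_per_param):
    # Billions of parameters × bytes per parameter ≈ gigabytes of memory.
    return params_billion * bytes_per_param

# Trainable models carry bf16 weights + gradients + fp32 master weights +
# Adam moments (~18 bytes/param in a typical mixed-precision setup);
# frozen models only need bf16 weights (~2 bytes/param).
policy    = model_gb(7, 18)   # current policy being optimized
critic    = model_gb(7, 18)   # value model (PPO-style setups)
reference = model_gb(7, 2)    # frozen KL reference
reward    = model_gb(7, 2)    # frozen reward model
print(f"≈ {policy + critic + reference + reward:.0f} GB before activations and KV cache")
```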
A common bottleneck is the synchronization interval between rollout and training. With Sync=1, one step of data collection is immediately followed by one step of training, yielding only about 50% GPU utilization. Increasing the sync interval (allowing more off-policy data) can improve utilization but may cause instability; importance-sampling variants such as GRPO were created to mitigate this, yet they do not guarantee crash-free training.
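A minimal sketch of how the sync interval is usually wired, assuming hypothetical `trainer` and `rollout_workers` objects (`update`, `collect`, and `load_weights` are illustrative names, not a specific framework's API):

```python
def run(trainer, rollout_workers, sync_interval=1, total_steps=1000):
    rollout_workers.load_weights(trainer.state_dict())      # initial weight sync
    for step in range(total_steps):
        # Workers generate with their current, possibly stale, weights.
        # sync_interval=1 keeps them fully on-policy but leaves GPUs idle;
        # larger intervals overlap rollout and training at the cost of
        # more off-policy drift (and more clipping / instability risk).
        batch = rollout_workers.collect(n_episodes=64)
        trainer.update(batch)                                # PPO/GRPO step
        if (step + 1) % sync_interval == 0:
            rollout_workers.load_weights(trainer.state_dict())
```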
Agent-environment latency is another major issue. For example, a Webshop task with 32 parallel runners consumes roughly 1.7 TB of memory and puts heavy CPU load on the retrieval stage, leading to high latency. Mobile or GUI agents are even more expensive. A practical workaround is to mock the mobile environment, using screenshots instead of real devices, so many runners can operate cheaply, though the simulated environment may miss corner cases.
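A rough sketch of what such a mock can look like; `MockMobileEnv` is a hypothetical class that replays cached screenshots keyed by (screen, action) instead of driving a real device:

```python
class MockMobileEnv:
    """Replays cached screenshots instead of driving a real device."""

    def __init__(self, transitions, start_screen):
        # transitions: {(screen_id, action): (next_screen_id, screenshot, reward, done)}
        self.transitions = transitions
        self.start_screen = start_screen
        self.screen = start_screen

    def reset(self):
        self.screen = self.start_screen
        return self.screen

    def step(self, action):
        key = (self.screen, action)
        if key not in self.transitions:
            # Corner case not captured when the cache was built: end the episode.
            return self.screen, 0.0, True, {"reason": "uncovered transition"}
        self.screen, screenshot, reward, done = self.transitions[key]
        return screenshot, reward, done, {}
```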
Increasing the quantity and diversity of positive samples is beneficial, but data-synthesis techniques that work for supervised fine-tuning do not translate directly to RL: synthesized responses are generated under contexts that differ from the ones the trainer sees, so their log-probabilities do not line up and importance sampling may be invalid. A technique called context distillation can help, and in some cases training works even without explicit importance sampling.
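The log-probability point can be made concrete: the clipped importance weight below is only meaningful when the old log-probs were recorded for the same tokens in the same context the trainer now sees, which synthesized SFT-style data generally lacks (a minimal sketch, not a specific algorithm's implementation):

```python
import torch

def clipped_is_weights(new_logprobs, old_logprobs, eps=0.2):
    # Valid only when old_logprobs come from the behavior policy on the
    # exact same (context, token) pairs; synthesized data has no such
    # record, so this ratio would be meaningless for it.
    ratio = torch.exp(new_logprobs - old_logprobs)
    return torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
```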
Training Stability
RL does not scale as smoothly as pre-training or supervised fine-tuning; training may diverge after only a few thousand steps, with sudden spikes in entropy, KL divergence, reward, PPO loss, or output length. Reports around DeepSeek 3.2 and Qwen 3 show that RL can elicit reasoning ability with as few as 4k data points, which highlights both its data efficiency and how hard it is to scale further.
Stability problems often stem from infrastructure mismatches. Using vLLM or SGLang for inference can produce log‑probabilities that differ from HuggingFace’s results, leading to excessive clipping by importance sampling. A possible remedy is to recompute log‑probabilities with HuggingFace during the prefill stage.
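A sketch of that remedy, assuming a HuggingFace causal-LM and illustrative variable names (`input_ids` is the prompt plus the rolled-out response, `response_start` the index of its first token):

```python
import torch

@torch.no_grad()
def recompute_logprobs(hf_model, input_ids, response_start):
    """Re-score a rolled-out sequence with the HuggingFace weights so the
    training-side log-probs don't inherit numerical differences from the
    vLLM/SGLang inference kernels."""
    logits = hf_model(input_ids.unsqueeze(0)).logits[0]
    # Token t is predicted from position t-1.
    logprobs = torch.log_softmax(logits[:-1].float(), dim=-1)
    token_lp = logprobs.gather(-1, input_ids[1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[response_start - 1:]   # log-probs of the response tokens only
```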
Loss selection matters: sequence‑level loss (e.g., GSPO) converges more slowly but is more stable for dense models and offers optimizations for MoE models, whereas token‑level loss (e.g., DAPO) can fail on long‑sequence tasks. Setting an appropriate maximum output token length is crucial; overly large limits (e.g., 8192 tokens for a task that only needs 200) can cause runaway rollouts and crashes.
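The difference between the two aggregation schemes can be sketched as follows; this illustrates only token-level versus sequence-level averaging of a clipped objective, not the exact DAPO or GSPO formulas:

```python
import torch

def clipped_objective(ratio, advantage, eps=0.2):
    return torch.minimum(ratio * advantage,
                         torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)

def token_level_loss(ratio, adv, mask):
    # Every valid token in the batch contributes equally,
    # so long responses dominate the gradient.
    obj = clipped_objective(ratio, adv) * mask
    return -obj.sum() / mask.sum()

def sequence_level_loss(ratio, adv, mask):
    # Each sequence contributes equally regardless of its length:
    # per-sequence average first, then batch average.
    obj = clipped_objective(ratio, adv) * mask
    per_seq = obj.sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return -per_seq.mean()
```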
Small LLMs in multi-turn agent RL tend to lose focus; repeating the original target and recent actions in the prompt each turn helps maintain direction. Larger sync values are not inherently worse; some experiments show Sync=10 outperforming Sync=1. Fully asynchronous training, however, should be paired with a priority buffer that surfaces newer data.
Example GPU utilization for a 1.5 B model on the Webshop task (16 runners, 8 cards for rollout, 2 cards for training): Sync=1 → 62% average utilization, Sync=5 → 72%, Sync=10 → 81%, Sync=20 → 85%.
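For the fully asynchronous case mentioned above, the priority buffer can be as simple as a recency-ordered buffer that the trainer drains newest-first (a minimal sketch with an illustrative capacity):

```python
from collections import deque

class FreshnessBuffer:
    """Priority-by-recency buffer: the trainer always consumes the newest
    rollouts first, and the stalest ones fall off the other end."""

    def __init__(self, capacity=4096):
        self.items = deque(maxlen=capacity)   # left = stalest, right = freshest

    def add(self, rollout):
        self.items.append(rollout)            # evicts the stalest when full

    def sample(self, n):
        n = min(n, len(self.items))
        return [self.items.pop() for _ in range(n)]   # newest first (LIFO)
```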
If the model’s success rate on a task is low, avoid raw GRPO; instead increase the proportion of positive samples in the loss via token‑level or sample‑level filtering, or boost the advantage weight for positives. This counters the dominance of negative samples, which otherwise leads to collapse because the action space of LLM RL (vocabulary × sequence length) is enormous and poorly understood.
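One way to implement the reweighting, sketched with illustrative knobs (`pos_weight` and `keep_neg_fraction` are not values from the article):

```python
import torch

def reweighted_advantages(advantages, pos_weight=2.0, keep_neg_fraction=0.5):
    """Upweight positive-advantage samples and subsample negatives so early
    training is not dominated by failures."""
    adv = advantages.clone()
    pos = adv > 0
    adv[pos] = adv[pos] * pos_weight
    # Randomly drop a fraction of negative samples (sample-level filtering).
    neg_idx = torch.nonzero(~pos).squeeze(-1)
    drop = neg_idx[torch.rand(len(neg_idx)) > keep_neg_fraction]
    adv[drop] = 0.0   # zero advantage → no gradient contribution
    return adv
```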
When a verifiable reward signal exists, PPO is usually unnecessary: the critic's value estimates are often unreliable, especially when training data conflicts, so GRPO works better for objective tasks, while PPO is best reserved for subjective ones.
Base Model Selection
For post‑training RL, start with Qwen 2.5 or Mistral (Instruct or Base) and perform a brief supervised fine‑tuning (SFT) cold start before RL. The <think> tag is not a distinct token in the vocabulary, so it is safer to replace it with a more natural token.
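Whether a candidate tag is a single token is easy to check directly; the model name below is just one example of a base model you might start from:

```python
from transformers import AutoTokenizer

# If a tag splits into several pieces, it is not a distinct token
# in that model's vocabulary.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
for tag in ["<think>", "</think>", "Thought:", "Reasoning:"]:
    ids = tok.encode(tag, add_special_tokens=False)
    print(tag, "->", ids, tok.convert_ids_to_tokens(ids))
```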
Avoid the Llama series for this purpose; their chain‑of‑thought capabilities are weaker during pre‑training, leading to odd RL‑generated conclusions. Qwen models already exhibit strong reasoning behavior after pre‑training, making them more suitable.
Thinking Model Post‑Training
Fine‑tuning a model that already exhibits “thinking” or reasoning behavior is especially tricky. Unlike standard instruction models, thinking models concatenate multiple single‑turn dialogues rather than maintaining a generic multi‑turn context.
For example, Qwen 3 removes the thinking segment of Turn 1 when processing Turn 2 to keep context length short. This breaks the standard GRPO workflow, which masks the input part of a long trajectory and back‑propagates from the tail. Consequently, papers and implementations that assume a conventional multi‑turn GRPO need to be re‑thought for such models.
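One way to catch this in practice is to re-render the conversation with the model's chat template and compare it against the tokens the policy actually saw at rollout time; a sketch, assuming a HuggingFace tokenizer:

```python
def training_context_matches_rollout(tokenizer, messages, rollout_prompt_ids):
    """Re-render the conversation with the chat template and check whether the
    tokens match what the policy saw during rollout. If the template strips
    earlier thinking segments (as Qwen3-style thinking models do), the
    re-rendered context diverges and a naive masked multi-turn GRPO
    trajectory is no longer valid."""
    rendered = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True
    )
    return rendered == list(rollout_prompt_ids)
```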
Tool‑call heavy scenarios (e.g., Kimi) add further complexity: each turn may contain several tool calls, and the thinking fragments associated with earlier calls are often dropped in later turns, disrupting the continuity required for RL.
If the exact RL algorithm and context‑modification rules used during the original training of a thinking model are unknown, attempting post‑training carries high risk.
Temperature handling is another overlooked factor. Many open-source models recommend a specific temperature and top-p for inference; for RL it is advisable to roll out at temperature 1 (or the official recommendation) and to evaluate with the same settings used in training, so the rollout, training, and evaluation distributions stay consistent.
Finally, token-character conversion inconsistencies can appear in multi-turn dialogues: a generated token sequence may change when decoded to text and re-encoded to tokens, creating mismatches between what the model actually generated and what is fed back for training.
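A simple round-trip check catches these mismatches before they silently shift the training log-probabilities:

```python
def roundtrip_consistent(tokenizer, token_ids):
    """Detect decode→re-encode mismatches: the re-encoded ids can differ from
    the generated ones (merged tokens, special characters, whitespace), in
    which case training on the re-encoded text no longer matches the rollout."""
    text = tokenizer.decode(token_ids)
    return tokenizer.encode(text, add_special_tokens=False) == list(token_ids)
```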
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
