Key Lessons from Scaling Agent RL Training: Stability, Tooling, and Reward Design
Over recent months of extensive agent reinforcement‑learning experiments across search, data‑analysis, and multi‑source scenarios, the author shares twelve practical insights covering stability, environment‑reward‑algorithm priorities, tool‑call reliability, reward hacking pitfalls, evaluation alignment, and scaling tricks for larger models.
Over recent months I have run large‑scale reinforcement‑learning (RL) training for agentic systems such as search agents and data‑analysis agents. The experiments spanned dense and mixture‑of‑experts (MoE) models as well as single‑source and multi‑source data pipelines, and produced both successes and failures. The observations below distill the most important technical lessons.
1. Stability is the primary prerequisite for production‑grade RL
RL pipelines must remain stable over long‑running training runs before any scaling is possible; instability wastes GPU hours and forces costly experiment restarts. We addressed the common sources of instability, in particular train‑inference mismatch, and adopted PPO‑EWMA (PPO with an exponentially weighted moving average of the policy) as the default algorithm. These choices are now part of the standard production stack.
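For readers unfamiliar with PPO‑EWMA: roughly, it maintains an exponentially averaged copy of the policy and computes the clipped importance ratio against that copy instead of the snapshot taken at the start of each PPO iteration. A minimal sketch of the averaging step, assuming PyTorch‑style policies; the decay value is illustrative, not our production setting:

```python
import torch

@torch.no_grad()
def update_ewma_policy(ewma_policy, policy, decay=0.99):
    """Fold the online policy weights into an exponentially averaged copy.

    PPO-EWMA uses this averaged policy as the proximal policy when computing
    the clipped importance ratio, which keeps the reference policy from going
    stale abruptly and helps stabilize long training runs.
    """
    for p_avg, p in zip(ewma_policy.parameters(), policy.parameters()):
        p_avg.mul_(decay).add_(p, alpha=1.0 - decay)
```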
2. Agentic RL follows the traditional RL priority order
When RL is decomposed into environment → reward → algorithm, reasoning‑centric RL often treats the algorithm as the most important component (algorithm > reward > environment). In practical agentic settings the priority reverses: a reliable environment is the foundation, followed by a well‑designed reward signal, and finally the learning algorithm.
3. A robust tool‑calling environment is essential
Agentic tasks frequently require external tools (search APIs, data‑fetchers, code executors). We continuously monitor tool‑call failure rates and resolve issues before launching new RL scenarios. Ensuring the environment can handle high‑concurrency tool calls prevents exponential cost growth and removes a hard ceiling on model performance.
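To make "reliable" concrete, a minimal sketch of a tool‑call wrapper with a timeout, bounded retries, a concurrency cap, and per‑tool failure counters; the names and limits are illustrative, not our production values:

```python
import asyncio
import random
from collections import Counter

call_stats = Counter()                 # (tool_name, "ok"/"fail") -> count
semaphore = asyncio.Semaphore(64)      # cap on concurrent tool calls (illustrative)

async def call_tool(tool_name, tool_fn, payload, retries=3, timeout=15.0):
    """Invoke an external tool (search API, code executor, ...) defensively.

    Failures are counted so the per-tool failure rate can be monitored and
    fixed before a new RL scenario is launched.
    """
    for attempt in range(retries):
        try:
            async with semaphore:
                result = await asyncio.wait_for(tool_fn(payload), timeout=timeout)
            call_stats[(tool_name, "ok")] += 1
            return result
        except Exception:
            call_stats[(tool_name, "fail")] += 1
            await asyncio.sleep(2 ** attempt + random.random())  # backoff + jitter
    return {"error": f"{tool_name} unavailable after {retries} attempts"}
```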
4. Using LLM‑as‑judge for reward can induce reward hacking
When verifiable rewards (e.g., exact math or code correctness) are unavailable, many pipelines substitute an LLM to judge outputs. This approach is prone to reward hacking: the model learns to exploit quirks of the judge rather than solving the intended task. In one month of training we observed three large, spurious spikes in test‑set scores that were later traced to such hacking.
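One cheap guard is to keep a small probe set that does have programmatically checkable answers and alert when the judge's scores drift away from the verifiable metric; a widening gap is the typical signature of the policy exploiting judge quirks rather than improving. A sketch with hypothetical names and thresholds:

```python
def judge_drift_alert(judge_scores, verifiable_scores, window=500, max_gap=0.15):
    """Flag probable reward hacking of an LLM judge.

    `judge_scores` and `verifiable_scores` are parallel lists collected on a
    probe set with checkable ground truth. If the judge's recent mean rises
    while the verifiable metric does not, the gap widens and this returns True.
    """
    recent_judge = judge_scores[-window:]
    recent_truth = verifiable_scores[-window:]
    mean_judge = sum(recent_judge) / max(len(recent_judge), 1)
    mean_truth = sum(recent_truth) / max(len(recent_truth), 1)
    return (mean_judge - mean_truth) > max_gap
```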
5. Minimise manual reward engineering
Even when handcrafted reward functions are necessary, they should be iterated cautiously. Small changes can open new hacking pathways, so each revision must be validated with stress tests and sanity checks.
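One way to make those sanity checks concrete is a regression suite of pinned transcripts that every reward revision must pass. The sketch below is a hypothetical pytest‑style harness, not our actual test suite; the fixtures are deliberately elided:

```python
# test_reward.py -- run after every reward-function revision (hypothetical).
from my_reward import compute_reward  # hypothetical import

GOOD_TRANSCRIPTS = [...]   # trajectories known to deserve high reward
HACK_TRANSCRIPTS = [...]   # previously discovered hacks (empty answers,
                           # judge-flattering boilerplate, tool-skipping, ...)

def test_good_trajectories_still_score_high():
    assert all(compute_reward(t) > 0.8 for t in GOOD_TRANSCRIPTS)

def test_known_hacks_still_score_low():
    # Every hack found in production gets pinned here, so a reward tweak
    # can never silently reopen an old exploitation pathway.
    assert all(compute_reward(t) < 0.2 for t in HACK_TRANSCRIPTS)
```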
6. Align training and evaluation environments
If the evaluation environment cannot be made identical to the training environment (for example, it must use a different set of tools), verify that evaluation scores remain comparable. Mismatched toolsets can truncate model outputs, causing apparent score drops that do not reflect true training quality.
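A quick comparability check that catches this class of problem is to run the same checkpoint through both environments and compare truncation rates; if they differ markedly, a score gap says more about the environments than about training quality. A sketch with hypothetical field names:

```python
def truncation_rate(episodes):
    """Fraction of episodes cut off by token or tool-call limits.

    `episodes` is assumed to be a list of dicts produced by either the
    training or the evaluation environment for the same checkpoint.
    """
    truncated = sum(
        1 for e in episodes if e.get("hit_token_limit") or e.get("hit_tool_limit")
    )
    return truncated / max(len(episodes), 1)
```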
7. Comprehensive monitoring and pre‑deployment stress testing
Before conducting data or algorithm ablations, ensure both the environment and reward signals are fully instrumented. Run stress tests that simulate peak tool‑call concurrency and extreme reward distributions to confirm the system is ready for systematic experimentation.
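A concurrency stress test can be as small as the sketch below: fire a burst of dummy tool calls at the expected peak rate and assert that the observed failure rate stays within budget (extreme reward distributions can be probed the same way by replaying synthetic trajectories). Names and numbers are illustrative:

```python
import asyncio

async def stress_test(tool_fn, payload, concurrency=256, max_fail_rate=0.01):
    """Simulate peak tool-call concurrency and verify the environment holds up."""
    async def one_call():
        try:
            await asyncio.wait_for(tool_fn(payload), timeout=30.0)
            return True
        except Exception:
            return False

    results = await asyncio.gather(*(one_call() for _ in range(concurrency)))
    fail_rate = 1.0 - sum(results) / len(results)
    assert fail_rate <= max_fail_rate, f"tool failure rate {fail_rate:.1%} too high"
```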
8. PPO‑EWMA often outperforms standard on‑policy methods given sufficient resources
When GPU capacity permits, PPO‑EWMA typically yields higher final scores than vanilla on‑policy algorithms. Scaling batch size and group size further improves compute efficiency, learning speed, and exploration breadth.
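If "group size" is read as the number of rollouts sampled per prompt, as in group‑relative advantage schemes, one reason larger groups help is that the per‑prompt baseline becomes less noisy. A minimal sketch of that normalization under this assumption:

```python
import torch

def group_relative_advantages(rewards, group_size):
    """Normalize rewards within each group of rollouts that share a prompt.

    Assumes `rewards` is a flat tensor of shape [num_prompts * group_size],
    with consecutive entries belonging to the same prompt. A larger group
    gives a lower-variance per-prompt baseline, improving learning speed.
    """
    grouped = rewards.view(-1, group_size)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)
    return ((grouped - mean) / (std + 1e-6)).view(-1)
```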
9. RL grokking is a reproducible phenomenon worth systematic study
We repeatedly observed sudden, dramatic improvements in performance after prolonged plateaus—behaviour analogous to "grokking" in supervised learning. This effect appears especially pronounced in on‑policy experiments, suggesting a fertile research direction.
10. Tool‑level exploration must be tracked
When an environment offers multiple essential tools or files, log each tool’s invocation frequency and success rate. Failure to call a required tool can cap the agent’s capability ceiling, so automated alerts for under‑used tools are recommended.
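A minimal sketch of such an alert, with the required‑tool list and threshold entirely hypothetical: count invocations per batch of trajectories and flag any required tool whose call rate falls suspiciously low.

```python
from collections import Counter

REQUIRED_TOOLS = {"search", "read_file", "run_sql"}   # scenario-specific, hypothetical
MIN_CALL_RATE = 0.05                                   # illustrative threshold

def underused_tools(trajectories):
    """Return required tools whose invocation rate across a batch is too low.

    Each trajectory is assumed to carry the list of tool names it invoked.
    Wire the result into an automated alert so a capped capability ceiling is
    noticed during training rather than after it.
    """
    calls = Counter(t for traj in trajectories for t in traj["tools_called"])
    n = max(len(trajectories), 1)
    return {tool for tool in REQUIRED_TOOLS if calls[tool] / n < MIN_CALL_RATE}
```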
11. Larger models converge faster in RL
Techniques that boost RL performance on small dense models (e.g., specialised learning‑rate schedules or reward shaping tricks) often become unnecessary for larger models, which tend to generalise and converge more quickly.
12. Continuing RL does not always require higher clipping thresholds
Even with a low initial entropy, raising the PPO clipping value is not mandatory. Instead, strengthen entropy through data selection or auxiliary losses; the model will naturally increase exploration, causing entropy to rise throughout training.
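A minimal sketch of the auxiliary‑loss route, assuming a token‑level categorical policy in PyTorch; the coefficient is illustrative:

```python
import torch

def policy_loss_with_entropy_bonus(clipped_surrogate, logits, entropy_coef=1e-3):
    """Add an entropy bonus instead of loosening the PPO clipping threshold.

    `clipped_surrogate` is the usual PPO objective already expressed as a loss;
    the bonus rewards keeping the per-token distribution spread out, so
    exploration is encouraged without touching the clip range.
    """
    entropy = torch.distributions.Categorical(logits=logits).entropy().mean()
    return clipped_surrogate - entropy_coef * entropy
```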
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.