AReaL 2.0 Launch: Micro‑Service Architecture Brings Online RL to Agent Applications
AReaL 2.0 re‑architects agentic reinforcement learning as a set of decoupled micro‑services, allowing existing agents to join an online RL loop with minimal code changes while addressing engineering gaps such as data conversion, multi‑turn modeling, and weight synchronization.
Large‑model agents are moving from isolated calls to complex systems that plan, invoke tools, manage memory, and handle multi‑turn interactions. Once deployed, however, their capabilities are frozen, even though millions of real‑world interactions generate valuable learning signals that are hard to feed back into training pipelines.
The authors identify five practical obstacles: (1) a split between agent application code and RL training infrastructure; (2) difficulty converting live trajectories into training data; (3) challenges modeling multi‑turn conversations, tool calls, and delayed rewards; (4) lack of standardized collaboration among inference, training, and weight‑sync services; and (5) the high engineering cost of retrofitting existing agents for online RL.
To solve these problems, AReaL 2.0 introduces the core concept of RL as Micro‑Service. Training, inference, and weight‑update capabilities are each packaged as independent, deployable services that can be freely combined. This decoupling turns a monolithic RL system into a plug‑and‑play runtime for agents.
The modular design supports three established paradigms: stand‑alone training (SFT), combined training + inference (OPD), and full RL loops. It also enables new research directions, such as pairing inference with agent services for lightweight self‑evolution of memory, system prompts, and skills.
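The composition idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the class names (ToyInference, ToyTrainer, ToyWeightSync) and the run function are hypothetical and are not AReaL 2.0 APIs, and the mapping of service combinations to paradigms is simplified. The point is only that the same loop behaves differently depending on which services are plugged in.

```python
from dataclasses import dataclass


@dataclass
class ToyInference:
    """Hypothetical inference service; remembers which weight version it serves."""
    version: int = 0

    def generate(self, prompt: str) -> str:
        return f"v{self.version}:{prompt}"


@dataclass
class ToyTrainer:
    """Hypothetical training service; every step yields a new weight version."""
    version: int = 0

    def step(self, batch: list) -> int:
        self.version += 1
        return self.version


class ToyWeightSync:
    """Hypothetical weight-update service; pushes a version to inference."""

    def push(self, version: int, target: ToyInference) -> None:
        target.version = version


def run(prompts, trainer, inference=None, sync=None, steps=2):
    """Compose whichever services are present (illustrative mapping only).

    trainer only                 -> stand-alone training on fixed data
    trainer + inference          -> training on generated data, no sync back
    trainer + inference + sync   -> full loop with weights fed back to inference
    """
    seen = []
    for _ in range(steps):
        if inference is not None:
            batch = [inference.generate(p) for p in prompts]
        else:
            batch = list(prompts)
        seen.extend(batch)
        version = trainer.step(batch)
        if inference is not None and sync is not None:
            sync.push(version, inference)
    return seen


full_loop = run(["a"], ToyTrainer(), ToyInference(), ToyWeightSync())
no_sync = run(["a"], ToyTrainer(), ToyInference())
train_only = run(["a"], ToyTrainer())
print(full_loop, no_sync, train_only)
```

In the full loop the second batch is generated by the updated weights, while without the weight-sync service the inference side keeps serving the initial version. In a real deployment each class would sit behind a network interface rather than in one process.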
Key service modules include:
Data processing: selects seed data where at least one external model succeeds and rewrites issue descriptions to match a golden patch.
Agent infra: a sandbox‑based distributed scheduler that runs tens of thousands of environment instances concurrently, with millisecond‑level fork startup and image warm‑up to avoid dirty data during RL.
Algorithm stabilization: the KPop strategy performs token‑level adaptive filtering to resolve log‑probability mismatches between training and inference engines; reward‑hacking is mitigated by disabling risky operations in the harness, ensuring token‑in‑token‑out alignment.
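The article does not describe how KPop works, so the following is not KPop. It is a minimal sketch of the general idea of token-level filtering on train/inference log-probability mismatch, assuming a fixed threshold rather than an adaptive one. The function names and the max_abs_diff parameter are hypothetical.

```python
def mismatch_mask(train_logps, infer_logps, max_abs_diff=0.5):
    """Return one boolean per token: True if the token is kept for the loss.

    A token is dropped when the training engine and the inference engine
    disagree on its log-probability by more than max_abs_diff (an assumed,
    fixed threshold; an adaptive rule would compute it from the data).
    """
    if len(train_logps) != len(infer_logps):
        raise ValueError("both engines must score the same token sequence")
    return [abs(t - i) <= max_abs_diff for t, i in zip(train_logps, infer_logps)]


def masked_mean(values, mask):
    """Average per-token values over kept tokens only (0.0 if none are kept)."""
    kept = [v for v, keep in zip(values, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Three tokens; the second one is scored very differently by the two engines.
train_logps = [-1.0, -2.0, -0.5]
infer_logps = [-1.1, -3.5, -0.5]
mask = mismatch_mask(train_logps, infer_logps)
loss = masked_mean([1.0, 10.0, 3.0], mask)
print(mask, loss)
```

Filtering per token rather than per sequence keeps the well-matched tokens of a trajectory in the batch instead of discarding the whole rollout.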
After roughly 800 training steps, the model shows a stable score increase, providing a concrete reference for reproducing the Claude Code Agent RL workflow, swapping custom task environments, or building bespoke software‑engineering agents.
Two practical demos illustrate the platform:
Claude Code Agent RL: a full‑stack example that combines algorithmic design with the new infrastructure.
Hermes Agent Online RL: demonstrates a black‑box integration in which an existing agent receives asynchronous training updates and evolves without restarts. Users can swap the demo agent for their own agent and task while reusing the same decoupled architecture.
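One way to picture "asynchronous updates without restarts" is a policy holder that a background trainer can update while the agent keeps serving. This is a sketch under stated assumptions, not the Hermes demo's code: HotSwapPolicy, publish, snapshot, and agent_turn are hypothetical names, and the "weights" are a plain dict standing in for model parameters.

```python
import threading


class HotSwapPolicy:
    """Holds the current weights; a trainer may publish newer ones at any time."""

    def __init__(self, weights, version=0):
        self._lock = threading.Lock()
        self._weights = weights
        self._version = version

    def publish(self, weights, version):
        """Install new weights, ignoring stale (older or equal) versions."""
        with self._lock:
            if version > self._version:
                self._weights, self._version = weights, version

    def snapshot(self):
        """Return a consistent (weights, version) pair for one agent turn."""
        with self._lock:
            return self._weights, self._version


def agent_turn(policy, observation):
    """One agent step; records the policy version that produced the action."""
    weights, version = policy.snapshot()
    return {"action": weights["bias"] + observation, "policy_version": version}


policy = HotSwapPolicy({"bias": 1})
before = agent_turn(policy, 2)

# A background "trainer" publishes version 1 while the agent process stays up.
trainer = threading.Thread(target=policy.publish, args=({"bias": 10}, 1))
trainer.start()
trainer.join()

after = agent_turn(policy, 2)
policy.publish({"bias": 99}, 0)  # stale update, ignored
still = agent_turn(policy, 2)
print(before, after, still)
```

Tagging each action with the policy version that produced it is a design choice that lets a training service tell which trajectories came from older weights.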
The project has been open‑sourced since May, joined the PyTorch Foundation ecosystem, and now collaborates with MindLab to deliver low‑compute, end‑to‑end RL‑as‑a‑service solutions. The accompanying technical report (arXiv:2607.01120) details the design and evaluation.
In summary, AReaL 2.0 upgrades the agentic RL stack to a micro‑service‑based, online learning platform that lowers integration barriers, supports large‑scale distributed execution, and lays a foundation for agents that continuously improve from real interactions under governance constraints.
