
AReaL 2.0 Launch: Micro‑Service Architecture Brings Online RL to Agent Applications

AReaL 2.0 re‑architects agentic reinforcement learning as a set of decoupled micro‑services, allowing existing agents to join an online RL loop with minimal code changes while addressing engineering gaps such as data conversion, multi‑turn modeling, and weight synchronization.

AntTech

Jul 3, 2026


Large‑model agents are moving from isolated calls to complex systems that plan, invoke tools, manage memory, and handle multi‑turn interactions. Once deployed, however, their capabilities stay fixed: millions of real‑world interactions generate valuable learning signals, but those signals are hard to feed back into training pipelines.

The authors identify five practical obstacles: (1) a split between agent application code and RL training infrastructure; (2) difficulty converting live trajectories into training data; (3) challenges modeling multi‑turn conversations, tool calls, and delayed rewards; (4) lack of standardized collaboration among inference, training, and weight‑sync services; and (5) the high engineering cost of retrofitting existing agents for online RL.
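Obstacles (2) and (3) can be made concrete with a small sketch. The Python below is an illustration only, not AReaL's data format: the `Turn` and `TrainingSample` types, the `trajectory_to_sample` helper, and the discounting scheme are all assumptions introduced here. It flattens a multi‑turn trajectory into one token sequence, marks which tokens the policy itself produced, and credits a single delayed reward back to the end of each assistant turn.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str          # "user", "assistant", or "tool" (hypothetical role names)
    tokens: List[int]  # token ids for this turn


@dataclass
class TrainingSample:
    input_ids: List[int]
    loss_mask: List[int]   # 1 where the policy generated the token, else 0
    rewards: List[float]   # per-token reward; zero except at credited positions


def trajectory_to_sample(turns: List[Turn], final_reward: float,
                         gamma: float = 1.0) -> TrainingSample:
    """Flatten a multi-turn trajectory and spread a delayed reward over it."""
    input_ids: List[int] = []
    loss_mask: List[int] = []
    assistant_ends: List[int] = []  # index of the last token of each assistant turn

    for turn in turns:
        is_policy = turn.role == "assistant"
        input_ids.extend(turn.tokens)
        # Only assistant tokens are trained on; user and tool tokens are context.
        loss_mask.extend([1 if is_policy else 0] * len(turn.tokens))
        if is_policy and turn.tokens:
            assistant_ends.append(len(input_ids) - 1)

    rewards = [0.0] * len(input_ids)
    n = len(assistant_ends)
    for k, idx in enumerate(assistant_ends):
        # Earlier turns receive the final reward discounted by their distance
        # from the last assistant turn (an assumed credit-assignment rule).
        rewards[idx] = final_reward * gamma ** (n - 1 - k)

    return TrainingSample(input_ids, loss_mask, rewards)


turns = [
    Turn("user", [1, 2]),
    Turn("assistant", [3, 4]),   # e.g. a tool call
    Turn("tool", [5]),           # tool output, not produced by the policy
    Turn("assistant", [6]),      # final answer
]
sample = trajectory_to_sample(turns, final_reward=1.0, gamma=0.5)
```

The loss mask is the part that tends to go wrong in practice: tool outputs and user messages sit in the same sequence as model output but must not contribute to the policy loss.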

To solve these problems, AReaL 2.0 introduces the core concept of RL as Micro‑Service. Training, inference, and weight‑update capabilities are each packaged as independent, deployable services that can be freely combined. This decoupling turns a monolithic RL system into a plug‑and‑play runtime for agents.
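The decoupling can be pictured as three narrow interfaces composed by a thin loop. The sketch below is a toy, in‑process illustration and does not reflect AReaL's actual APIs: the `InferenceService`, `TrainingService`, and `WeightSyncService` protocols, the toy implementations, and `rl_iteration` are all hypothetical names. In a real deployment each service would sit behind a network boundary.

```python
from typing import Callable, Dict, List, Protocol


class InferenceService(Protocol):
    def generate(self, prompt: str) -> str: ...
    def load_weights(self, version: int, weights: Dict[str, float]) -> None: ...


class TrainingService(Protocol):
    def train_step(self, batch: List[dict]) -> Dict[str, float]: ...


class WeightSyncService(Protocol):
    def publish(self, weights: Dict[str, float],
                targets: List[InferenceService]) -> int: ...


class EchoInference:
    """Toy inference service that tags its output with the weight version."""
    def __init__(self) -> None:
        self.version = 0
        self.weights: Dict[str, float] = {}

    def generate(self, prompt: str) -> str:
        return f"v{self.version}:{prompt}"

    def load_weights(self, version: int, weights: Dict[str, float]) -> None:
        self.version, self.weights = version, dict(weights)


class CountingTrainer:
    """Toy trainer whose single 'weight' accumulates the batch reward."""
    def __init__(self) -> None:
        self.weights = {"w": 0.0}

    def train_step(self, batch: List[dict]) -> Dict[str, float]:
        self.weights["w"] += sum(item["reward"] for item in batch)
        return dict(self.weights)


class BroadcastSync:
    """Toy weight-sync service that pushes each new version to all targets."""
    def __init__(self) -> None:
        self.version = 0

    def publish(self, weights: Dict[str, float],
                targets: List[InferenceService]) -> int:
        self.version += 1
        for target in targets:
            target.load_weights(self.version, weights)
        return self.version


def rl_iteration(prompts: List[str], reward_fn: Callable[[str, str], float],
                 inference: InferenceService, trainer: TrainingService,
                 sync: WeightSyncService) -> int:
    """One rollout -> train -> sync cycle; returns the new weight version."""
    batch = []
    for prompt in prompts:
        response = inference.generate(prompt)
        batch.append({"prompt": prompt, "response": response,
                      "reward": reward_fn(prompt, response)})
    new_weights = trainer.train_step(batch)
    return sync.publish(new_weights, [inference])


inference, trainer, sync = EchoInference(), CountingTrainer(), BroadcastSync()
version = rl_iteration(["a", "b"], lambda p, r: 1.0, inference, trainer, sync)
```

Because the loop depends only on the three interfaces, any one service can be replaced (for example, dropping the sync service for a training‑only run) without touching the others.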

The modular design supports three established paradigms: stand‑alone training (SFT), combined training + inference (OPD), and full RL loops. It also enables new research directions, such as pairing inference with agent services for lightweight self‑evolution of memory, system prompts, and skills.

Key service modules include:

Data processing: selects seed data where at least one external model succeeds and rewrites issue descriptions to match a golden patch.

Agent infra: a sandbox‑based distributed scheduler that runs tens of thousands of environment instances concurrently, with millisecond‑level fork startup and image warm‑up to avoid dirty data during RL.

Algorithm stabilization: the KPop strategy performs token‑level adaptive filtering to resolve log‑probability mismatches between training and inference engines; reward‑hacking is mitigated by disabling risky operations in the harness, ensuring token‑in‑token‑out alignment.
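The article does not describe how KPop works internally, so the following is not KPop. It is a deliberately simple stand‑in that shows the general idea of token‑level filtering: drop tokens whose log‑probability differs too much between the training engine and the inference engine before they enter the loss. The function names and the fixed `max_abs_gap` threshold are assumptions made for illustration; an adaptive method would choose the cutoff from the data rather than fixing it.

```python
from typing import List


def mismatch_mask(train_logprobs: List[float], infer_logprobs: List[float],
                  max_abs_gap: float = 0.5) -> List[int]:
    """Return 1 for tokens whose two log-probs agree within max_abs_gap, else 0."""
    if len(train_logprobs) != len(infer_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [
        1 if abs(t - i) <= max_abs_gap else 0
        for t, i in zip(train_logprobs, infer_logprobs)
    ]


def masked_mean(values: List[float], mask: List[int]) -> float:
    """Average only the kept tokens; returns 0.0 if every token was filtered."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(v * m for v, m in zip(values, mask)) / kept


# Log-probs of the same sampled tokens as scored by each engine (made-up numbers).
train_lp = [-1.0, -2.0, -0.1]
infer_lp = [-1.1, -3.5, -0.1]
mask = mismatch_mask(train_lp, infer_lp, max_abs_gap=0.5)

# Per-token losses (also made up); the second token is excluded from the mean.
loss = masked_mean([2.0, 10.0, 4.0], mask)
```

Filtering at the token level, rather than discarding whole trajectories, keeps the well‑aligned tokens from a rollout that contains only a few badly mismatched positions.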

After roughly 800 training steps, the model shows a stable score increase, providing a concrete reference for reproducing the Claude Code Agent RL workflow, swapping custom task environments, or building bespoke software‑engineering agents.

Two practical demos illustrate the platform:

Claude Code Agent RL: a full‑stack example that combines algorithmic design with the new infrastructure.

Hermes Agent Online RL: demonstrates a black‑box integration in which an existing agent receives asynchronous training updates and evolves without restarts. The demo agent can be swapped for any custom task while reusing the same decoupled architecture.
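Evolving "without restarts" implies that a serving process can accept new weights while it keeps answering requests. The following is a minimal sketch of that idea under stated assumptions, not the Hermes demo's implementation: `HotSwappablePolicy`, `serve`, and the single `scale` weight are hypothetical. Updates replace the weight dictionary atomically under a lock, and stale or duplicate versions are rejected.

```python
import threading
from typing import Dict, Tuple


class HotSwappablePolicy:
    """Holds versioned weights that can be replaced while requests are served."""

    def __init__(self, weights: Dict[str, float]) -> None:
        self._lock = threading.Lock()
        self._version = 0
        self._weights = dict(weights)

    def snapshot(self) -> Tuple[int, Dict[str, float]]:
        # Updates replace the dict instead of mutating it, so a snapshot
        # stays internally consistent for the whole request.
        with self._lock:
            return self._version, self._weights

    def apply_update(self, version: int, weights: Dict[str, float]) -> bool:
        """Install newer weights; ignore stale or duplicate versions."""
        with self._lock:
            if version <= self._version:
                return False
            self._weights = dict(weights)
            self._version = version
            return True


def serve(policy: HotSwappablePolicy, x: float) -> Tuple[int, float]:
    """Toy 'inference': scale the input with whatever weights are current."""
    version, weights = policy.snapshot()
    return version, weights["scale"] * x


policy = HotSwappablePolicy({"scale": 1.0})
before = serve(policy, 3.0)

# An asynchronous trainer pushes version 1 from another thread.
updater = threading.Thread(target=policy.apply_update, args=(1, {"scale": 2.0}))
updater.start()
updater.join()

after = serve(policy, 3.0)
stale_accepted = policy.apply_update(1, {"scale": 9.0})
```

Tagging each response with the weight version that produced it also helps on the training side, since a trajectory can then be attributed to the exact policy that generated it.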

The project has been open source since May, has joined the PyTorch Foundation ecosystem, and now collaborates with MindLab to deliver low‑compute, end‑to‑end RL‑as‑a‑service solutions. The accompanying technical report (arXiv:2607.01120) details the design and evaluation.

In summary, AReaL 2.0 upgrades the agentic RL stack to a micro‑service‑based, online learning platform that lowers integration barriers, supports large‑scale distributed execution, and lays a foundation for agents that continuously improve from real interactions under governance constraints.



