Unlock Scalable RL: AReaL’s Decoupled Agentic Framework & Single‑Controller Design

This article explains how the open‑source AReaL framework scales reinforcement learning for large models by separating agent execution from training logic, introducing a decoupled agentic RL service and a Single‑Controller architecture that together improve data flow, fault tolerance, and GPU utilization.


Overview

AReaL is an open‑source reinforcement‑learning (RL) framework designed for large‑scale models (e.g., trillion‑parameter Ring‑1T). It provides a lightweight API and an extensible plugin system that decouples agent execution from RL training, enabling developers to focus on algorithm design.

Decoupled Agentic RL

Traditional agentic RL tightly couples training logic with the agent, making reuse and debugging difficult. AReaL adopts an "Agent Autonomy + RL as Observer" design:

Agent Autonomy: The agent is a pure LLM‑based decision system that receives inputs, calls tools, generates actions, and returns results without awareness of the training process.

RL as Observer: AReaL records each interaction as a trajectory (input, thought chain, action, observation, reward) for downstream RL algorithms.
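
For intuition, a recorded trajectory step might carry fields like the ones below. This dataclass is purely illustrative and is not part of the AReaL API:

from dataclasses import dataclass, field
from typing import Any

@dataclass
class TrajectoryStep:
    """One recorded interaction (illustrative only, not an AReaL class)."""
    prompt: str            # input the LLM received
    thought: str           # chain-of-thought text produced by the model
    action: str            # tool call or final answer emitted by the agent
    observation: Any       # result returned by the tool or environment
    reward: float = 0.0    # scalar reward assigned to the episode

@dataclass
class Trajectory:
    """A full episode as seen by the RL observer."""
    steps: list[TrajectoryStep] = field(default_factory=list)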

Workflow

Agent launch & proxy wrapper – Users implement a single async function, async def run_agent_return_reward(data: Any) -> float, that runs the agent and returns a scalar reward.

Trajectory collection – AReaL opens a session for each run, caching the input query, LLM token outputs, tool results, and the computed reward.

RL training – Collected trajectories are sorted, discounted using a user‑specified discount factor (a minimal discounting sketch follows this list), and fed to any standard RL algorithm. Updated policy weights are then sent back to the agent.

Model deployment & closed‑loop iteration – Trained models can be exported in HuggingFace format and deployed without code changes; new interactions are continuously collected to form a feedback loop.
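
For concreteness, the discounting mentioned in the RL‑training step can be written as a small standalone function; the function name and default gamma here are illustrative, not part of the AReaL API:

def discounted_returns(rewards: list[float], gamma: float = 0.99) -> list[float]:
    """Turn per-step rewards into discounted returns (iterating from the last step)."""
    returns: list[float] = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# Example: discounted_returns([0.0, 0.0, 1.0], gamma=0.9) -> [0.81, 0.9, 1.0]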

Agent interface example

from typing import Any

async def run_agent_return_reward(data: Any) -> float:
    """Run the agent on a single data sample and return the reward.

    Args:
        data: An element from the training dataset.
    Returns:
        reward: Float value representing the episode reward.
    """
    # User‑defined logic that calls the LLM, interacts with tools, etc.
    ...
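
As a purely illustrative single‑turn example, the function could grade an LLM answer against a reference. llm_complete and grade_answer below stand in for the user's own model call and reward logic; they are not AReaL APIs:

from typing import Any

async def llm_complete(prompt: str) -> str:
    # Placeholder for the user's LLM call (e.g., an OpenAI-compatible client).
    return "42"

def grade_answer(answer: str, reference: str) -> float:
    # Placeholder reward: 1.0 for an exact match, 0.0 otherwise.
    return 1.0 if answer.strip() == reference.strip() else 0.0

async def run_agent_return_reward(data: Any) -> float:
    answer = await llm_complete(data["question"])
    return grade_answer(answer, data["reference_answer"])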

Single‑Controller Architecture

The classic SPMD execution model suffers from long‑tail tasks and coarse‑grained control, limiting throughput and fault recovery in RL workloads. AReaL replaces it with a layered "Controller + Distributed Engine" design that separates control‑plane logic from data‑plane processing.

Controller (CPU node): Handles distributed scheduling and data aggregation, and exposes the same interface as the engine (see the sketch after this list).

Worker: Runs the engine, either in the same process or as a separate one, and abstracts the distributed data flow via DistributedBatch metadata.

Engine: Performs the parallel computation and is compatible with native SGLang, FSDP, Megatron, and similar backends.
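
The "same interface as the engine" idea can be sketched as follows: user code calls the controller exactly as it would call a single engine, and the controller shards the work across workers and aggregates the results. All names below are invented for illustration and are not AReaL APIs:

from typing import Any

class WorkerStub:
    """Stand-in for a worker that runs one engine instance."""
    def train_step(self, shard: list[Any]) -> float:
        return 0.0  # placeholder loss

class ControllerSketch:
    """Exposes the engine interface while handling scheduling and aggregation."""
    def __init__(self, workers: list[WorkerStub]):
        self.workers = workers

    def train_step(self, batch: list[Any]) -> float:
        # Control plane: shard the batch, dispatch to workers, aggregate losses.
        n = len(self.workers)
        shards = [batch[i::n] for i in range(n)]
        losses = [w.train_step(s) for w, s in zip(self.workers, shards)]
        return sum(losses) / len(losses)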

Metadata structures

from dataclasses import dataclass, field

@dataclass
class TensorMetadata:
    """Metadata for a tensor field."""
    shape: tuple[int, ...]
    dtype: str
    device: str = "cpu"

@dataclass
class ShardMetadata:
    """Metadata for a single (sub‑)shard stored on one node."""
    node_id: str
    node_addr: str
    shard_id: str
    batch_size: int
    offset: int = 0
    fields: dict[str, TensorMetadata] = field(default_factory=dict)

@dataclass
class BatchMetadata:
    """Metadata for a distributed batch sharded across multiple nodes."""
    batch_id: str
    global_step: int
    total_batch_size: int
    shards: list[ShardMetadata] = field(default_factory=list)
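
As a rough illustration, a 128-sample batch split evenly across two nodes could be described with these structures (all field values are invented for the example):

shard0 = ShardMetadata(
    node_id="node-0",
    node_addr="10.0.0.1:5000",
    shard_id="batch-7/shard-0",
    batch_size=64,
    offset=0,
    fields={"input_ids": TensorMetadata(shape=(64, 2048), dtype="int64")},
)
shard1 = ShardMetadata(
    node_id="node-1",
    node_addr="10.0.0.2:5000",
    shard_id="batch-7/shard-1",
    batch_size=64,
    offset=64,
    fields={"input_ids": TensorMetadata(shape=(64, 2048), dtype="int64")},
)
batch_meta = BatchMetadata(
    batch_id="batch-7",
    global_step=7,
    total_batch_size=128,
    shards=[shard0, shard1],
)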

Data‑flow RL process

Rollout Controller gathers metadata from inference engines.

Train Controller shards the metadata according to the data‑parallel strategy and dispatches shards to Workers.

Workers lazily pull required tensors via RPC, avoiding full tensor transfer (see the sketch below).

This metadata‑driven approach eliminates the single‑point bottleneck of SPMD and improves scalability.
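
The lazy pull in the last step could look roughly like the sketch below; fetch_tensor is a hypothetical RPC stub, not an AReaL function:

def fetch_tensor(addr: str, shard_id: str, name: str, meta: TensorMetadata):
    """Stub for the RPC that returns the actual tensor from the owning node."""
    raise NotImplementedError  # transport-specific in a real system

def pull_shard_tensors(shard: ShardMetadata, needed: list[str]) -> dict:
    """Fetch only the tensors this worker actually needs; until this point
    only metadata has crossed the network."""
    return {
        name: fetch_tensor(shard.node_addr, shard.shard_id, name, shard.fields[name])
        for name in needed
    }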

API Demo

Launching with the built‑in launcher:

# python3 -m areal.launcher.local script.py --config xxx.yaml

def main(args):
    actor = FSDPPPOActor(config=config.actor)
    actor.create_process_group(parallel_strategy=parallel_strategy)
    rollout = RemoteSGLangEngine(config.rollout)
    rollout.initialize(train_data_parallel_size=parallel_strategy.dp_size)
    # Load data on head rank and broadcast
    batch = None
    if actor.is_data_parallel_head():
        batch = rollout.prepare_batch(...)
        batch = tensor_container_to(batch, actor.device)
    batch = broadcast_tensor_container(
        batch,
        src_rank=actor.current_data_parallel_head(),
        group=...
    )

Using the controller directly (no launcher needed):

# python script.py --config xxx.yaml

def main(args):
    actor = TrainController(
        engine=FSDPPPOActor(config=config.actor),
        scheduler=LocalScheduler(...)
    )
    rollout = RolloutController(
        engine=RemoteSGLangEngine(config=config.rollout),
        scheduler=LocalScheduler(...)
    )
    batch = rollout.prepare_batch(...)
    # Controller automatically handles data distribution

Future Outlook

AReaL currently supports basic agentic RL pipelines and the single‑controller mode. Planned enhancements include:

High‑efficiency data flow and distributed startup for the Single‑Controller mode.

Automatic scaling and fault‑tolerant, high‑availability training.

Trajectory versioning, visualization platform, and richer analytics.

Further performance optimizations for large‑scale agentic scenarios.

Repository: https://github.com/inclusionAI/AReaL
