Which Agentic RL Framework Wins? A Deep Dive into AReal, Seer, Slime & verl

This article analyzes the training‑efficiency challenges of multi‑turn agentic reinforcement learning and compares four recent open‑source frameworks—AReal (Ant), Seer (Moonshot), Slime (Zhipu) and verl (ByteDance)—examining their asynchronous inference designs, rollout‑train separation, long‑context handling, off‑policy mitigation, and system‑level optimizations to guide framework selection.


Background and Motivation

This article analyzes the efficiency bottlenecks of multi‑turn agentic reinforcement learning (RL) and surveys recent system designs that address rollout‑train coordination, long‑tail sample handling, and off‑policy drift.

Pre‑training vs. Post‑training Efficiency

In the pre‑training era, scaling laws made FLOPs the primary bottleneck. Systems such as Megatron and FSDP maximize Model FLOPs Utilization (MFU) with tensor/sequence/expert parallelism, operator fusion, and overlapping communication with computation.

In the post‑training era, especially for algorithms like GRPO, the bottleneck shifts to rollout time. In agentic RL, more than 80% of wall‑clock time is spent on rollout, and the autoregressive nature of test‑time scaling prevents linear speed‑ups from simply adding GPUs.

Each rollout step depends on the previous action, so generation is inherently sequential and difficult to parallelize.

Rollout itself is a heavy GPU workload, so efficient train‑rollout coordination becomes a system‑level challenge.

Agentic RL Training Bottlenecks

Agentic RL can be divided into:

Single‑turn RL: one interaction, producing [prompt, response].

Multi‑turn RL: multiple interactions, producing an interleaved sequence such as [prompt, action1, obs1, action2, obs2, …, response]. For a 32B model, a naïve custom framework can exceed one hour per training step (see the sketch below).
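To make the trajectory layout concrete, here is a minimal sketch of how an interleaved multi‑turn rollout might be stored, assuming the common practice of masking tool observations out of the policy loss so that only model‑generated tokens are trained on. The class and field names are illustrative, not taken from any of the frameworks discussed.

```python
from dataclasses import dataclass, field

@dataclass
class MultiTurnTrajectory:
    """Interleaved rollout: [prompt, action1, obs1, action2, obs2, ..., response]."""
    segments: list = field(default_factory=list)  # (role, token_ids) pairs

    def add(self, role: str, token_ids: list) -> None:
        assert role in ("prompt", "action", "obs", "response")
        self.segments.append((role, token_ids))

    def flatten(self):
        """Return (tokens, loss_mask). Only model-generated tokens (actions and
        the final response) get loss_mask = 1; tool observations are masked out."""
        tokens, mask = [], []
        for role, ids in self.segments:
            tokens.extend(ids)
            mask.extend([1 if role in ("action", "response") else 0] * len(ids))
        return tokens, mask
```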

The main pain points observed in rollout‑train pipelines are:

Long context: lengthy chain‑of‑thought (CoT) outputs and deep tool‑call horizons increase decode time.

Bubbles: GPU idle time while waiting for other tasks, analogous to pipeline bubbles in pre‑training.

Long‑tail effect: batches contain trajectories of highly variable length; the longest trajectory determines batch completion, leaving other GPUs idle (a toy calculation follows this list).

Long tool‑call execution: serial calls to sandboxes, databases, or retrieval services cannot easily be compressed.
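A toy calculation makes the long‑tail effect concrete: under synchronous batching, wall‑clock time is set by the slowest trajectory, so GPU utilization is bounded by the mean‑to‑max length ratio. The decode lengths below are hypothetical.

```python
# Hypothetical decode lengths (in tokens) for one synchronous batch of rollouts.
lengths = [2_000, 3_000, 2_500, 4_000, 32_000]

batch_time = max(lengths)  # the batch only finishes when the longest rollout does
utilization = sum(lengths) / (len(lengths) * batch_time)
print(f"decode utilization ≈ {utilization:.0%}")  # ≈ 27%; the rest is idle bubble
```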

Conventional Acceleration Techniques

Reuse inference optimizations from vLLM and SGLang, apply FP8 quantization, and adopt improved speculative decoding.

Reduce total sequence length with context managers (e.g., Qwen AgentFold) or by mixing long/short CoT during training.

RL‑Specific System Strategies

Asynchronous: sacrifice some on‑policy fidelity for extreme throughput (e.g., replay‑buffer‑driven partial rollout).

Disaggregated: fully decouple rollout and training, enabling smooth switching and zero redundancy (IMPALA‑style).

Synchronous (load‑balanced): preserve on‑policy behavior while modeling rollout as load‑balanced tasks to eliminate bubbles.

AReal: Fully Asynchronous Design

AReal (Ant) follows an IMPALA/A3C‑style, fully asynchronous architecture. The core ideas are:

Stream Rollout: rollout runs on separate hardware and always uses the latest policy, reducing bubbles to near zero and allowing heterogeneous GPUs (e.g., H800 for training, L40/A10 for inference).

Staleness‑aware PPO: introduces a decoupled PPO objective with double‑layer importance sampling and a tunable staleness parameter η that bounds policy‑version lag (sketched after this list).

Interruptible Generation: when the replay buffer is low, long rollouts can be paused and short ones prioritized to keep the batch size stable.

System optimizations include CPU offload of reward computation, asyncio‑driven high‑concurrency rollout, and dynamic memory allocation for token‑balanced micro‑batches.
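The decoupled objective can be illustrated with a short sketch. This is a minimal reconstruction assuming per‑token log‑probabilities under the current policy, a recent "proximal" policy used as the clipping anchor, and the stale behavior policy that generated the rollout; it is not AReal's actual implementation.

```python
import torch

def decoupled_ppo_loss(logp_new: torch.Tensor, logp_prox: torch.Tensor,
                       logp_behav: torch.Tensor, advantages: torch.Tensor,
                       clip_eps: float = 0.2) -> torch.Tensor:
    """Sketch of a decoupled, staleness-aware PPO objective (not AReal's code).

    Double-layer importance sampling: an outer ratio corrects for the stale
    behavior policy that generated the data, while the inner clipped ratio is
    taken against a recent 'proximal' policy, so the trust region stays
    meaningful even as rollouts grow stale."""
    behav_correction = (logp_prox - logp_behav).exp().detach()  # pi_prox / pi_behav
    ratio = (logp_new - logp_prox).exp()                        # pi_theta / pi_prox
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -(behav_correction * torch.min(unclipped, clipped)).mean()
```

Anchoring the clip to the proximal policy rather than the behavior policy is what lets staleness be bounded separately, via η, instead of being absorbed into the clip range.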

Seer: Load‑Balanced Synchronous Design

Seer (Moonshot) keeps strict on‑policy A2C training while applying aggressive engineering to eliminate bubbles.

Divided Rollout: long requests are split into smaller chunks; scheduling at the chunk level fills GPU gaps.

Global KV Cache (Mooncake): a disaggregated KV cache enables request migration across GPUs without repeating the prefill phase.

Context‑Aware Scheduling: predicts the maximum generation length from the prompt and prioritizes long tasks, achieving up to 87% tail‑latency reduction (a toy version follows this list).

Adaptive Grouped Speculative Decoding (AGSD): builds a compressed suffix tree from fast requests to serve as a draft model for slower ones, avoiding draft‑model staleness.
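As an illustration of the scheduling idea only, the toy scheduler below dispatches the requests predicted to decode longest first, so stragglers start early instead of defining the batch tail. Here predict_len is a hypothetical length estimator; Seer's production scheduler is considerably more sophisticated.

```python
import heapq
from typing import Callable, Iterable, Iterator

def longest_predicted_first(requests: Iterable, predict_len: Callable) -> Iterator:
    """Toy context-aware scheduler: yield requests in descending order of their
    predicted generation length, so the likely-longest rollouts start first."""
    heap = [(-predict_len(req), i, req) for i, req in enumerate(requests)]
    heapq.heapify(heap)
    while heap:
        _, _, req = heapq.heappop(heap)
        yield req  # hand off to the next free inference worker
```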

verl: HybridFlow Design

verl (ByteDance) provides an open‑source ecosystem with verl‑agent. Recent releases add fully asynchronous training and a decoupled PPO similar to AReal's, with the following features:

AgentLoop for multi‑turn training.

Off‑policy control mechanisms.

Dynamic staleness_threshold (e.g., 0.5 allows up to half an epoch of lag; see the sketch after this list).

Partial rollout / sleep‑resume to interrupt long tasks without token waste.
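The staleness_threshold semantics can be sketched in a few lines; the function below is a hypothetical illustration of the idea, not verl's actual code.

```python
def is_sample_fresh(sample_version: int, trainer_version: int,
                    steps_per_epoch: int, staleness_threshold: float = 0.5) -> bool:
    """Hypothetical staleness gate (not verl's actual code): accept a rollout
    only if the policy version that produced it lags the trainer by at most
    staleness_threshold epochs, e.g. 0.5 => at most half an epoch of lag."""
    lag_in_steps = trainer_version - sample_version
    return lag_in_steps <= staleness_threshold * steps_per_epoch
```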

Slime: Hybrid Framework for MoE

Slime (Zhipu) is a lightweight MoE‑focused framework offering flexible synchronous and asynchronous modes.

Colocated Synchronous: ideal for tasks requiring strict on‑policy behavior (e.g., math proofs).

Decoupled Asynchronous: prevents environment interaction from blocking training for complex, long‑execution agents.

Native integration with SGLang, inheriting community optimizations such as RadixAttention and Triton kernels.

Active Partial Rollouts: over‑provision inference, then terminate excess requests early while retaining their KV cache for the next batch (sketched below).
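The over‑provisioning idea can be sketched as follows, assuming a hypothetical async engine.generate API; Slime's real implementation manages KV‑cache retention inside SGLang rather than in user code.

```python
import asyncio

async def active_partial_rollout(engine, prompts, batch_size: int,
                                 over_provision: float = 1.5):
    """Sketch of active partial rollouts (hypothetical engine API, not Slime's):
    launch more requests than needed, return once batch_size of them finish,
    and cancel the rest; the engine is assumed to keep their partial tokens and
    KV cache so the aborted requests can resume cheaply in the next batch."""
    tasks = [asyncio.ensure_future(engine.generate(p))
             for p in prompts[: int(batch_size * over_provision)]]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    while len(done) < batch_size and pending:
        more, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        done |= more
    for task in pending:
        task.cancel()  # aborted early; partial output is retained for reuse
    return [task.result() for task in list(done)[:batch_size]]
```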

Framework Comparison (concise)

AReal (Ant) – Off‑policy, training with Megatron/FSDP, inference via vLLM/SGLang, orchestrated by Ray.

Slime (Zhipu) – Hybrid, Megatron training, native SGLang inference, Ray orchestration.

verl (ByteDance) – Hybrid, Megatron/FSDP training, vLLM/SGLang inference, Ray orchestration.

Seer (Moonshot) – On‑policy, Megatron training, custom vLLM inference, orchestrated with K8s + Ray.

Selection Guidelines

Complex long‑call agents (web search, code execution) benefit from fully asynchronous frameworks such as AReal or Slime’s async mode; larger sample sizes can offset off‑policy accuracy loss.

Strict logical‑reasoning tasks (math, coding) require on‑policy consistency; Seer or verl's synchronous/half‑async modes are preferable, with Seer currently offering the highest performance ceiling.

Training massive MoE models aligns best with Slime, which provides the most complete SGLang‑native MoE support.

Small teams or beginners may start with verl (balanced feature set and active community) or Slime (lightweight and easy to modify).

Defining an Industrial‑Grade Agentic RL Framework

An industrial‑grade solution should exhibit:

Native support for Ray (or equivalent) to enable seamless train‑rollout decoupling and flexible sync/async switching.

Deep integration of inference accelerators (vLLM, SGLang), FP8 or lower‑precision quantization, and MoE‑specific routing/replay mechanisms.

Advanced long‑context management: partial rollouts, global KV cache, context compression, and extensible context managers.

Tags: large language models, framework comparison, training efficiency, agentic RL, asynchronous inference, RL systems
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.
