How Relax Powers Scalable Multi‑Modal RL Training with Full Asynchrony
Relax, an open-source RL training engine built on Megatron-LM and SGLang, tackles data heterogeneity, system fragility, and role coupling with a service-oriented fault-tolerant architecture, asynchronous pipelines, and multimodal-native support, achieving up to a 76% end-to-end speedup over a colocated baseline.
Introduction
Today the Xiaohongshu AI Platform team open-sourced Relax, a reinforcement-learning (RL) training engine designed for full-modal and agentic scenarios. Built on the high-performance back-ends Megatron-LM and SGLang, Relax unifies multimodal data support, a service-oriented fault-tolerant architecture, and asynchronous training pipelines.
Key Challenges
Data heterogeneity: Large-scale image, audio, and video data cause massive transmission volume, high CPU preprocessing cost, and token explosion, making existing parallel strategies inefficient.
System fragility: Multimodal training carries a high OOM risk, and long-running jobs face frequent hardware failures and NCCL time-outs; traditional solutions lack minute-level fault recovery and per-role elastic scaling.
Role coupling: In colocated setups, all roles share GPUs and must execute serially; even fully asynchronous designs lack fine-grained pipeline scheduling.
Service‑Oriented Fault‑Tolerant Architecture
Relax encapsulates each RL role (Actor, Critic, Rollout, etc.) as an independent Ray Serve service, providing isolated fault domains, resource quotas, and health monitoring. This yields three core capabilities (a code sketch follows the list):
Fault isolation: A failure in one service (e.g., OOM) does not affect others; two-level recovery distinguishes stateless (in-place restart) from stateful (global recovery) roles.
Independent scaling: Roles can be scaled separately; for example, Rollout replicas can be increased without touching the Critic cluster.
Lifecycle management: Initialization, checkpointing, and restart are managed at the service level rather than being tangled in a global training loop.
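To make the pattern concrete, here is a minimal sketch of one role wrapped as a Ray Serve deployment. The Serve options (num_replicas, ray_actor_options, health_check_period_s, check_health) are real Ray Serve APIs, but the service class and its internals are hypothetical illustrations, not Relax's actual code.

```python
from ray import serve

@serve.deployment(
    num_replicas=2,                     # scale this role independently of the others
    ray_actor_options={"num_gpus": 1},  # per-replica resource quota
    health_check_period_s=10,           # Serve probes each replica on this interval
)
class RolloutService:
    """Hypothetical Rollout role; a real service would wrap an SGLang engine."""

    def __init__(self):
        self.healthy = True  # placeholder for real engine state

    async def __call__(self, prompt: str) -> str:
        return f"completion for: {prompt}"  # stand-in for actual generation

    def check_health(self):
        # Raising here marks only this replica unhealthy; Serve restarts it
        # in place (stateless recovery) while other roles keep running.
        if not self.healthy:
            raise RuntimeError("rollout engine lost")

app = RolloutService.bind()
# serve.run(app)  # each role gets its own fault domain and lifecycle
```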
Distributed Checkpoint Service (DCS)
Relax provides a dedicated weight‑synchronization service that distributes updated weights to all inference engines with low latency. DCS supports both NCCL (GPU‑to‑GPU) and TCP (cross‑cluster) channels, adapting to various deployment topologies.
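The article does not document DCS's interface, but the dual-channel idea can be sketched with torch.distributed: an in-place NCCL broadcast when trainer and inference engines share a GPU fabric, and a CPU-staged broadcast over a gloo (TCP) group across clusters. Function and argument names below are illustrative assumptions.

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called, and that a
# gloo group was created for the TCP path: tcp_group = dist.new_group(backend="gloo")

def sync_weights(model: torch.nn.Module, src_rank: int = 0,
                 channel: str = "nccl", tcp_group=None):
    """Broadcast updated weights from the trainer rank to inference ranks."""
    for p in model.parameters():
        if channel == "nccl":
            dist.broadcast(p.data, src=src_rank)   # GPU-to-GPU, no host round-trip
        else:
            buf = p.data.cpu()                      # stage through host memory
            dist.broadcast(buf, src=src_rank, group=tcp_group)  # TCP-capable channel
            p.data.copy_(buf)
```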
Asynchronous Training Pipeline
Relax integrates TransferQueue (TQ) as an asynchronous data bus between all services. TQ’s field‑level storage allows different fields of the same sample (e.g., generated result, log‑probs, reward) to be written and read independently at different times, matching the multi‑stage computation pattern of RL.
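TQ's real API is not shown in the article; the toy queue below only imitates the field-level idea. A consumer can await a single field of a sample and proceed the moment that field lands, independent of when the sample's other fields arrive.

```python
import asyncio
from collections import defaultdict

class FieldQueue:
    """Toy stand-in for TransferQueue's field-level storage (not its actual API)."""

    def __init__(self):
        self._rows = defaultdict(dict)              # sample_id -> {field: value}
        self._ready = defaultdict(asyncio.Event)    # (sample_id, field) -> ready flag

    def put(self, sample_id, field, value):
        self._rows[sample_id][field] = value
        self._ready[(sample_id, field)].set()       # wake readers of this field only

    async def get(self, sample_id, field):
        await self._ready[(sample_id, field)].wait()
        return self._rows[sample_id][field]

async def demo():
    tq = FieldQueue()
    tq.put(0, "response", "generated text")            # rollout writes first
    reward = asyncio.create_task(tq.get(0, "reward"))  # reward reader waits...
    tq.put(0, "reward", 1.0)                           # ...until the field is written
    print(await reward)                                # -> 1.0

asyncio.run(demo())
```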
Using a single max_staleness parameter, Relax can switch between On‑Policy and Off‑Policy modes. In fully asynchronous runs, On‑Policy gains a 12% speedup over the colocated baseline, while Off‑Policy improves by 76%.
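A plausible reading of the knob (an assumption based on the description, not Relax's documented semantics): a sample generated at policy version v is eligible for training at version t only if t - v <= max_staleness, so a value of 0 recovers strict on-policy training and larger values admit off-policy overlap.

```python
def is_consumable(sample_version: int, trainer_version: int, max_staleness: int) -> bool:
    # Gate a queued sample by how many policy updates have happened since it was generated
    return trainer_version - sample_version <= max_staleness

assert is_consumable(5, 5, max_staleness=0)      # on-policy: same-version samples only
assert not is_consumable(4, 5, max_staleness=0)
assert is_consumable(3, 5, max_staleness=2)      # off-policy: up to 2 versions stale
```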
Key Mechanisms
Streaming micro-batch scheduling: The global batch is split into micro-batches; each micro-batch is written to TQ as soon as it finishes, eliminating the global-batch synchronization bottleneck (see the sketch after this list).
Actor-train resource separation: Log-probability and reference-log-probability calculations run on independent GPUs, fully overlapping with training.
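A self-contained sketch of the streaming idea, with a plain queue.Queue standing in for TQ and a sleep standing in for generation latency; all names are illustrative.

```python
import queue
import threading
import time

def rollout_producer(q: queue.Queue, global_batch, micro_size: int):
    # Push each micro-batch the moment it finishes, instead of waiting for
    # the whole global batch to complete.
    for i in range(0, len(global_batch), micro_size):
        time.sleep(0.01)                       # stand-in for generation latency
        q.put(global_batch[i:i + micro_size])  # hand off immediately
    q.put(None)                                # end-of-batch sentinel

def trainer_consumer(q: queue.Queue):
    while (micro := q.get()) is not None:
        pass  # a train step on `micro` would run here, overlapped with rollout

q = queue.Queue()
threading.Thread(target=rollout_producer, args=(q, list(range(32)), 8)).start()
trainer_consumer(q)
```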
Omni‑Native Multimodal Support
Relax natively handles images, audio, and video, combining modality-aware parallelism with end-to-end asynchronous pipelines. Experiments on Qwen3-Omni-30B with image-text-audio (AVQA-R1-6K) and video (NextQA) data show stable convergence, with the video task remaining stable beyond 2,000 training steps.
Agentic RL Scenarios
Relax decouples infrastructure from algorithmic concerns, enabling flexible multi‑round agentic workflows. Custom Rollout and Reward services support session state, rule‑based rewards, generative reward models (GenRM), and user‑defined reward interfaces. Tool use is integrated as asynchronous service calls within the rollout loop.
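A hedged sketch of such a loop, with generate, call_tool, and score as stand-ins for the Rollout, tool, and Reward services (none of these names come from Relax). Because the tool call is awaited asynchronously, other in-flight rollouts keep making progress while one session waits on a tool.

```python
import asyncio

async def generate(messages):                 # stand-in for the Rollout service
    done = len(messages) > 2
    return {"content": "model reply", "tool_call": None if done else "search"}

async def call_tool(name, args):              # stand-in for an async tool service
    await asyncio.sleep(0.01)                 # non-blocking wait on the tool
    return f"{name} result"

async def score(messages):                    # rule-based or GenRM Reward service
    return 1.0

async def agentic_rollout(prompt: str, max_rounds: int = 4):
    messages = [{"role": "user", "content": prompt}]   # session state lives here
    for _ in range(max_rounds):
        reply = await generate(messages)
        messages.append({"role": "assistant", "content": reply["content"]})
        if reply["tool_call"] is None:
            break
        result = await call_tool(reply["tool_call"], {})
        messages.append({"role": "tool", "content": result})
    return messages, await score(messages)

print(asyncio.run(agentic_rollout("find the answer")))
```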
Experimental Results
End-to-end performance vs. veRL: On a 2-machine, 16-GPU DAPO-Math task, Relax is 20% faster than veRL, thanks to streaming micro-batch scheduling and resource separation that hide forward-pass latency.
MoE training stability (Near-Zero-Overhead R3): On Qwen3-30B-A3B, Relax reduces rollout-routing-replay mismatch by 38% with only 1.9% overhead, whereas veRL's R3 adds 34% overhead. This is achieved by rewriting the serialization path to use NCCL-native broadcast and GPU-resident asynchronous transfers (sketched below).
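The described transfer pattern, keeping routing tensors on the GPU and moving them with an asynchronous NCCL broadcast instead of serializing through host memory, can be sketched as follows. The function and tensor names are hypothetical, and an initialized NCCL process group is assumed.

```python
import torch
import torch.distributed as dist

def replay_routing(expert_ids: torch.Tensor, src_rank: int = 0):
    """Share MoE routing decisions from rollout ranks to training ranks.

    expert_ids stays GPU-resident; async_op=True lets the broadcast overlap
    with ongoing compute rather than blocking on a host-side copy.
    """
    work = dist.broadcast(expert_ids, src=src_rank, async_op=True)
    return work  # caller invokes work.wait() just before consuming the tensor
```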
Conclusion and Outlook
The three intertwined challenges of data heterogeneity, system fragility, and role coupling call for a coordinated design rather than isolated fixes. Relax addresses them jointly: multimodal-native pipelines, service-oriented isolation with DCS, and micro-batch asynchronous scheduling reinforce one another end to end. Future work will extend the system to larger scales and more complex agentic RL workloads.