How Relax Powers Scalable Multi‑Modal RL Training with Full‑Async Pipelines

Relax, an open‑source reinforcement‑learning engine from the Xiaohongshu AI Platform team, combines a service‑oriented fault‑tolerant architecture, a distributed checkpoint service, and an asynchronous training pipeline to deliver up to a 76% speed‑up and near‑zero‑overhead MoE routing replay for multi‑modal RL workloads.

Background and Motivation

Relax is an open‑source reinforcement‑learning (RL) training engine built on Megatron‑LM and SGLang. It targets large‑scale multi‑modal and agentic RL scenarios where images, audio, video, and text must be processed jointly.

Key Challenges

Data heterogeneity: High‑resolution visual and audio streams generate massive raw data, impose heavy CPU preprocessing, and cause token explosion, making conventional parallel encoders inefficient.

System fragility: Multi‑modal training raises out‑of‑memory (OOM) risk, and long‑running jobs are prone to hardware failures and NCCL timeouts. Existing pipelines lack minute‑level fault recovery and elastic scaling.

Role coupling: In colocated designs, all RL roles (actor, critic, rollout, trainer) share GPUs and execute serially, so the trainer waits for the slowest rollout. Existing async approaches separate rollout from training but lack fine‑grained pipeline scheduling.

Service‑Oriented Fault‑Tolerant Architecture

Each RL role is encapsulated as an independent Ray Serve service (see the sketch after this list). This provides:

Fault isolation: Failures (e.g., OOM) in one service do not affect others. A two‑level recovery strategy distinguishes stateless roles (in‑place restart) from stateful roles (global checkpoint recovery).

Independent scaling: Services can be scaled individually; for example, rollout replicas can be added without touching the critic cluster.

Lifecycle management: Initialization, checkpointing, and restart are handled at the service level, decoupled from the global training loop.
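
A minimal sketch of the role‑per‑service pattern, assuming Ray Serve's deployment API; the class names, replica counts, and GPU numbers below are illustrative, not Relax's actual code:

```python
"""Role-per-service sketch with Ray Serve.
Class names and resource counts are illustrative assumptions."""
from ray import serve


@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 1})
class RolloutService:
    """Stateless role: if a replica crashes (e.g., OOM), Ray Serve
    restarts it in place without touching other services."""

    async def __call__(self, prompt: str) -> str:
        return prompt + " -> completion"  # stand-in for real generation


@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 8})
class TrainerService:
    """Stateful role: recovery goes through a global checkpoint instead."""

    def train_step(self, batch) -> None:
        pass  # stand-in for a Megatron-LM training step


# Each role runs as its own named application, so rollout replicas can be
# scaled independently of the trainer.
serve.run(RolloutService.bind(), name="rollout", route_prefix="/rollout")
serve.run(TrainerService.bind(), name="trainer", route_prefix="/trainer")
```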

Distributed Checkpoint Service (DCS)

DCS is a dedicated weight‑synchronization service that distributes updated model weights to all inference engines with low latency. It supports both NCCL (GPU‑GPU) and TCP (cross‑cluster) channels, enabling fast recovery without writing checkpoints to disk.
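
Conceptually, the NCCL channel behaves like a collective broadcast of the trainer's weights. The sketch below assumes an already‑initialized torch.distributed NCCL group with GPU‑resident weights; the function name is hypothetical, and DCS's real interface also covers the TCP channel for cross‑cluster sync:

```python
# Sketch of push-style weight sync over the NCCL (GPU-GPU) channel.
# Assumes dist.init_process_group("nccl") has run and all model tensors
# already live on the GPU; this is an illustration, not DCS's interface.
import torch
import torch.distributed as dist


def push_weights(model: torch.nn.Module, src_rank: int = 0) -> None:
    """The trainer rank broadcasts updated weights directly to the
    inference ranks, so no checkpoint ever has to touch disk."""
    for tensor in model.state_dict().values():
        dist.broadcast(tensor, src=src_rank)  # GPU buffers travel over NCCL
```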

Asynchronous Training Pipeline

Relax introduces a TransferQueue (TQ) as an asynchronous data bus between services. TQ stores each sample’s fields (generated output, log‑probabilities, reward) independently, allowing producers and consumers to read/write at different times. A single max_staleness parameter toggles on‑policy (low staleness) versus off‑policy (higher staleness) modes.
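
A toy in‑memory version illustrates the idea; the field names and the version‑based staleness check are assumptions for illustration, not Relax's TQ implementation:

```python
# Toy TransferQueue: fields of one sample arrive independently, and
# max_staleness bounds how many policy versions old a consumed sample may be.
class TransferQueue:
    def __init__(self, max_staleness: int = 0):
        self.max_staleness = max_staleness  # 0 ~ on-policy, >0 ~ off-policy
        self.fields: dict[int, dict] = {}   # sample_id -> {field: value}
        self.version: dict[int, int] = {}   # policy version at generation time

    def put(self, sample_id: int, field: str, value, policy_version: int) -> None:
        # Producers (rollout, reward, log-prob workers) write their fields at
        # different times; no one waits for the full sample to assemble.
        self.fields.setdefault(sample_id, {})[field] = value
        self.version[sample_id] = policy_version

    def ready(self, sample_id: int, needed: tuple, current_version: int) -> bool:
        # A consumer reads only when all required fields exist and the sample
        # is fresh enough under max_staleness.
        age = current_version - self.version.get(sample_id, current_version)
        return age <= self.max_staleness and all(
            f in self.fields.get(sample_id, {}) for f in needed
        )


tq = TransferQueue(max_staleness=2)  # allow mildly off-policy reuse
tq.put(0, "output", "generated text", policy_version=9)
tq.put(0, "reward", 1.0, policy_version=9)
print(tq.ready(0, ("output", "reward"), current_version=10))  # True
```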

Performance impact: on‑policy training gains ~12% speed‑up and off‑policy training gains ~76%, compared with a colocated baseline.

Key Mechanisms

Streaming micro‑batch scheduling: The global batch is split into micro‑batches, and each micro‑batch is written to TQ as soon as it finishes, so the trainer never waits for the entire rollout batch (see the sketch after this list).

Actor‑train resource separation: Log‑prob and reference log‑prob calculations run on dedicated GPUs in parallel with the main trainer, fully overlapping the two computations.
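
A minimal sketch of the streaming pattern, with a plain thread and queue standing in for the rollout service and TQ (all function names are placeholders):

```python
# Streaming micro-batch sketch: the rollout pushes each finished micro-batch
# immediately, so training starts after the first micro-batch arrives instead
# of after the whole global batch. queue.Queue stands in for TQ.
import queue
import threading


def rollout_worker(q: queue.Queue, prompts: list[str], micro_batch: int) -> None:
    for i in range(0, len(prompts), micro_batch):
        generated = [p + " -> completion" for p in prompts[i:i + micro_batch]]
        q.put(generated)  # visible to the trainer right away
    q.put(None)           # end-of-global-batch sentinel


def trainer_loop(q: queue.Queue) -> None:
    while (mb := q.get()) is not None:
        print(f"training on micro-batch of {len(mb)}")  # stand-in train step


q: queue.Queue = queue.Queue()
threading.Thread(
    target=rollout_worker, args=(q, [f"prompt-{i}" for i in range(32)], 8)
).start()
trainer_loop(q)  # overlaps with generation of the remaining micro-batches
```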

Omni‑Native Multi‑Modal Support

Relax natively processes images, audio, video, and text within a unified pipeline. Experiments on Qwen3‑Omni‑30B with AVQA‑R1‑6K (image‑text‑audio) and NextQA (video) show stable convergence, with video runs remaining stable beyond 2,000 training steps.

Agentic RL Extensions

Custom Rollout & Reward: Supports multi‑turn agentic workflows with visual inputs. Rollout services maintain session state, and TQ tracks field readiness per turn. Reward computation can use rule‑based signals, a generative reward model (GenRM), or user‑defined interfaces.

Tool Use: Tool calls are modeled as asynchronous service calls inside the rollout loop, as sketched below.
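
A hedged sketch of such a loop, where generate_turn and call_tool are hypothetical stand‑ins for the model service and a tool service:

```python
# Sketch of a multi-turn agentic rollout where tool calls are awaited as
# asynchronous service calls. generate_turn and call_tool are hypothetical
# stand-ins, not Relax's actual interfaces.
import asyncio


async def generate_turn(history: list[str]) -> str:
    # Stand-in for the model service producing the next action.
    return "TOOL:search(weather)" if len(history) == 1 else "final answer"


async def call_tool(request: str) -> str:
    # Stand-in for, e.g., an HTTP call to an external tool service.
    return f"tool result for {request!r}"


async def agentic_rollout(prompt: str, max_turns: int = 4) -> list[str]:
    history = [prompt]  # session state lives with the rollout service
    for _ in range(max_turns):
        turn = await generate_turn(history)
        if turn.startswith("TOOL:"):
            history.append(await call_tool(turn[len("TOOL:"):]))  # non-blocking
        else:
            history.append(turn)  # model finished this session
            break
    return history


print(asyncio.run(agentic_rollout("what's the weather?")))
```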

Performance Evaluation

On a 2‑machine 16‑GPU DAPO‑Math benchmark, Relax outperforms the veRL baseline by 20 % end‑to‑end. The gain originates from streaming micro‑batch scheduling (removing global batch synchronization) and actor‑train resource separation (hiding forward‑pass latency).

MoE Training Stability – Near‑Zero‑Overhead R3

Relax implements a near‑zero‑overhead version of Rollout Routing Replay (R3). On Qwen3‑30B‑A3B, mismatch drops 38% while runtime increases only 1.9%, whereas veRL's R3 adds 34% overhead. The improvement comes from moving routing data out of Python pickle and broadcasting it via native NCCL, plus GPU‑resident asynchronous transfers.
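
The shape of that change can be sketched as follows, assuming the routing indices are kept in a GPU tensor and broadcast through an initialized torch.distributed NCCL group; this illustrates the technique rather than Relax's code:

```python
# Sketch: replay expert-routing ids by broadcasting a GPU tensor over NCCL
# instead of serializing Python objects with pickle. Assumes an initialized
# NCCL process group; names and shapes are illustrative assumptions.
import torch
import torch.distributed as dist


def replay_routing(routing_ids: torch.Tensor, src_rank: int = 0) -> torch.Tensor:
    """routing_ids: [num_tokens, top_k] expert indices captured at rollout."""
    routing_ids = routing_ids.to("cuda", non_blocking=True)  # async, GPU-resident
    dist.broadcast(routing_ids, src=src_rank)                # native NCCL, no pickle
    return routing_ids  # trainer replays these routes during the forward pass
```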

Conclusion

Relax demonstrates that data heterogeneity, system fragility, and role coupling are most effectively addressed through a coordinated design: omni‑native multi‑modal pipelines, service‑level isolation with fast checkpoint recovery, and a micro‑batch asynchronous pipeline. The system is planned to scale to larger models and more complex agentic RL tasks.

Resources: GitHub project – https://github.com/redai-infra/Relax; Paper – https://arxiv.org/abs/2604.11554
