How SeamlessFlow Doubles RL Training Throughput and Cuts Time by 62%

SeamlessFlow, an industrial‑scale reinforcement‑learning training framework released by Kuaishou's Kuaipilot team, decouples the trainer from agents via a dedicated data plane, introduces a tag‑based resource scheduler, and eliminates pipeline bubbles, achieving up to a 100% boost in token throughput and a 62% reduction in overall training time on large‑model RL workloads.


Introduction

The Kuaipilot team at Kuaishou recently published the SeamlessFlow technical report, describing an industrial‑grade reinforcement‑learning (RL) training framework designed for large‑model scenarios.

Challenges in Large‑Model RL Training

The report identifies two major difficulties:

Strong coupling between training logic and agent execution: agents often contain complex internal logic (memory, multi‑branch inference, extensions) that makes the RL pipeline tightly dependent on each agent's implementation, increasing maintenance costs and leading to incomplete trajectory records.

Conflict between compute utilization and system stability: colocated deployment (training and inference on the same machines) maximizes GPU utilization but lacks flexibility and risks cascading failures; disaggregated deployment (separate clusters) improves stability but introduces pipeline bubbles that waste GPU cycles.

Data‑Plane Decoupling

SeamlessFlow introduces an independent data‑plane layer that completely separates the RL trainer from agents. A transparent Trajectory Manager records every token‑level input and output without requiring agents to adapt to the training framework.
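The report does not ship implementation code, but the recording idea can be sketched. In the minimal Python sketch below, every name (TrajectoryManager, TrajectoryStep, wrap) is our own hypothetical stand‑in, not SeamlessFlow's API: the recorder wraps the inference call so the agent invokes the model exactly as before, while each token‑level exchange is captured on the side.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TrajectoryStep:
    """One model call: prompt tokens in, completion tokens out."""
    prompt_tokens: List[int]
    completion_tokens: List[int]


@dataclass
class TrajectoryManager:
    """Hypothetical transparent recorder (names are ours, not the paper's):
    wraps the inference endpoint so agents need no changes, while every
    token-level input/output pair is logged on the side."""
    steps: List[TrajectoryStep] = field(default_factory=list)

    def wrap(self, generate: Callable[[List[int]], List[int]]):
        def recorded_generate(prompt_tokens: List[int]) -> List[int]:
            completion = generate(prompt_tokens)
            self.steps.append(TrajectoryStep(prompt_tokens, completion))
            return completion
        return recorded_generate


# Usage: hand the agent `recorded` instead of the raw endpoint.
manager = TrajectoryManager()
recorded = manager.wrap(lambda toks: toks + [42])  # stand-in for a real model
recorded([1, 2, 3])
print(len(manager.steps))  # -> 1
```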

The manager also reconstructs conversation trees using longest‑prefix matching, enabling efficient storage and precise on‑policy/off‑policy labeling.
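As a toy illustration of that matching step (assuming trajectories are plain token‑ID lists; the report's actual data structures are not spelled out in this summary), each new sequence is attached to the previously stored node sharing its longest prefix, so common history is stored only once:

```python
from typing import List, Tuple


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Length of the shared token prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_tree(sequences: List[List[int]]) -> List[Tuple[int, int, List[int]]]:
    """Toy reconstruction: attach each sequence to the already-inserted
    node whose tokens share the longest prefix with it, keeping only the
    non-shared suffix. Returns (node_id, parent_id, suffix) triples."""
    nodes: List[Tuple[int, int, List[int]]] = []  # (id, parent, suffix)
    full: List[List[int]] = []                    # full sequence per node
    for seq in sequences:
        parent, best = -1, 0
        for i, stored in enumerate(full):
            k = common_prefix_len(seq, stored)
            if k > best:
                parent, best = i, k
        nodes.append((len(nodes), parent, seq[best:]))
        full.append(seq)
    return nodes


# Two rollouts that diverge after the third token share one stored prefix.
print(build_tree([[1, 2, 3, 4, 5], [1, 2, 3, 9, 9]]))
# -> [(0, -1, [1, 2, 3, 4, 5]), (1, 0, [9, 9])]
```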

Another component, the Rollout Manager, paces the whole system: when enough samples have been collected or a model update is due, it pauses inference on selected machines while the remaining agents continue generating, achieving a seamless handover.
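A rough sketch of that pacing logic, with invented worker names and a deliberately simplistic trigger, might look like the following; the real system presumably coordinates this across processes rather than in one object:

```python
import threading
from typing import Set


class RolloutManager:
    """Hypothetical pacing sketch: once enough samples arrive (or a weight
    update is due), pause inference on a chosen subset of workers while the
    rest keep generating, so the handover never stalls the whole fleet."""

    def __init__(self, workers: Set[str], batch_size: int):
        self.workers = workers
        self.batch_size = batch_size
        self.paused: Set[str] = set()
        self.samples = 0
        self._lock = threading.Lock()

    def on_sample(self) -> None:
        with self._lock:
            self.samples += 1
            if self.samples >= self.batch_size:
                self.samples = 0
                self._begin_handover()

    def _begin_handover(self) -> None:
        # Toy rule: pause the dual-role ("shared-") workers for training;
        # rollout-only workers continue generating uninterrupted.
        to_pause = {w for w in self.workers if w.startswith("shared-")}
        self.paused |= to_pause
        print(f"pausing {sorted(to_pause)}; "
              f"{sorted(self.workers - self.paused)} keep generating")


mgr = RolloutManager({"shared-0", "shared-1", "rollout-0"}, batch_size=2)
mgr.on_sample()
mgr.on_sample()  # second sample hits the threshold and triggers handover
```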

Tag‑Driven Resource Scheduling

Resources are abstracted as capability tags (e.g., rollout, train). The scheduler assigns tasks based on tags rather than physical machine identity, unifying colocated and disaggregated designs under a single framework.
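The matching step itself is simple. The sketch below is our own simplification, not SeamlessFlow's API: a task asks for a capability tag and receives every worker advertising it, so colocated nodes (multiple tags) and disaggregated nodes (one tag) are handled identically.

```python
from typing import Dict, List, Set


def schedule(task_tag: str, workers: Dict[str, Set[str]]) -> List[str]:
    """Toy tag matcher: resolve a required capability tag to the workers
    that advertise it, ignoring physical machine identity."""
    return sorted(w for w, tags in workers.items() if task_tag in tags)


workers = {
    "node-a": {"rollout", "train"},  # colocated: can serve both roles
    "node-b": {"rollout"},           # inference-only
    "node-c": {"train"},             # training-only
}
print(schedule("rollout", workers))  # ['node-a', 'node-b']
print(schedule("train", workers))    # ['node-a', 'node-c']
```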

By assigning both rollout and train tags to a subset of machines, SeamlessFlow enables spatiotemporal multiplexing: during data‑collection phases all machines generate data; when training is triggered, the dual‑tagged machines switch to training while the others keep generating, eliminating pipeline bubbles and reducing GPU idle time to below 5%.
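Building on the same tag model, a hypothetical phase switch for the dual‑tagged machines could be as small as this sketch (again our own construction, not the framework's code):

```python
from typing import Dict, Set


def assign_phase(workers: Dict[str, Set[str]], training: bool) -> Dict[str, str]:
    """Toy spatiotemporal multiplexing: dual-tagged machines flip to
    training when a step is triggered; every other rollout-capable
    machine keeps generating, so neither phase leaves GPUs idle."""
    roles: Dict[str, str] = {}
    for name, tags in workers.items():
        if training and "train" in tags:
            roles[name] = "train"
        elif "rollout" in tags:
            roles[name] = "rollout"
    return roles


workers = {"node-a": {"rollout", "train"}, "node-b": {"rollout"}}
print(assign_phase(workers, training=False))
# -> {'node-a': 'rollout', 'node-b': 'rollout'}
print(assign_phase(workers, training=True))
# -> {'node-a': 'train', 'node-b': 'rollout'}
```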

Experimental Validation

On a 32‑GPU H800 cluster, SeamlessFlow achieved a 100% increase in token throughput and a 62% reduction in total training time compared with the VERL baseline on an 8K‑token RL task.

In more demanding scenarios (64K‑token code generation with the SWE‑agent scaffold), it delivered a 1.55× throughput gain, and scaling from 32 to 64 GPUs further amplified the advantage.

For software‑engineering RL tasks on the SWE‑Bench dataset, Qwen3‑8B's success rate rose from 12.2% to 27.4%, and Qwen3‑32B improved from 23% to 45.8% when trained with SeamlessFlow.

Design Insights

SeamlessFlow embodies a "focus‑separation" architecture: by extracting trajectory management from agents, algorithm engineers can concentrate on RL improvements while product engineers iterate on agent features without breaking the training pipeline.

The tag‑driven design likewise embodies "unified abstraction": it reconciles the trade‑off between efficiency and stability, allowing resources to be reallocated dynamically based on real‑time load.

Conclusion

SeamlessFlow represents a new paradigm for industrial‑scale RL training, delivering high efficiency, stability, and flexibility. Its principles are applicable beyond RL to any large‑scale machine‑learning system requiring tight integration of training and inference.

Figure 1: SeamlessFlow Overview
Figure 2: Architecture Diagram
Figure 3: Tag‑Based Scheduling
Figure 4: Throughput Comparison
Figure 5: Scaling Performance
Figure 6: Reward Curve