How SOP Enables Scalable Online Post-Training for Real‑World Robots
The SOP (Scalable Online Post‑training) framework redesigns VLA post‑training from offline, single‑machine, sequential processing to a distributed, parallel online system, allowing robot fleets to continuously learn, share experiences, and scale intelligence while maintaining stability and generalization in complex real‑world environments.
Background and Motivation
Researchers at the ZhiYuan Robotics Embodied Research Center have built open-source embodied world-model platforms and embodied "brain" systems that let robots understand their surroundings and make decisions. While demonstrations have shown that world models can act as simulators and these brain systems as controllers, deploying general-purpose robots at scale in open, dynamic real-world settings raises the core problem: how to achieve scalable, intelligent operation.
Challenges of Real‑World Scale‑Up
Robots must simultaneously satisfy two seemingly contradictory requirements: stability and reliability in ever-changing environments, and generalization across highly diverse tasks. Existing VLA pre-training provides broad applicability, but real-world deployment demands higher task-specific performance, and offline data collection runs into diminishing returns, forcing reliance on post-training. Current VLA post-training methods are limited to offline, single-machine, sequential data acquisition, which cannot support efficient, continuous learning in the field.
SOP: Distributed Online Post‑Training Framework
SOP (Scalable Online Post-training) introduces a distributed, continuous online-learning paradigm that shifts VLA post-training from "offline, single-machine, sequential" to "online, cluster-scale, parallel". The low-latency closed loop works as follows (a minimal code sketch follows the list):
1. Multiple robots (actors) collect experience in parallel.
2. Data is streamed to a cloud-based Experience Buffer.
3. A cloud learner updates the policy using both online and offline data.
4. Updated parameters are synchronized back to all robots within minutes.
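A minimal sketch of this closed loop, with threads standing in for robots and the cloud. All names here (Trajectory, ExperienceBuffer, and so on) are illustrative stand-ins, not the actual SOP codebase:

```python
# Sketch of an SOP-style asynchronous closed loop. Threads emulate a
# fleet of robots (actors) and one cloud learner; real deployments
# would stream over the network instead of a shared in-process buffer.
import random
import threading
import time
from dataclasses import dataclass


@dataclass
class Trajectory:
    robot_id: int
    success: bool       # task-level outcome label
    intervened: bool    # True if a human teleoperator corrected the robot


class ExperienceBuffer:
    """Cloud-side pool of streamed online trajectories plus offline demos."""

    def __init__(self, offline_demos):
        self.lock = threading.Lock()
        self.online = []
        self.offline = list(offline_demos)

    def add(self, traj):
        with self.lock:
            self.online.append(traj)


def actor_loop(robot_id, shared_params, buffer, stop):
    """Robot side: act with the latest synced parameters, stream experience."""
    while not stop.is_set():
        time.sleep(0.1)  # stand-in for executing one episode
        buffer.add(Trajectory(robot_id, random.random() < 0.5, False))


def learner_loop(shared_params, buffer, stop):
    """Cloud side: update on mixed online/offline data, then push params."""
    while not stop.is_set():
        time.sleep(0.3)  # stand-in for one training step
        with buffer.lock:
            batch = buffer.online[-8:] + buffer.offline[:8]
        _ = batch  # gradient update on the mixed batch would happen here
        shared_params["version"] += 1  # minutes-scale fleet-wide sync


if __name__ == "__main__":
    stop = threading.Event()
    buf = ExperienceBuffer([Trajectory(-1, True, False)] * 16)
    params = {"version": 0}
    threads = [threading.Thread(target=actor_loop, args=(i, params, buf, stop))
               for i in range(4)]
    threads.append(threading.Thread(target=learner_loop, args=(params, buf, stop)))
    for t in threads:
        t.start()
    time.sleep(2)
    stop.set()
    for t in threads:
        t.join()
    print(f"collected {len(buf.online)} trajectories, param version {params['version']}")
```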
The architecture adopts an asynchronous Actor–Learner model:
Actor (Robot Side)
Robots running the same policy in different locations execute diverse tasks, continuously gathering success, failure, and human‑intervention interaction data. Each robot’s experience is sent to a centralized cloud Experience Buffer.
Learner (Cloud Side)
All trajectories are uploaded in real time, forming a data pool of online interactions and offline expert demonstrations. A dynamic resampling strategy adjusts the online/offline data ratio based on task performance, maximizing the utility of real‑world experience.
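The article does not publish the resampling schedule. One plausible reading, sketched below, shifts batches toward fresh online data while the task still fails often and keeps expert demonstrations to anchor generalization; the linear schedule and its [0.2, 0.8] bounds are assumptions, not the published strategy:

```python
# Illustrative dynamic resampling: the online fraction of each batch
# grows with the recent failure rate. The schedule here is a guess;
# the source only states that the ratio tracks task performance.
import random


def mixed_batch(online, offline, recent_success_rate,
                batch_size=32, min_online=0.2, max_online=0.8):
    online_frac = min_online + (1.0 - recent_success_rate) * (max_online - min_online)
    n_online = min(int(batch_size * online_frac), len(online))
    batch = random.sample(online, n_online)
    batch += random.sample(offline, min(batch_size - n_online, len(offline)))
    return batch
```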
Instant Parameter Sync
Updated model parameters are propagated to the entire robot fleet within minutes, ensuring consistent evolution and stable online training.
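One way such minutes-scale synchronization could work is a pull-based parameter server with versioned checkpoints, as sketched below; ParameterServer and RobotSyncClient are hypothetical names, not part of the SOP release:

```python
# Hypothetical pull-based parameter sync: the learner publishes
# versioned checkpoints, and each robot hot-swaps weights between
# episodes, so the whole fleet converges within one polling interval.
class ParameterServer:
    def __init__(self):
        self.version = 0
        self.params = None

    def publish(self, params):
        """Called by the cloud learner after each checkpoint."""
        self.params = params
        self.version += 1

    def latest(self):
        return self.version, self.params


class RobotSyncClient:
    def __init__(self, server):
        self.server = server
        self.version = -1
        self.params = None

    def maybe_sync(self):
        """Poll between episodes; swap in newer weights if available."""
        version, params = self.server.latest()
        if version > self.version:
            self.version, self.params = version, params  # hot-swap weights
        return self.version
```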
SOP is a generic framework that can plug in any post‑training algorithm. The authors integrated HG‑DAgger (interactive imitation learning) and RECAP (offline reinforcement learning) as representative algorithms.
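Because any post-training algorithm can plug in, the integration point could be as small as a shared update interface. The sketch below shows that shape; the two classes are loose simplifications of how HG-DAgger-style and RECAP-style updates might differ, not the published methods:

```python
# Anything exposing an `update(policy, batch)` step fits the SOP loop.
# Both classes below are illustrative simplifications only.
from typing import Any, List, Protocol


class PostTrainingAlgorithm(Protocol):
    def update(self, policy: Any, batch: List[Any]) -> dict: ...


class HGDAggerStyle:
    """Interactive imitation: supervise on human-intervention segments."""

    def update(self, policy, batch):
        corrections = [t for t in batch if t.intervened]
        # behavior-clone `policy` on `corrections` here (omitted)
        return {"num_corrections": len(corrections)}


class RECAPStyle:
    """Offline-RL flavor: weight updates by task outcome labels."""

    def update(self, policy, batch):
        returns = [1.0 if t.success else 0.0 for t in batch]
        # advantage-weighted update using `returns` here (omitted)
        return {"mean_return": sum(returns) / max(len(returns), 1)}
```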
Key Advantages
Efficient State-Space Exploration: Distributed parallel exploration dramatically increases state-action coverage, overcoming the limits of single-robot online learning.
Mitigating Distribution Shift: Every robot always runs inference with the latest low-latency policy, improving stability and consistency.
Preserving Generalization While Boosting Performance: Scaling training across robots in parallel (space) rather than training one robot for longer (time) raises task success rates without degrading the VLA's broad capabilities.
Experimental Evaluation
The authors evaluated SOP on three dimensions: performance gains, efficiency, and scaling laws.
Performance Gains
Across various test scenes, SOP‑enhanced post‑training yielded significant improvements. In a cluttered supermarket scenario, the HG‑DAgger + SOP combo achieved a 33% overall performance boost. For dexterous tasks such as folding clothes and box assembly, SOP raised success rates above 94% (98% for box assembly) and increased throughput for folding by 114%.
Learning Efficiency with Varying Robot Fleet Sizes
Experiments with fleets of 1, 2, and 4 robots under an equal total data budget showed that larger fleets reach higher performance within the same training time. With a 3-hour budget, a four-robot fleet reached 92.5% success (12% higher than a single robot) and trained 2.4× faster than a single robot.
Stability Across Pre‑Training Scales
Using 20 h, 80 h, and 160 h of multi-task pre-training data, the authors found that larger pre-training datasets yield higher base performance, but SOP consistently adds stable gains regardless of the initial model quality. Notably, SOP delivered a ~30% performance improvement with only 3 h of online experience, whereas adding 80 h of human expert data contributed merely a 4% boost, highlighting SOP's ability to break through the diminishing returns of offline pre-training.
Deployment‑Driven Evolution
When deployed in previously unseen real environments, robots initially experienced drops in success rate and throughput. After a few hours of SOP‑enabled online training, performance rebounded sharply, demonstrating robust adaptation to new tasks and settings. The authors argue that robots should be viewed as continuously evolving entities rather than static products; SOP turns deployment into the start of large‑scale learning.
Conclusion
SOP redefines the post‑training paradigm for VLA‑based robots, enabling distributed, online, and scalable learning that preserves generalization while boosting task‑specific performance. By turning robot fleets into collaborative learners, SOP opens a path toward truly intelligent, lifelong‑learning robots in the real world.
https://www.agibot.com/research/sop_zh