How LWD Redefines Embodied AI Training with Fleet‑Scale Reinforcement Learning

LWD (Learning While Deploying) introduces a distributed multi‑robot reinforcement‑learning framework that continuously improves VLA policies during real‑world deployment, leveraging DIVL, QAM, dynamic n‑step TD and an asynchronous actor‑learner architecture to achieve over 90% success on five‑minute tasks and outperform traditional behavior‑cloning, HG‑Dagger and RECAP baselines.

Machine Heart

After the release of the Generalist VLA model, Luo Jianlan's team unveiled LWD (Learning While Deploying), a new training paradigm that upgrades embodied robots from single-task specialists to generalist agents capable of handling a wide variety of real-world tasks.

Current VLA (Vision-Language-Action) models can translate visual inputs and natural-language commands into joint trajectories, but they remain unreliable on long-horizon tasks and under complex conditions. Failures are frequent, and engineers must manually record corner cases and provide dozens of tele-operation demonstrations to fine-tune the model, creating a “whack-a-mole” cycle in which each unseen situation stalls the robot.

LWD addresses this by performing large‑scale distributed reinforcement learning directly in the deployment environment. It starts from a pretrained VLA policy and continuously augments it with data collected from robot fleets—including expert demonstrations, successful and failed rollouts, and exploratory “play” interactions. The offline data pool is mixed online with fresh deployment data, and the updated policy is pushed back to the fleet, forming a closed‑loop data flywheel.
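The article describes this offline/online mixing only at a high level. One common way to realize such a data flywheel is a weighted replay buffer that samples each training batch partly from the fixed offline pool and partly from fresh fleet data; the Python sketch below is a minimal illustration under that assumption, and the MixedReplayBuffer class, its API, and the 50/50 mixing ratio are not taken from LWD itself.

```python
import random
from collections import deque


class MixedReplayBuffer:
    """Sample training batches from a fixed offline pool plus a growing online
    pool of fresh deployment trajectories (illustrative sketch only)."""

    def __init__(self, offline_transitions, online_capacity=100_000, online_fraction=0.5):
        self.offline = list(offline_transitions)      # demos, successes, failures, play data
        self.online = deque(maxlen=online_capacity)   # fresh rollouts uploaded by the fleet
        self.online_fraction = online_fraction        # assumed mixing ratio, not from the paper

    def add_online(self, transition):
        """Called as new deployment trajectories arrive from the robots."""
        self.online.append(transition)

    def sample(self, batch_size):
        """Draw a mixed batch; fall back to offline-only before any online data exists."""
        n_online = int(batch_size * self.online_fraction) if self.online else 0
        n_offline = batch_size - n_online
        batch = random.sample(self.offline, n_offline)
        if n_online:
            batch += random.choices(list(self.online), k=n_online)
        return batch
```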

The system introduces four key innovations:

DIVL (Distributional Implicit Value Learning): separates value estimation from policy extraction and fits a value distribution rather than a single scalar, allowing adaptive policy updates under sparse and heterogeneous rewards.
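The article does not spell out DIVL's loss. As a rough illustration of fitting a return distribution instead of a scalar value, the PyTorch sketch below uses standard quantile regression; the network shape, quantile count, and pinball loss are generic assumptions, not the paper's actual objective.

```python
import torch
import torch.nn as nn


class DistributionalValueHead(nn.Module):
    """Predict a set of return quantiles instead of one scalar V(s).
    Generic quantile-regression sketch, not LWD's DIVL loss."""

    def __init__(self, obs_dim, n_quantiles=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_quantiles))
        # fixed quantile fractions tau_1..tau_N in (0, 1)
        self.register_buffer("taus", (torch.arange(n_quantiles) + 0.5) / n_quantiles)

    def loss(self, obs, target_returns):
        # target_returns: (batch,) bootstrapped TD targets treated as return samples
        quantiles = self.net(obs)                               # (batch, n_quantiles)
        diff = target_returns.unsqueeze(1) - quantiles          # (batch, n_quantiles)
        # pinball (quantile regression) loss: |tau - 1{diff < 0}| * |diff|
        weight = torch.abs(self.taus - (diff.detach() < 0).float())
        return (weight * torch.abs(diff)).mean()
```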

QAM (Q-learning with Flow-Matching): avoids the costly likelihood computation of the flow-matching architecture by using a Q-learning objective that aligns actions along flow trajectories, eliminating the need for explicit action-likelihood gradients and reducing compute overhead.
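The exact QAM objective is not given in the article. One common way to pair a Q-function with a sampling-only flow policy, without ever needing action log-likelihoods, is best-of-N action selection scored by Q; the sketch below shows that idea and assumes a hypothetical flow_policy.sample interface and q_function signature, so it should not be read as the paper's method.

```python
import torch


def q_guided_action(flow_policy, q_function, obs, n_candidates=16):
    """Pick the highest-Q action among candidates sampled from a flow-matching
    policy. Only sampling is required, no action log-likelihoods.
    Generic best-of-N sketch; flow_policy.sample and q_function are assumed APIs."""
    obs_batch = obs.unsqueeze(0).expand(n_candidates, -1)         # (N, obs_dim)
    with torch.no_grad():
        candidates = flow_policy.sample(obs_batch)                # (N, action_dim), assumed API
        q_values = q_function(obs_batch, candidates).squeeze(-1)  # (N,)
    return candidates[q_values.argmax()]
```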

Dynamic n-step TD Strategy: automatically adjusts the TD horizon n based on task length and training stage (n=10 for offline long-horizon tasks, n=1 during online deployment) to accelerate credit assignment while keeping variance low.
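For reference, a bootstrapped n-step TD target has the standard form G_t^(n) = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n V(s_{t+n}); the small helper below computes it and is a generic sketch rather than LWD's training code.

```python
def n_step_td_target(rewards, values, t, n, gamma=0.99):
    """Bootstrapped n-step return G_t^(n) = sum_{k=0}^{n-1} gamma^k * r_{t+k}
    + gamma^n * V(s_{t+n}). `rewards` has length T and `values` has length
    T + 1, with the terminal value set to 0. Generic sketch, not LWD's code."""
    horizon = min(n, len(rewards) - t)                   # truncate at episode end
    target = sum((gamma ** k) * rewards[t + k] for k in range(horizon))
    target += (gamma ** horizon) * values[t + horizon]   # bootstrap from the critic
    return target


# Per the article: n = 10 during offline training on long-horizon tasks,
# n = 1 during online deployment.
```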

Segmented Asynchronous Actor-Learner Architecture: decouples the robot fleet (the actors) that collects data from the cloud-based learner that updates the policy. Actors upload trajectories asynchronously, while a central coordinator snapshots and synchronizes the data for consistent training. This yields a 41-second ingestion latency and a 38-second model-update latency.
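The article reports only the latencies of this pipeline. A single-process stand-in for the actor-learner split might look like the Python sketch below, where the trajectory queue, the policy-version dictionary, and the 16-actor loop are illustrative assumptions rather than LWD's actual infrastructure.

```python
import queue
import threading
import time

trajectory_queue = queue.Queue()   # actors -> learner (asynchronous uploads)
latest_policy = {"version": 0}     # snapshot the fleet pulls before each rollout


def actor(robot_id):
    """One robot in the fleet: run the current policy snapshot, then upload
    the finished trajectory without waiting for the learner."""
    while True:
        version = latest_policy["version"]
        trajectory = {"robot": robot_id, "policy_version": version, "steps": []}
        # ... execute the task and append (obs, action, reward) tuples ...
        trajectory_queue.put(trajectory)
        time.sleep(1.0)            # stand-in for real task execution time


def learner():
    """Cloud-side learner: drain uploaded trajectories, run RL updates on the
    mixed data pool, then publish a new policy snapshot."""
    while True:
        batch = [trajectory_queue.get()]
        while not trajectory_queue.empty():
            batch.append(trajectory_queue.get_nowait())
        # ... add `batch` to the replay buffer and take gradient steps ...
        latest_policy["version"] += 1


threads = [threading.Thread(target=actor, args=(i,), daemon=True) for i in range(16)]
threads.append(threading.Thread(target=learner, daemon=True))
for thread in threads:
    thread.start()
```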

To validate LWD, the authors deployed the system on 16 Agibot G1 dual-arm robots and evaluated eight real-world tasks, including four long-horizon operations (making Kung-Fu tea, juicing, mixing cocktails, and packing shoe boxes). On the four minute-level tasks, the online-trained LWD achieved an average success rate of 0.95, surpassing pure behavior cloning (0.76), HG-Dagger (0.85), and the state-of-the-art offline post-training method RECAP (0.85). On the most challenging long-horizon group, LWD scored 0.91, beating RECAP (0.77) and Dagger-SOP (0.73), while also reducing cycle time by 23.75 seconds.

The offline data pool comprised 652.5 hours of robot experience, of which 51.6% were perfect expert demonstrations and 34.8% were completely failed trajectories; both contributed valuable learning signals for the RL algorithm.

These results demonstrate that continuous post-deployment learning can break the “static-model ceiling” of VLA systems, much as RLHF transformed large language models. The authors argue that future generalist robots will be judged not by the amount of data baked in at launch, but by how quickly they can learn and adapt after being deployed across diverse real-world environments.


Tags: embodied AI, robotics, reinforcement learning, distributed training, VLA, LWD, offline-to-online RL
Written by Machine Heart, a professional AI media and industry service platform.