Real-World Large-Scale Test Shows Robots Learning While Deploying Outperform Baselines on Eight Tasks
The article presents the LWD (Learning While Deploying) framework, detailing its reinforcement‑learning‑driven data flywheel, the DIVL value‑evaluation and QAM policy‑optimization modules, and experimental results where a dual‑arm robot improves success rates by up to 17% and reduces cycle time by 23.75 seconds across eight real‑world tasks, surpassing strong baselines.
Robotics has moved from stage shows to everyday life, but real‑world deployment remains costly because debugging hardware is time‑consuming, risky, and the data generated—especially failure trajectories—are rarely turned into useful learning signals.
Research cost ≈ real‑robot debugging time × hardware depreciation × data under‑utilization
To address the low data utilization, the SOP (Scalable Online Post‑training) system was introduced last year, enabling robots to learn while operating. Building on this, the LWD (Learning While Deploying) framework creates a reinforcement‑learning‑driven closed‑loop flywheel: robots execute tasks, upload all interaction data to the cloud, update policies via RL, redeploy improved policies, and repeat.
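The closed loop described above can be sketched as a toy simulation. This is only an illustrative sketch: `Episode`, `Cloud`, `run_robot`, and `flywheel` are hypothetical names, the rollout is a coin flip, and the "RL update" is a placeholder that just bumps a version counter. It shows the order of the steps (execute, upload everything, update, redeploy), not the actual LWD implementation.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Episode:
    policy_version: int  # which policy produced this episode
    success: bool        # sparse outcome signal


@dataclass
class Cloud:
    buffer: list = field(default_factory=list)
    policy_version: int = 0

    def upload(self, episodes):
        # Keep every episode, failures included.
        self.buffer.extend(episodes)

    def update_policy(self):
        # Placeholder for the RL update on the pooled buffer.
        self.policy_version += 1
        return self.policy_version


def run_robot(policy_version, n_episodes, rng):
    # Placeholder rollout: the 0.5 success probability is made up.
    return [Episode(policy_version, rng.random() < 0.5) for _ in range(n_episodes)]


def flywheel(rounds, fleet_size, episodes_per_robot, seed=0):
    rng = random.Random(seed)
    cloud = Cloud()
    deployed = cloud.policy_version
    for _ in range(rounds):
        for _ in range(fleet_size):                 # 1. robots execute tasks
            episodes = run_robot(deployed, episodes_per_robot, rng)
            cloud.upload(episodes)                  # 2. upload all interaction data
        deployed = cloud.update_policy()            # 3. update, 4. redeploy
    return cloud
```

Because the buffer mixes episodes from several policy versions, the learner sees exactly the kind of heterogeneous, drifting data that the first challenge below refers to.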
The flywheel faces three technical challenges:
Heterogeneous data and continual distribution drift across fleets and tasks.
Extremely sparse reward signals in long‑horizon operations.
VLA (Vision‑Language‑Action) generators produce actions via flow‑matching, which is incompatible with standard policy‑gradient methods.
LWD tackles the first two challenges with DIVL (value‑evaluation module) and the third with QAM (policy‑optimization module).
DIVL – Value Evaluation Module
Traditional implicit Q‑learning (IQL) regresses a scalar Q‑value, which overfits in fleet‑scale settings with mixed trajectories and sparse rewards. DIVL instead learns a full value probability distribution, providing uncertainty‑aware confidence intervals.
Its four core designs are:
Classification‑style value distribution: each state maintains a discrete distribution over possible returns.
Quantile bootstrap target: during TD target computation, the upper quantile of the predicted distribution is used, preserving IQL’s high‑value bias while handling sparsity and drift.
Adaptive entropy‑based scaling: the normalized entropy H(s) signals uncertainty; high entropy lowers the quantile threshold τ to avoid over‑optimism, while low entropy raises τ to pursue higher returns.
Dynamic n‑step TD strategy: offline training uses multi‑step updates (e.g., 10‑step) to propagate long‑range rewards quickly; online deployment switches to single‑step updates to reduce variance and ensure stability.
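The four designs above can be illustrated together in a few lines of plain Python. This is a minimal sketch under stated assumptions, not DIVL itself: the bin layout, the linear mapping from entropy to τ, the bounds `tau_min=0.6` and `tau_max=0.9`, and the discount `gamma=0.99` are all hypothetical choices made for illustration, and the learned network is replaced by hand-written probability vectors.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def adaptive_tau(probs, tau_min=0.6, tau_max=0.9):
    """High entropy pulls tau toward tau_min, low entropy toward tau_max.

    The linear interpolation and both bounds are assumptions.
    """
    h = normalized_entropy(probs)
    return tau_max - h * (tau_max - tau_min)


def quantile(bins, probs, tau):
    """Smallest bin value whose cumulative probability reaches tau."""
    cum = 0.0
    for value, p in zip(bins, probs):
        cum += p
        if cum >= tau - 1e-12:  # small tolerance for float round-off
            return value
    return bins[-1]


def bootstrap_target(rewards, next_bins, next_probs, gamma=0.99):
    """n-step TD target with an upper-quantile bootstrap.

    n is len(rewards): pass about 10 rewards offline, 1 reward online.
    """
    n = len(rewards)
    n_step_return = sum((gamma ** i) * r for i, r in enumerate(rewards))
    tau = adaptive_tau(next_probs)
    return n_step_return + (gamma ** n) * quantile(next_bins, next_probs, tau)
```

With return bins `[0.0, 0.25, 0.5, 0.75, 1.0]`, a peaked prediction such as `[0.0, 0.0, 0.0, 0.1, 0.9]` has low entropy and gets a higher τ than a uniform prediction, so its bootstrap value is read from further up the distribution.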
QAM – Policy Optimization Module
With accurate value estimates, LWD improves policies using Q‑learning with Adjoint Matching (QAM). Modern VLA models generate actions by flow‑matching, integrating from noise a⁰ to an action a¹ over many steps. Back‑propagating a value gradient through that whole multi‑step chain with standard methods is computationally expensive and numerically unstable.
QAM reformulates policy optimization as a local regression target along the generated trajectory, using the gradient supplied by DIVL’s value network at the trajectory endpoint. This avoids costly back‑propagation through the entire generation chain and provides a smooth direction for the action network to shift toward higher returns.
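The idea of regressing toward a target built from the endpoint gradient, instead of differentiating through the generation chain, can be shown on a deliberately tiny example. The sketch below is not adjoint matching and not the QAM algorithm: it assumes a one-dimensional action, a constant velocity field, a hand-written critic Q(a) = -(a - a_star)², and a hypothetical `step_size`. All function names are illustrative.

```python
def generate(a0, velocity, steps=10):
    """Euler-integrate da/dt = velocity from t=0 to t=1; return visited actions."""
    dt = 1.0 / steps
    path = [a0]
    for _ in range(steps):
        path.append(path[-1] + velocity * dt)
    return path


def critic_grad(a, a_star=0.8):
    """Gradient of the toy critic Q(a) = -(a - a_star)**2 with respect to a."""
    return -2.0 * (a - a_star)


def regression_targets(path, velocity, step_size=0.1):
    """One local target per generation step, built from the endpoint gradient only."""
    g = critic_grad(path[-1])  # the critic is queried once, at the endpoint a1
    return [velocity + step_size * g for _ in path[:-1]]


def fit_velocity(targets):
    """Least-squares fit of a constant velocity is the mean of the targets."""
    return sum(targets) / len(targets)


def improve(a0=0.0, velocity=0.2, iters=50):
    """Alternate generation and regression; return the final endpoint action."""
    for _ in range(iters):
        path = generate(a0, velocity)
        velocity = fit_velocity(regression_targets(path, velocity))
    return generate(a0, velocity)[-1]
```

Starting from an endpoint near 0.2, repeated regression moves the generated action toward the critic's optimum at 0.8 without ever differentiating through `generate`. A real VLA velocity field depends on the state, the time, and the current action, which is where the adjoint machinery in QAM comes in.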
Experimental Results
The research team evaluated LWD on the Agibot G1 dual‑arm robot across eight high‑difficulty tasks, split into two groups:
Supermarket restocking (4 tasks): shelf restocking, cold‑case restocking, door‑opening restocking, and error‑correction placement, testing semantic item recognition and instruction understanding.
Long‑horizon operations (4 tasks): making Kung‑Fu tea, mixing cocktails, juicing, and boxing shoes, each lasting 3–5 minutes with 5–7 intricate sub‑steps.
Key findings:
Improved success rates: In the Kung‑Fu tea task, success rose by 17%; in the juicing task, success increased by 16%.
Superior to strong baselines: LWD’s online version achieved an average success rate of 0.95, compared to 0.86 for RECAP and 0.82 for DAgger‑SOP.
Cycle‑time reduction: For the long‑horizon tasks, average cycle time decreased by 23.75 seconds, indicating more efficient action planning and execution.
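To put the average success rates above in perspective, the short calculation below derives the absolute gap and the relative reduction in failure rate from the three reported figures. Only the values 0.95, 0.86, and 0.82 come from the article; the helper name `compare` and the choice of failure-rate reduction as a metric are illustrative.

```python
# Average success rates as reported in the article.
rates = {"LWD (online)": 0.95, "RECAP": 0.86, "DAgger-SOP": 0.82}


def compare(ours, baseline):
    """Return (absolute success gap, relative reduction in failure rate)."""
    gap = ours - baseline
    failure_cut = 1.0 - (1.0 - ours) / (1.0 - baseline)
    return round(gap, 2), round(failure_cut, 3)


for name in ("RECAP", "DAgger-SOP"):
    gap, cut = compare(rates["LWD (online)"], rates[name])
    print(f"vs {name}: +{gap:.2f} success, {cut:.1%} fewer failures")
```

Read this way, a 0.09 gap over RECAP corresponds to cutting the failure rate from 0.14 to 0.05, which is roughly a 64% reduction; against DAgger‑SOP the reduction is roughly 72%.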
These results demonstrate that the LWD flywheel enables robots to continuously refine their policies from real‑world interactions, turning every failure into a learning opportunity.
Conclusion
From SOP to LWD, the team has consistently pushed robot training into the real world, turning the act of working itself into a performance‑boosting engine. LWD adds a “self‑driving assistance system” that extracts high‑value optimization signals from every interaction, especially errors, shifting the driver of robot evolution from costly external commands to emergent internal experience.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
