How 100,000 Hours of Human Data Propelled Psi‑R2 to Lead MolmoSpaces

Lingchu AI demonstrates that scaling human‑operation data to nearly 100,000 hours, combined with a two‑model system and reinforcement learning, can replace costly robot‑teleoperation data and achieve top performance on the MolmoSpaces benchmark.

Machine Heart

Embodied intelligence is entering a new stage in which relying solely on real-robot teleoperation data is insufficient for large-scale deployment. The article argues that the next competitive edge lies in converting massive human-operation experience into capabilities that robots can learn from and iterate on.

What Was Released

On April 10, Lingchu AI announced a suite consisting of the policy model Psi-R2, the world model Psi-W0, and close to 100,000 hours of human-operation data (including an open-source 1,000-hour subset). The total comprises 5,417 hours of real-robot data collected via the in-house MobiDex platform and 95,472 hours of multi-scene, multi-task, multi-object human hand data.

Why Human Data?

Unlike large‑language models or autonomous driving, embodied AI lacks an internet‑scale data dividend, making data acquisition the industry’s core bottleneck. Human hands naturally generate high‑frequency, fine‑grained manipulation data in real environments, providing realistic timing and detail that laboratory demos cannot capture.

Challenges of Human Data

The primary difficulty is the embodiment gap: kinematic and dynamic differences between human and robotic hands. Additionally, many human recordings are first-person videos with only centimeter-level trajectory precision, which is inadequate for sub-millimeter tasks such as phone assembly.

To address precision, Lingchu developed exoskeleton tactile gloves and high‑precision perception hardware for high‑fidelity 3D hand trajectory capture, while retaining larger‑scale raw hand data for generalization.

Methodology: Raw‑Data‑In, Raw‑Data‑Out

Rather than extensive alignment techniques (image inpainting, key‑point‑assisted loss, feature‑space alignment), Lingchu found that such methods help only at small data scales and become bottlenecks when data volume grows. The final approach aligns only input‑output dimensions, mapping human joints to robot joints via kinematic transformation and feeding raw images directly to the model.
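To make the "align only input-output dimensions" idea concrete, here is a minimal sketch of mapping human joint angles onto robot joints and clamping to the robot's limits. The joint names and limit values are illustrative assumptions, not Lingchu's actual robot specification; images would simply pass through to the model unchanged.

```python
# Hypothetical joint correspondence: names are illustrative only.
HUMAN_TO_ROBOT = {
    "thumb_mcp": "gripper_joint_0",
    "index_mcp": "gripper_joint_1",
}

# Illustrative robot joint limits in radians.
ROBOT_LIMITS = {
    "gripper_joint_0": (0.0, 1.2),
    "gripper_joint_1": (0.0, 1.5),
}

def retarget(human_angles: dict) -> dict:
    """Map human joint angles onto robot joints, clamping to robot limits.

    Only the input/output dimensions are aligned; no image inpainting,
    key-point loss, or feature-space alignment is applied.
    """
    robot_angles = {}
    for human_joint, robot_joint in HUMAN_TO_ROBOT.items():
        angle = human_angles.get(human_joint, 0.0)
        lo, hi = ROBOT_LIMITS[robot_joint]
        robot_angles[robot_joint] = min(max(angle, lo), hi)
    return robot_angles

print(retarget({"thumb_mcp": 1.5, "index_mcp": 0.7}))
```

The thumb angle of 1.5 rad exceeds the illustrative robot limit and is clamped to 1.2, while the index angle passes through unchanged.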

Model Roles

Psi-R2 learns "how to do" from human data, taking images and language as input and outputting future video frames and robot actions, effectively predicting how the world will evolve. After large-scale pre-training, it requires fewer than 100 real-robot trajectories for fine-tuning to accomplish long-horizon, precise tasks such as phone assembly, industrial packaging, and box stacking.

Psi‑W0 complements Psi‑R2 by modeling failures, counterfactuals, and trial‑and‑error space. Its training incorporates roughly 30% failure data, enabling it to evaluate and refine strategies generated by Psi‑R2 through rollout and subsequent reinforcement‑learning corrections.

System Synergy

The effective pipeline is: human data → Psi‑R2 learns task knowledge → trajectories are rolled out in Psi‑W0 → reinforcement learning refines trajectories to satisfy robot dynamics, creating a feedback loop where good trajectories enrich the training set and bad ones improve the world model.
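The feedback loop above can be sketched as a toy simulation. Everything here is a stand-in: `policy_propose` plays the role of Psi-R2, `world_model_score` plays the role of Psi-W0, and the scoring rule is invented for illustration. The point is only the data routing: high-scoring rollouts enrich the policy's training set, while failures become training signal for the world model.

```python
import random

random.seed(0)

def policy_propose(task):
    """Stand-in for Psi-R2: propose a candidate trajectory for a task."""
    return [random.uniform(-1, 1) for _ in range(5)]

def world_model_score(traj):
    """Stand-in for Psi-W0: predicted success of a rolled-out trajectory.

    Toy rule: smoother trajectories score higher (real scoring would come
    from rolling the trajectory out in the learned world model).
    """
    jerk = sum(abs(b - a) for a, b in zip(traj, traj[1:]))
    return max(0.0, 1.0 - jerk / 4.0)

training_set, failure_set = [], []
for _ in range(20):
    traj = policy_propose("stack_boxes")
    score = world_model_score(traj)
    if score > 0.5:
        training_set.append(traj)   # good rollouts enrich policy training data
    else:
        failure_set.append(traj)    # failures improve the world model

print(len(training_set), "kept,", len(failure_set), "routed to world model")
```

In the real system, reinforcement learning would additionally refine the kept trajectories so they satisfy robot dynamics before they re-enter training.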

Engineering Optimizations

Inference latency was reduced from 2.2 seconds to under 100 ms using DiT caching, Torch‑Compile, and quantization, meeting deployment thresholds for smooth, agile manipulation.
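Of the three optimizations named, caching is the easiest to illustrate. The sketch below is a toy, not Lingchu's implementation: a slow "encoder" stands in for an expensive conditioning pass inside a diffusion-transformer loop, and `functools.lru_cache` memoizes it so fifty denoising steps pay the encoding cost only once. (Compilation and quantization would be applied with framework tools such as `torch.compile` and are not shown.)

```python
import time
from functools import lru_cache

@lru_cache(maxsize=None)
def encode_condition(prompt):
    """Toy stand-in for an expensive conditioning/encoder pass."""
    time.sleep(0.01)  # simulate the expensive computation
    return tuple(ord(c) % 7 for c in prompt)

def denoise_step(latent, features):
    """Toy stand-in for one DiT denoising step."""
    return [x * 0.9 + f * 0.01 for x, f in zip(latent, features)]

latent = [1.0, 2.0, 3.0]
start = time.perf_counter()
for _ in range(50):
    # The prompt is encoded once; 49 subsequent calls hit the cache.
    features = encode_condition("assemble the phone")
    latent = denoise_step(latent, features[:3])
elapsed = time.perf_counter() - start
print(f"50 steps in {elapsed * 1000:.1f} ms")
```

Without the cache the loop would pay the encoder cost on every step; with it, total time collapses toward a single encoder pass plus the cheap denoising arithmetic.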

Benchmark Results

On the MolmoSpaces Combined leaderboard (excluding MolmoBot data), Psi-R2 achieved the highest Oracle Success Rate of 46.4 across four tasks, surpassing PI, DreamZero, and other international models. MolmoSpaces, initiated by AllenAI, is a leading benchmark in embodied AI with participation from NVIDIA, PI, and other top teams.

Industry Insights from the Release

Task diversity > object diversity >> scene diversity.

Precise 3D pose > tactile > 2D image features.

These observations suggest that model limits are driven more by the breadth of tasks and objects encountered and the fidelity of contact and manipulation details than by scene complexity.

Conclusion

The article positions human data as the mainline, not a shortcut, and emphasizes that the true breakthrough comes from the integrated system of Psi-R2, Psi-W0, and reinforcement learning rather than any single model. Commercial viability hinges on iteration cadence, cost, inference speed, and a sustainable data flywheel.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Embodied AI, reinforcement learning, large-scale dataset, robotic manipulation, human data, Psi-R2, Psi-W0
Written by Machine Heart, a professional AI media and industry service platform.