How 100,000 Hours of Human Data Propelled Psi‑R2 to Lead MolmoSpaces
Lingchu AI demonstrates that scaling human‑operation data to nearly 100,000 hours, combined with a two‑model system and reinforcement learning, can replace costly robot‑teleoperation data and achieve top performance on the MolmoSpaces benchmark.
Embodied intelligence is reaching a new narrative stage where relying solely on real‑robot tele‑operation data is insufficient for large‑scale deployment. The article argues that the next competitive edge lies in converting massive human‑operation experience into robot‑learnable, iterative capabilities.
What Was Released
On April 10, Lingchu AI announced a suite consisting of the strategy model Psi‑R2, the world model Psi‑W0, and close to 100,000 hours of human‑operation data (including an open‑source 1,000‑hour subset). The data comprises 5,417 hours of real‑robot data collected via the in‑house MobiDex platform and 95,472 hours of multi‑scene, multi‑task, multi‑object human hand data.
Why Human Data?
Unlike large‑language models or autonomous driving, embodied AI lacks an internet‑scale data dividend, making data acquisition the industry’s core bottleneck. Human hands naturally generate high‑frequency, fine‑grained manipulation data in real environments, providing realistic timing and detail that laboratory demos cannot capture.
Challenges of Human Data
The primary difficulty is the embodiment gap: the kinematic and dynamic differences between human and robotic hands. In addition, many human recordings are first‑person videos with only centimeter‑level trajectory precision, which is inadequate for sub‑millimeter tasks such as phone assembly.
To address precision, Lingchu developed exoskeleton tactile gloves and high‑precision perception hardware for high‑fidelity 3D hand trajectory capture, while retaining larger‑scale raw hand data for generalization.
Methodology: Raw‑Data‑In, Raw‑Data‑Out
Rather than relying on extensive alignment techniques (image inpainting, key‑point‑assisted loss, feature‑space alignment), Lingchu found that such methods help only at small data scales and become bottlenecks as data volume grows. The final approach aligns only input‑output dimensions: human joints are mapped to robot joints via a kinematic transformation, and raw images are fed directly to the model.
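The dimension‑only alignment described above can be sketched as a simple kinematic retargeting step. This is a minimal illustration, not Lingchu's actual mapping: the joint counts, the retargeting matrix, and the joint limits are all hypothetical assumptions.

```python
import numpy as np

HUMAN_JOINTS = 21   # e.g., a 21-keypoint hand model (assumed)
ROBOT_JOINTS = 16   # e.g., a 16-DoF dexterous hand (assumed)

# A fixed retargeting matrix projects human joint angles onto the robot's
# joint space; it is random here for illustration, whereas a real system
# would derive it from the two kinematic chains.
rng = np.random.default_rng(0)
RETARGET = rng.normal(size=(ROBOT_JOINTS, HUMAN_JOINTS)) / HUMAN_JOINTS

def retarget(human_angles, limits=(-1.5, 1.5)):
    """Map human joint angles to robot joint angles, clipped to joint limits."""
    robot_angles = RETARGET @ human_angles
    return np.clip(robot_angles, *limits)

sample = rng.normal(size=HUMAN_JOINTS)
robot_cmd = retarget(sample)
print(robot_cmd.shape)  # (16,)
```

The point of the sketch is that nothing about image content is aligned; only the action dimensions are matched before raw data flows into training.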
Model Roles
Psi‑R2 learns “how to do” from human data, taking images and language as input and outputting future video frames and robot actions, effectively predicting how the world will evolve. After large‑scale pre‑training, it requires fewer than 100 real‑robot trajectories for fine‑tuning to accomplish long‑horizon, precise tasks such as phone assembly, industrial packaging, and box stacking.
Psi‑W0 complements Psi‑R2 by modeling failures, counterfactuals, and trial‑and‑error space. Its training incorporates roughly 30% failure data, enabling it to evaluate and refine strategies generated by Psi‑R2 through rollout and subsequent reinforcement‑learning corrections.
System Synergy
The effective pipeline is: human data → Psi‑R2 learns task knowledge → trajectories are rolled out in Psi‑W0 → reinforcement learning refines trajectories to satisfy robot dynamics, creating a feedback loop where good trajectories enrich the training set and bad ones improve the world model.
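The loop above can be sketched with toy stand‑ins for the two models: a policy proposes trajectories, a world model scores rollouts, and each rollout lands in either the training buffer or the failure buffer. All names, the threshold, and the scoring rule are illustrative assumptions, not the actual Psi‑R2/Psi‑W0 interface.

```python
import random

def policy_propose(task, rng):
    """Stand-in for Psi-R2: propose a candidate trajectory for a task."""
    return [rng.uniform(-1, 1) for _ in range(5)]

def world_model_rollout(traj, rng):
    """Stand-in for Psi-W0: roll out a trajectory, return a success score."""
    return 1.0 - min(1.0, sum(abs(a) for a in traj) / 5) + rng.uniform(-0.1, 0.1)

def refine_loop(task, iters=50, threshold=0.7, seed=0):
    rng = random.Random(seed)
    train_buffer, failure_buffer = [], []
    for _ in range(iters):
        traj = policy_propose(task, rng)
        score = world_model_rollout(traj, rng)
        if score >= threshold:
            train_buffer.append(traj)    # good rollouts enrich policy training
        else:
            failure_buffer.append(traj)  # failures improve the world model
    return train_buffer, failure_buffer

good, bad = refine_loop("stack boxes")
print(len(good) + len(bad))  # 50
```

In the real system the "score" would come from rollouts in the learned world model and the refinement would be a reinforcement‑learning update rather than a simple filter, but the data routing, successes feeding the policy and failures feeding the world model, is the same.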
Engineering Optimizations
Inference latency was reduced from 2.2 seconds to under 100 ms using DiT caching, Torch‑Compile, and quantization, meeting deployment thresholds for smooth, agile manipulation.
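Of the three levers named (DiT caching, Torch‑Compile, quantization), quantization is the easiest to illustrate in isolation. Below is a generic symmetric int8 weight‑quantization sketch, not Lingchu's implementation: weights are stored in 8 bits and dequantized on the fly, trading a small reconstruction error for memory bandwidth and speed.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: returns (int8 weights, scale)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from int8 storage."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(size=(256, 256)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(q.dtype)  # int8; max error stays within half a quantization step
```

DiT caching (reusing diffusion‑transformer activations across denoising steps) and `torch.compile` graph compilation attack the same latency budget from different angles; combined with quantization they account for the reported 2.2 s → <100 ms reduction.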
Benchmark Results
On the MolmoSpaces Combined leaderboard (excluding MolmoBot data), Psi‑R2 achieved the highest Oracle Success Rate of 46.4 across the four covered tasks, surpassing PI, DreamZero, and other international models. MolmoSpaces, initiated by AllenAI, is a leading benchmark in embodied AI with participation from NVIDIA, PI, and other top teams.
Industry Insights from the Release
Task diversity > object diversity >> scene diversity.
Precise 3D pose > tactile > 2D image features.
These observations suggest that model limits are driven more by the breadth of tasks and objects encountered and the fidelity of contact and manipulation details than by scene complexity.
Conclusion
The article positions human data as the mainline, not a shortcut, and emphasizes that the true breakthrough comes from the integrated system of Psi‑R2, Psi‑W0, and reinforcement learning rather than any single model. Commercial viability hinges on iteration cadence, cost, inference speed, and a sustainable data flywheel.