Zero‑Shot Dual‑Arm Robot Learning from 30 Minutes of Human Egocentric Video (HumanEgo)

HumanEgo shows that a single 30‑minute egocentric video captured with wearable Aria glasses can train a dual‑arm robot policy that reaches 92.5% average success on four real‑world tasks, transfers zero‑shot across robots, cameras, and environments, and outperforms a tele‑operation baseline while requiring far less data.

Machine Heart

Problem Motivation

Traditional robot learning relies on tele‑operation data, which is expensive, requires fully equipped labs, and couples the data tightly to specific robot hardware, making reuse across platforms difficult.

HumanEgo Overview

HumanEgo proposes a new data interface: minutes of first‑person video recorded with wearable Meta Aria glasses. The pipeline converts raw video into an Interaction‑Centric Token (ICT) representation that abstracts away the embodiment of the human hand and focuses on hand‑object interaction geometry.

Key Representation

Each entity (hand or object) is encoded as a 29‑dimensional ICT describing 6‑DoF pose, relative hand pose, and grasp state.

Entities are detected with Grounding DINO + SAM2, tracked across frames with CoTracker3, and oriented using Orient‑Anything.

During grasp, a kinematic latching step rigidly binds the object pose to the hand to maintain a stable representation despite occlusion.
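The latching step can be pictured as follows. This is a minimal sketch, assuming poses are 4x4 homogeneous transforms in a shared world frame; the class `KinematicLatch`, the helper `make_pose`, and the convention of passing `None` for an occluded object are hypothetical and not taken from the source.

```python
import numpy as np


def make_pose(rotation_z_rad, translation):
    """Build a 4x4 homogeneous transform: rotation about z plus a translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose


class KinematicLatch:
    """Hypothetical helper: rigidly bind an object pose to the hand while grasped."""

    def __init__(self):
        # Object pose expressed in the hand frame, frozen at grasp onset.
        self.hand_T_object = None

    def update(self, world_T_hand, world_T_object_observed, grasping):
        """Return the object pose to use for this frame.

        world_T_object_observed may be None when the object is occluded.
        """
        if not grasping:
            # Released: drop the latch and trust the tracker again.
            self.hand_T_object = None
            return world_T_object_observed
        if self.hand_T_object is None:
            if world_T_object_observed is None:
                raise ValueError("need one observed object pose at grasp onset")
            # Latch: remember where the object sits relative to the hand.
            self.hand_T_object = np.linalg.inv(world_T_hand) @ world_T_object_observed
        # While grasping, the object follows the hand rigidly.
        return world_T_hand @ self.hand_T_object
```

Once latched, the object pose depends only on the hand pose, so frames in which the hand hides the object still yield a consistent estimate.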

Vision Processing

The visual front‑end removes the human arm using SAM2 + LaMa, renders a virtual parallel‑jaw gripper, and composites it back into the scene, producing robot‑agnostic observations without costly domain adaptation.
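The final compositing step can be illustrated with a per‑pixel alpha blend. The sketch below assumes that arm removal and inpainting have already produced a clean background and that a renderer has produced the gripper image together with a coverage mask; the function name `composite_gripper` and the array conventions are assumptions, not the authors' implementation.

```python
import numpy as np


def composite_gripper(background, gripper_rgb, gripper_alpha):
    """Alpha-blend a rendered gripper over an inpainted background.

    background:    H x W x 3 float array in [0, 1] (arm already removed)
    gripper_rgb:   H x W x 3 float array in [0, 1] (rendered gripper)
    gripper_alpha: H x W float array in [0, 1] (1 where the gripper covers a pixel)
    """
    alpha = gripper_alpha[..., None]  # broadcast the mask over the colour channels
    return alpha * gripper_rgb + (1.0 - alpha) * background
```

A soft mask (values between 0 and 1) gives anti‑aliased gripper edges; a binary mask reduces the blend to a simple paste.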

Policy Learning

The control policy is trained with flow matching, which the article describes as faster and more expressive than diffusion‑based policies or ACT. Three dense auxiliary objectives are added:

Object motion prediction (3‑D physical dynamics).

2‑D trajectory regression.

Latent consistency across 3‑D, 2‑D, and latent spaces.

These signals turn each short demonstration into multiple supervision signals, enabling effective learning from only ~60 trajectories (≈30 min of video).
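For readers unfamiliar with flow matching, the sketch below shows the standard conditional flow matching loss with a straight‑line path from noise to the demonstrated action, plus a weighted sum with auxiliary terms. It is a generic illustration, not the authors' training code: `velocity_fn` stands in for the policy network, observation conditioning is omitted, and the auxiliary weights are hypothetical.

```python
import numpy as np


def flow_matching_loss(velocity_fn, actions, rng):
    """Conditional flow matching loss with a linear noise-to-action path."""
    noise = rng.standard_normal(actions.shape)    # x_0 ~ N(0, I)
    t = rng.uniform(size=(actions.shape[0], 1))   # one time in [0, 1) per sample
    x_t = (1.0 - t) * noise + t * actions         # point on the straight path
    target_velocity = actions - noise             # d x_t / d t along that path
    predicted = velocity_fn(x_t, t)
    return float(np.mean((predicted - target_velocity) ** 2))


def total_loss(fm_loss, aux_losses, aux_weights):
    """Add weighted auxiliary losses (weights are assumed, not from the source)."""
    return fm_loss + sum(w * l for w, l in zip(aux_weights, aux_losses))


def sample_actions(velocity_fn, shape, rng, steps=10):
    """Generate actions by Euler-integrating the learned velocity from t=0 to t=1."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((shape[0], 1), i * dt)
        x = x + dt * velocity_fn(x, t)
    return x
```

At inference time only `sample_actions` is needed, and a handful of Euler steps is typically enough, which is one reason flow matching is often considered cheaper to sample from than iterative denoising.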

Experimental Evaluation

HumanEgo was evaluated on four real‑world dual‑arm tasks:

Serve Bread: pick up a loaf and place it centrally.

Downstack Cups: multi‑step cup stacking/unstacking.

Water Flowers: coordinated two‑arm watering.

Adjust Table: rotate a knob three full turns.

Each task was run 40 times and compared against five zero‑shot baselines (EgoZero, PointPolicy, ZeroMimic, Track2Act, SPOT) and a matched‑duration tele‑operation baseline (ACT). HumanEgo achieved an average success rate of 92.5%, an absolute gain of 41 percentage points over tele‑operation.

Data efficiency results show that 15 min of human video already surpasses 30 min of robot tele‑operation (75% vs 51% success), which the article reports as a 3.75× improvement in data efficiency.

Ablation Studies

Without the ICT representation, success falls to 32.5%. Adding ICT raises it to 85% (+52.5 pp), and the full model reaches 95%. Each dense auxiliary objective contributes an additional gain.
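Because the ablation figures are quoted in percentage points (pp), it helps to keep absolute and relative gains apart. The short sketch below recomputes both from the numbers quoted above; the function names are illustrative only.

```python
def percentage_point_gain(before_pct, after_pct):
    """Absolute gain: difference of two success rates, in percentage points."""
    return after_pct - before_pct


def relative_gain(before_pct, after_pct):
    """Relative gain: ratio of the two success rates."""
    return after_pct / before_pct


# Success rates quoted in the ablation paragraph.
without_ict, with_ict, full_model = 32.5, 85.0, 95.0

ict_pp = percentage_point_gain(without_ict, with_ict)    # absolute gain from ICT
ict_ratio = relative_gain(without_ict, with_ict)         # same gain as a ratio
aux_pp = percentage_point_gain(with_ict, full_model)     # remaining gain to full model
```

Here `ict_pp` reproduces the +52.5 pp figure, while `ict_ratio` shows the same change as a roughly 2.6‑fold increase; the two should not be used interchangeably.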

Zero‑Shot Generalization

A single trained policy was deployed without fine‑tuning to nine out‑of‑distribution conditions (different robot arms, cameras, lighting, backgrounds, and table heights). Success rates remained between 85% and 95%, demonstrating robust zero‑shot transfer.

Broader Implications

The authors argue that the bottleneck for robot learning is not data scarcity but the lack of a scalable data interface. By treating wearable first‑person video as a universal interface, HumanEgo turns robot data collection into a crowd‑sourced, everyday activity, opening the path to large‑scale, diverse robot skill acquisition.

Future Directions

Future work includes extending the paradigm to multi‑finger dexterity, long‑horizon industrial workflows, and continual learning from massive online egocentric video collections.
