LARYBench Introduces an ImageNet‑Style Benchmark for Embodied Action Representations Learned from Human Video
LARYBench (Latent Action Representation Yielding Benchmark) is the first systematic, ImageNet‑scale evaluation of implicit action representations learned from large‑scale human video. By decoupling representation quality from downstream control, it shows that general‑purpose vision models outperform specialized embodied models in both action generalization and control precision across diverse robot morphologies and environments.
Robots excel in fixed scenes but struggle to generalize to new environments because the embodied field lacks large‑scale, action‑annotated training data; abundant human video offers a promising alternative.
LARYBench (Latent Action Representation Yielding Benchmark) is a systematic evaluation suite that measures how well implicit action representations can be learned from massive visual data. Experiments demonstrate that general‑purpose vision models surpass specialized embodied models in both action generalization and control precision, indicating that embodied action representations can emerge from human video.
Challenges in Vision‑Language‑Action (VLA) Models
Data bottleneck: Precise robot action labels require costly teleoperation, while human videos lack executable action tags, creating a modality gap.
Representation bottleneck: Extracted action data are tightly bound to specific hardware, hindering cross‑morphology transfer; implicit representations aim to capture frame‑to‑frame changes independent of embodiment.
Paradigm bottleneck: Reliance on manual annotation limits embodied AI to fixed‑scene fine‑tuning; a data‑driven pre‑training path using unlabeled human video is needed.
LARYBench Design
The benchmark evaluates representations at two granularity levels—body‑level actions (end‑effector pose) and semantic actions (atomic or composite descriptions). It provides over one million annotated video segments (≈1000 h), 151 action types, 620 k image‑action pairs and 595 k motion trajectories, covering 11 robot morphologies and diverse environments.
Evaluation pipeline: a video clip or image pair is fed to a latent action model (LAM), which produces a representation z; a shallow probing head then evaluates z via either action‑expert regression (scored by MSE) or attentive classification (scored by accuracy).
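As a rough illustration of this probing protocol, the sketch below wires a frozen representation into two shallow heads: a regression probe for body‑level pose and an attentive classification probe for semantic actions. The module names, layer sizes, and the random tensors standing in for LAM outputs are illustrative assumptions, not the benchmark's actual code.

```python
import torch
import torch.nn as nn

class RegressionProbe(nn.Module):
    """Maps a frozen latent action representation z to a body-level action (e.g., end-effector pose)."""
    def __init__(self, z_dim: int, action_dim: int = 7):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))

    def forward(self, z):                              # z: [B, z_dim]
        return self.head(z)

class ClassificationProbe(nn.Module):
    """Attentive pooling over token-level z followed by a linear action classifier."""
    def __init__(self, z_dim: int, num_classes: int = 151):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, z_dim))
        self.attn = nn.MultiheadAttention(z_dim, num_heads=4, batch_first=True)
        self.fc = nn.Linear(z_dim, num_classes)

    def forward(self, z_tokens):                       # z_tokens: [B, T, z_dim]
        q = self.query.expand(z_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, z_tokens, z_tokens)   # [B, 1, z_dim]
        return self.fc(pooled.squeeze(1))

# Toy usage: random tensors stand in for frozen LAM outputs and labels.
B, T, z_dim = 8, 16, 768
z_tokens = torch.randn(B, T, z_dim)      # frozen representation of a frame pair / clip
pose_target = torch.randn(B, 7)          # body-level label
label = torch.randint(0, 151, (B,))      # semantic action label

reg_probe, cls_probe = RegressionProbe(z_dim), ClassificationProbe(z_dim)
mse = nn.functional.mse_loss(reg_probe(z_tokens.mean(dim=1)), pose_target)
acc = (cls_probe(z_tokens).argmax(dim=-1) == label).float().mean()
print(f"probe MSE: {mse.item():.3f}, probe accuracy: {acc.item():.3f}")
```

Only the probes are trained; the LAM stays frozen, so the scores reflect what the representation already encodes rather than what fine‑tuning can recover.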
Experimental Setup
Four datasets are used for body‑level regression: CALVIN, VLABench (both simulated single‑arm), RoboCOIN and AgiBotWorld‑Beta (real‑world dual‑arm). All models are scored by mean‑squared error (lower is better). For semantic classification, tasks are split into atomic, composite‑human and composite‑robot actions.
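For concreteness, here is a minimal sketch of how the body‑level scores might be aggregated across the four datasets. Only the dataset names and the lower‑is‑better MSE convention come from the setup above; the prediction and label arrays are random placeholders.

```python
import numpy as np

datasets = ["CALVIN", "VLABench", "RoboCOIN", "AgiBotWorld-Beta"]
rng = np.random.default_rng(0)

per_dataset_mse = {}
for name in datasets:
    preds = rng.normal(size=(1000, 7))   # probe predictions (placeholder)
    labels = rng.normal(size=(1000, 7))  # ground-truth body-level actions (placeholder)
    per_dataset_mse[name] = float(np.mean((preds - labels) ** 2))

avg_mse = float(np.mean(list(per_dataset_mse.values())))
print(per_dataset_mse)
print(f"average MSE (lower is better): {avg_mse:.3f}")
```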
Results
Body‑level regression: DINOv3 achieves the lowest average MSE (0.19) across the four datasets, while the specialized embodied model LAPA records a higher average MSE (0.97). Semantic‑level encoders (V‑JEPA‑2, DINOv3) slightly outperform pixel‑level VAE models, showing that body‑level information survives in higher‑level feature spaces.
Semantic classification: Semantic‑level general vision encoders consistently lead across all three task categories; embodied models lag behind, and generic LAMs sit in the middle. Visual self‑supervised learning captures action semantics better than image‑text contrastive methods.
Long‑tail analysis: Performance gaps between strong and weak models widen on low‑frequency actions, indicating that richer representations improve rare‑action generalization.
Attention visualization: For a “pour” sequence, V‑JEPA‑2 and DINOv3 focus attention on hand‑object interactions, whereas pixel‑level VAEs spread attention to irrelevant regions and the embodied LAPA shows almost no focused attention.
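This kind of overlay can be produced generically from any ViT‑style encoder: average attention over heads, take the CLS‑to‑patch row, reshape to the patch grid, and blend it with the frame. In the sketch below the attention tensor, frame, and 14×14 patch grid are stand‑in assumptions rather than actual outputs of V‑JEPA‑2 or DINOv3.

```python
import numpy as np
import matplotlib.pyplot as plt

heads, grid = 12, 14                                   # assumed 14x14 patch grid
n_tokens = grid * grid + 1                             # +1 for the CLS token
attn = np.random.rand(heads, n_tokens, n_tokens)       # placeholder per-head attention

cls_to_patches = attn.mean(axis=0)[0, 1:]              # average heads, CLS -> patch row
heatmap = cls_to_patches.reshape(grid, grid)
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)

frame = np.random.rand(224, 224, 3)                    # placeholder video frame
plt.imshow(frame)
plt.imshow(np.kron(heatmap, np.ones((16, 16))),        # nearest-neighbor upsample to 224x224
           cmap="jet", alpha=0.5)
plt.axis("off")
plt.savefig("attention_overlay.png", bbox_inches="tight")
```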
Ablation on LAPA‑DINOv3: Varying codebook size, sequence length, latent dimension, and learning rate reveals that moderate increases in sequence length and latent dimension boost performance, whereas codebook size has an optimal range rather than improving monotonically.
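As a sketch of how such an ablation grid could be enumerated (the specific values below are illustrative, not the ones actually swept, and the training call is a placeholder):

```python
from itertools import product

# Hypothetical sweep over the four factors varied in the ablation.
grid = {
    "codebook_size": [4, 8, 16, 32],
    "seq_len":       [1, 2, 4, 8],
    "latent_dim":    [32, 64, 128],
    "learning_rate": [1e-4, 3e-4, 1e-3],
}

def train_and_probe(cfg: dict) -> float:
    """Placeholder: would retrain the LAPA-DINOv3 variant with cfg and return probe accuracy."""
    raise NotImplementedError

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(f"{len(configs)} ablation runs, e.g. {configs[0]}")
```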
Insights
LARYBench decouples representation quality from downstream policies, offering a unified metric for cross‑embodiment generalization.
General‑purpose vision models can learn transferable action semantics from massive human video without explicit action supervision, exposing the limits of current embodied‑specific models.
The benchmark validates the scalability value of human video data for action representation learning and suggests a path toward data‑driven embodied intelligence.
The dataset, code, and evaluation scripts are open‑source on GitHub and will be continuously maintained.
Meituan Technology Team