LARYBench: The ImageNet‑Scale Benchmark Bridging Vision and Action for Embodied AI
LARYBench, the first large‑scale benchmark for embodied intelligence, quantifies implicit action representations across 1.2 million video clips, evaluates vision‑only and robot‑specific models, and reveals how general visual encoders can close the vision‑action modality gap.
Introduction
LARYBench is a systematic benchmark that quantifies the quality of implicit action representations learned from massive online video data. By providing a two‑dimensional evaluation, it decouples visual representation learning from downstream control policies.
Three Real‑World Bottlenecks for Embodied Intelligence
Data acquisition difficulty: Precise robot action annotations require costly tele‑operation, limiting scale, while human video libraries lack robot‑usable control signals.
Representation transferability: Action data are tightly bound to specific hardware, making cross‑platform feature reuse difficult.
Lack of unified metrics: Without an independent measure of intermediate representations, most models remain confined to task‑specific fine‑tuning.
Implicit action representations learned from spatio‑temporal video dynamics are proposed as the key to overcoming these challenges.
Metric Design
LARYBench defines a two‑dimensional evaluation:
Semantic action classification – measures how well the latent vector z captures intent, covering atomic and composite actions.
Physical pose regression – measures how accurately z predicts end‑effector pose parameters (7/12/16‑DoF); a minimal probing sketch for both metrics follows below.
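Concretely, both metrics can be probed with lightweight heads reading out a frozen encoder's latent embedding z. The sketch below is a minimal illustration, not the benchmark's exact probe: the latent dimension, the 151‑way classification head, and the 7‑DoF pose head are assumed values, and the encoder is treated as an opaque, frozen module.

```python
# Minimal probing sketch (illustrative only): two shallow heads read out a
# frozen latent action embedding z. Dimensions are assumptions, not the
# benchmark's published configuration.
import torch
import torch.nn as nn

class ActionProbe(nn.Module):
    def __init__(self, latent_dim: int = 1024, num_actions: int = 151, pose_dof: int = 7):
        super().__init__()
        self.cls_head = nn.Linear(latent_dim, num_actions)   # semantic action classification
        self.pose_head = nn.Linear(latent_dim, pose_dof)     # end-effector pose regression

    def forward(self, z: torch.Tensor):
        # z: (batch, latent_dim), produced by a frozen video encoder
        return self.cls_head(z), self.pose_head(z)

probe = ActionProbe()
z = torch.randn(8, 1024)           # stand-in for encoder output
logits, pose = probe(z)            # (8, 151) class logits, (8, 7) pose parameters
```

Because only the shallow probe is trained, differences in classification accuracy and pose error can be attributed to the quality of z rather than to the capacity of the downstream head.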
Dataset Composition
The benchmark aggregates over 1.2 million annotated video clips (more than 1,000 hours), 620k image‑action pairs, and 595k motion trajectories. Action categories are split into three hierarchical levels – body‑level, atomic semantic, and composite semantic – totaling 151 distinct actions, ranging from basic pick/place to long‑tail tasks such as shovel‑snow and float‑balloon.
Data span 11 robot platforms (e.g., Franka, Agilex Cobot, Realman, G1) and include diverse environments from residential kitchens to industrial scenes.
Data Pipeline
LARYBench employs a fully automated multi‑granularity engine. Video slicing, description matching, and feature normalization are handled by an algorithmic loop, dramatically improving processing efficiency for heterogeneous data.
The system introduces a Motion‑Guided Sampler (MGSampler) that computes inter‑frame motion intensity to ensure extracted sequences contain sufficient physical dynamics.
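The exact MGSampler implementation is not reproduced here; the sketch below illustrates the underlying idea under a simplifying assumption: inter‑frame motion intensity is approximated by mean absolute pixel differences, and frames are drawn at evenly spaced quantiles of the cumulative motion curve so that high‑motion segments are sampled more densely. The function name and the pixel‑difference proxy are assumptions.

```python
# Sketch of motion-guided frame sampling. Motion intensity is approximated by
# mean absolute inter-frame pixel difference; the actual sampler may use
# optical flow or feature-space differences instead.
import numpy as np

def motion_guided_sample(frames: np.ndarray, num_samples: int) -> np.ndarray:
    """frames: (T, H, W, C) video array; returns indices of sampled frames."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    diffs = diffs + 1e-6                          # guard against all-static segments
    cdf = np.cumsum(diffs) / diffs.sum()          # cumulative motion distribution
    # Place samples at evenly spaced quantiles of accumulated motion, so
    # high-motion segments contribute more frames than static ones.
    targets = (np.arange(num_samples) + 0.5) / num_samples
    idx = np.searchsorted(cdf, targets) + 1       # +1: np.diff shifts indices by one
    return np.clip(idx, 0, len(frames) - 1)
```

A plain uniform sampler would instead pick indices like np.linspace(0, T - 1, num_samples), which can miss short bursts of physical dynamics in otherwise static clips.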
Evaluation Protocol
Models first extract an implicit action embedding z. A shallow probing head then reads out the embedding to assess the two metrics above. The overall score aggregates semantic classification accuracy and pose regression error.
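The article does not spell out how the two numbers are combined, so the snippet below is only one plausible convention: map the regression error to a bounded score and average it with classification accuracy. The function name and the 1/(1+MSE) mapping are assumptions for illustration, not the benchmark's published formula.

```python
# Hypothetical aggregation of the two probing metrics into one overall score.
def overall_score(cls_accuracy: float, pose_mse: float, mse_scale: float = 1.0) -> float:
    """cls_accuracy in [0, 1]; pose_mse >= 0 on normalized pose targets."""
    pose_score = 1.0 / (1.0 + pose_mse / mse_scale)   # lower error -> higher score
    return 0.5 * (cls_accuracy + pose_score)

# With made-up numbers: overall_score(0.72, 0.19) ≈ 0.5 * (0.72 + 0.84) ≈ 0.78
```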
Experimental Analysis
Four representative paradigms are evaluated:
Embodied LAMs (explicitly designed for robotics)
Semantic‑level general visual encoders
Pixel‑level general visual encoders
General LAMs built on top of generic backbones
Key findings:
General visual encoders (e.g., V‑JEPA 2, DINOv3) outperform specialized Embodied LAMs on both semantic capture and low‑level control without any explicit action supervision.
In atomic vs. composite semantic classification, DINOv3 achieves an average MSE of 0.19 versus 0.30 for Wan2.2 VAE.
Stride ablation shows that increasing prediction stride from 5 to 30 degrades pixel‑level models (MSE rises to 0.62) while implicit LAMs remain stable, confirming that latent spaces encode continuous motion trajectories.
Cross‑attention heatmaps reveal that V‑JEPA 2 and DINOv3 focus on hand‑object interaction regions, whereas Embodied models display diffuse attention and pixel‑level models are distracted by lighting changes.
Hyperparameter Ablation
Adjusting codebook size, sequence length, latent dimension, and learning rate yields consistent gains. Notably:
Freezing a generic visual encoder as the backbone dramatically improves LAM performance over pixel‑reconstruction baselines.
Increasing sequence length and latent dimension within reasonable bounds enhances feature expressiveness.
Codebook capacity shows diminishing returns; expanding from 64 to 256 reduces utilization to 89.5% and slightly harms performance (a utilization sketch follows this list).
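Codebook utilization in this kind of ablation is typically the fraction of vector‑quantization codes that are ever selected on an evaluation set. The helper below is a hedged sketch of that bookkeeping, not the benchmark's reported procedure; the function name is hypothetical.

```python
# Illustrative utilization check for a vector-quantized latent action model:
# the share of codebook entries actually selected when encoding an eval set.
import torch

def codebook_utilization(code_indices: torch.Tensor, codebook_size: int) -> float:
    """code_indices: LongTensor of VQ code ids collected over an eval set."""
    used = torch.unique(code_indices).numel()
    return used / codebook_size

# For example, if only 229 of 256 entries are ever hit, utilization is
# 229 / 256 ≈ 0.895 -- one way a figure like the 89.5% cited above could arise.
```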
From Perception to Action
The experiments demonstrate that robust action priors can emerge from massive unlabeled internet videos. Future Vision‑Language‑Action (VLA) models should first learn such priors at scale and then align them with low‑level control, rather than building action spaces from scarce robot‑annotated data.
This paradigm promises to turn the data‑scale advantage of the visual world into actionable intelligence, guiding embodied AI toward its own “GPT moment.”
Resources:
Paper: https://huggingface.co/papers/2604.11689
GitHub: https://github.com/meituan-longcat/LARYBench
Project page: https://meituan-longcat.github.io/LARYBench/
HuggingFace dataset: https://huggingface.co/datasets/meituan-longcat/LARYBench
ModelScope: https://modelscope.cn/datasets/meituan-longcat/LARYBench