LARYBench: The ImageNet‑Scale Benchmark Bridging Vision and Action for Embodied AI
LARYBench, the first large‑scale benchmark for embodied intelligence, quantifies implicit action representations across 1.2 million video clips, evaluates vision‑only and robot‑specific models, and reveals how general visual encoders can close the vision‑action modality gap.
Introduction
LARYBench is a systematic benchmark that quantifies the quality of implicit action representations learned from massive online video data. By providing a two‑dimensional evaluation, it decouples visual representation learning from downstream control policies.
Three Real‑World Bottlenecks for Embodied Intelligence
Data acquisition difficulty: Precise robot action annotations require costly tele‑operation, limiting scale, while human video libraries lack robot‑usable control signals.
Representation transferability: Action data are tightly bound to specific hardware, making cross‑platform feature reuse difficult.
Lack of unified metrics: Without an independent measure of intermediate representations, most models remain confined to task‑specific fine‑tuning.
Implicit action representations learned from spatio‑temporal video dynamics are proposed as the key to overcoming these challenges.
Metric Design
LARYBench defines a two‑dimensional evaluation:
Semantic action classification – measures how well the latent vector z captures intent, covering atomic and composite actions.
Physical pose regression – measures how accurately z predicts end‑effector pose parameters (7/12/16‑DoF); a minimal probing sketch for both metrics follows below.
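Concretely, both metrics can be probed with lightweight heads reading out a frozen encoder's latent embedding z. The sketch below is a minimal illustration, not the benchmark's exact probe: the latent dimension, the 151‑way classification head, and the 7‑DoF pose head are assumed values, and the encoder is treated as an opaque, frozen module.

```python
# Minimal probing sketch (illustrative only): two shallow heads read out a
# frozen latent action embedding z. Dimensions are assumptions, not the
# benchmark's published configuration.
import torch
import torch.nn as nn

class ActionProbe(nn.Module):
    def __init__(self, latent_dim: int = 1024, num_actions: int = 151, pose_dof: int = 7):
        super().__init__()
        self.cls_head = nn.Linear(latent_dim, num_actions)   # semantic action classification
        self.pose_head = nn.Linear(latent_dim, pose_dof)     # end-effector pose regression

    def forward(self, z: torch.Tensor):
        # z: (batch, latent_dim), produced by a frozen video encoder
        return self.cls_head(z), self.pose_head(z)

probe = ActionProbe()
z = torch.randn(8, 1024)           # stand-in for encoder output
logits, pose = probe(z)            # (8, 151) class logits, (8, 7) pose parameters
```

Because only the shallow probe is trained, differences in classification accuracy and pose error can be attributed to the quality of z rather than to the capacity of the downstream head.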
Dataset Composition
The benchmark aggregates over 1.2 million annotated video clips (more than 1,000 hours), 620k image‑action pairs, and 595k motion trajectories. Action categories are split into three hierarchical levels – body‑level, atomic semantic, and composite semantic – totaling 151 distinct actions, ranging from basic pick/place to long‑tail tasks such as shovel‑snow and float‑balloon.
Data span 11 robot platforms (e.g., Franka, Agilex Cobot, Realman, G1) and include diverse environments from residential kitchens to industrial scenes.
Data Pipeline
LARYBench employs a fully automated multi‑granularity engine. Video slicing, description matching, and feature normalization are handled by an algorithmic loop, dramatically improving processing efficiency for heterogeneous data.
The system introduces a Motion‑Guided Sampler (MGSampler) that computes inter‑frame motion intensity to ensure extracted sequences contain sufficient physical dynamics.
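The exact MGSampler implementation is not reproduced here; the sketch below illustrates the underlying idea under a simplifying assumption: inter‑frame motion intensity is approximated by mean absolute pixel differences, and frames are drawn at evenly spaced quantiles of the cumulative motion curve so that high‑motion segments are sampled more densely. The function name and the pixel‑difference proxy are assumptions.

```python
# Sketch of motion-guided frame sampling. Motion intensity is approximated by
# mean absolute inter-frame pixel difference; the actual sampler may use
# optical flow or feature-space differences instead.
import numpy as np

def motion_guided_sample(frames: np.ndarray, num_samples: int) -> np.ndarray:
    """frames: (T, H, W, C) video array; returns indices of sampled frames."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    diffs = diffs + 1e-6                          # guard against all-static segments
    cdf = np.cumsum(diffs) / diffs.sum()          # cumulative motion distribution
    # Place samples at evenly spaced quantiles of accumulated motion, so
    # high-motion segments contribute more frames than static ones.
    targets = (np.arange(num_samples) + 0.5) / num_samples
    idx = np.searchsorted(cdf, targets) + 1       # +1: np.diff shifts indices by one
    return np.clip(idx, 0, len(frames) - 1)
```

A plain uniform sampler would instead pick indices like np.linspace(0, T - 1, num_samples), which can miss short bursts of physical dynamics in otherwise static clips.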
Evaluation Protocol
Models first extract an implicit action embedding z. A shallow probing head then reads out the embedding to assess the two metrics above. The overall score aggregates semantic classification accuracy and pose regression error.
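The article does not spell out how the two numbers are combined, so the snippet below is only one plausible convention: map the regression error to a bounded score and average it with classification accuracy. The function name and the 1/(1+MSE) mapping are assumptions for illustration, not the benchmark's published formula.

```python
# Hypothetical aggregation of the two probing metrics into one overall score.
def overall_score(cls_accuracy: float, pose_mse: float, mse_scale: float = 1.0) -> float:
    """cls_accuracy in [0, 1]; pose_mse >= 0 on normalized pose targets."""
    pose_score = 1.0 / (1.0 + pose_mse / mse_scale)   # lower error -> higher score
    return 0.5 * (cls_accuracy + pose_score)

# With made-up numbers: overall_score(0.72, 0.19) ≈ 0.5 * (0.72 + 0.84) ≈ 0.78
```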
Experimental Analysis
Four representative paradigms are evaluated:
Embodied LAMs (explicitly designed for robotics)
Semantic‑level general visual encoders
Pixel‑level general visual encoders
General LAMs built on top of generic backbones
Key findings:
General visual encoders (e.g., V‑JEPA 2, DINOv3) outperform specialized Embodied LAMs on both semantic capture and low‑level control without any explicit action supervision.
In atomic vs. composite semantic classification, DINOv3 achieves an average MSE of 0.19 versus 0.30 for Wan2.2 VAE.
Stride ablation shows that increasing prediction stride from 5 to 30 degrades pixel‑level models (MSE rises to 0.62) while implicit LAMs remain stable, confirming that latent spaces encode continuous motion trajectories.
Cross‑attention heatmaps reveal that V‑JEPA 2 and DINOv3 focus on hand‑object interaction regions, whereas Embodied models display diffuse attention and pixel‑level models are distracted by lighting changes.
Hyperparameter Ablation
Adjusting codebook size, sequence length, latent dimension, and learning rate yields consistent gains. Notably:
Freezing a generic visual encoder as the backbone dramatically improves LAM performance over pixel‑reconstruction baselines.
Increasing sequence length and latent dimension within reasonable bounds enhances feature expressiveness.
Codebook capacity shows diminishing returns; expanding from 64 to 256 reduces utilization to 89.5% and slightly harms performance (a utilization sketch follows this list).
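Codebook utilization in this kind of ablation is typically the fraction of vector‑quantization codes that are ever selected on an evaluation set. The helper below is a hedged sketch of that bookkeeping, not the benchmark's reported procedure; the function name is hypothetical.

```python
# Illustrative utilization check for a vector-quantized latent action model:
# the share of codebook entries actually selected when encoding an eval set.
import torch

def codebook_utilization(code_indices: torch.Tensor, codebook_size: int) -> float:
    """code_indices: LongTensor of VQ code ids collected over an eval set."""
    used = torch.unique(code_indices).numel()
    return used / codebook_size

# For example, if only 229 of 256 entries are ever hit, utilization is
# 229 / 256 ≈ 0.895 -- one way a figure like the 89.5% cited above could arise.
```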
From Perception to Action
The experiments demonstrate that robust action priors can emerge from massive unlabeled internet videos. Future Vision‑Language‑Action (VLA) models should first learn such priors at scale and then align them with low‑level control, rather than building action spaces from scarce robot‑annotated data.
This paradigm promises to turn the data‑scale advantage of the visual world into actionable intelligence, guiding embodied AI toward its own “GPT moment.”
Resources:
Paper: https://huggingface.co/papers/2604.11689
GitHub: https://github.com/meituan-longcat/LARYBench
Project page: https://meituan-longcat.github.io/LARYBench/
HuggingFace dataset: https://huggingface.co/datasets/meituan-longcat/LARYBench
ModelScope: https://modelscope.cn/datasets/meituan-longcat/LARYBench