LARYBench: An ImageNet‑Scale Benchmark Unlocks Embodied AI Generalization

Researchers introduce LARYBench, the first large‑scale benchmark for evaluating implicit action representations in embodied AI, providing over 1.2 million annotated video clips, a unified metric for motion semantics, and extensive experiments showing that general visual encoders outperform specialized robot models in action understanding and control.


LARYBench is a systematic benchmark that evaluates implicit action representations for embodied intelligence, aiming to bridge visual perception and physical control.

Three Real‑World Bottlenecks for Embodied Generalization

Data acquisition difficulty: Precise robot action annotations require costly tele‑operation, limiting scale, while large human video collections lack robot‑usable control signals.

Representation transfer difficulty: Traditional action data are tightly bound to specific hardware, making cross‑modal reuse challenging.

Lack of unified metrics: Without an independent yardstick for the quality of intermediate representations, most models remain confined to task‑specific fine‑tuning rather than benefiting from large‑scale unsupervised pre‑training.

Implicit action representations, learned from the spatio‑temporal evolution of video frames, are proposed as a key to overcoming these bottlenecks.

LARYBench Design and Dataset

LARYBench is not a single dataset but a comprehensive evaluation framework that quantifies the quality of implicit action representations across two core dimensions: physical execution (body actions) and high‑level semantic understanding (semantic actions).

The benchmark aggregates more than 1.2 million annotated video segments (over 1,000 hours), 620k image pairs, and 595k motion trajectories. Action categories are hierarchically divided into body actions, atomic semantic actions, and compound semantic actions, totaling 151 categories that cover basic interactions such as pick/place as well as long‑tail scenarios like shovel‑snow or float‑balloon.

Data are collected from 11 robot platforms, ranging from the Franka single‑arm robot to Agilex Cobot, Realman, and the semi‑humanoid G1, and include diverse environments from residential kitchens to industrial settings.

A fully automated multi‑granularity data engine performs video slicing, description matching, and feature normalization in a closed‑loop pipeline, supplemented by rigorous manual inspection.
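
To make the dataset organization concrete, here is an illustrative sketch of what a single annotation record could look like once the data engine has sliced a clip, matched its description, and normalized its features. All field names and values below are hypothetical assumptions for illustration, not the actual LARYBench schema.

```python
# Illustrative (hypothetical) shape of a single annotation record after the data
# engine has sliced a clip, matched its description, and normalized its features.
# Field names and values are assumptions, not the actual LARYBench schema.
example_clip = {
    "clip_id": "franka_kitchen_000123",
    "platform": "Franka",                        # one of the 11 robot platforms
    "environment": "residential_kitchen",
    "semantic_action": {
        "atomic": "pick",                        # atomic semantic action
        "compound": "pick_and_place_cup",        # compound semantic action
    },
    "body_action": {
        "dof": 7,                                # 7/12/16-DoF depending on embodiment
        "trajectory": [[0.42, -0.11, 0.30, 0.0, 0.0, 0.0, 1.0]],  # per-frame pose
    },
    "duration_s": 3.2,
}

# A loader would iterate such records, read the associated frames, and hand them
# to a frozen visual encoder to extract the implicit representation z.
```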

Evaluation Metrics

The benchmark decouples the implicit representation z from downstream control by using shallow probing heads; a minimal probing sketch follows the list below. Two independent evaluation dimensions are defined:

Semantic action classification: Measures how accurately z identifies action intent, covering both atomic and compound behaviors.

Body‑action regression: Measures how well z can reconstruct the absolute pose parameters (7/12/16‑DoF) of the end‑effector.
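
The probing protocol can be sketched as follows: a frozen visual encoder produces the implicit representation z for each clip, and two shallow heads are trained on top of it, a linear classifier over the 151 semantic action categories and a small MLP that regresses end‑effector pose. The dimensions, pooling, and training step below are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ProbingHeads(nn.Module):
    """Shallow probes on top of a frozen encoder's implicit representation z.

    z_dim and the MLP width are illustrative assumptions; 151 categories and the
    7/12/16-DoF pose targets come from the benchmark description above.
    """
    def __init__(self, z_dim: int = 1024, num_classes: int = 151, dof: int = 7):
        super().__init__()
        # Semantic action classification head (atomic + compound categories).
        self.cls_head = nn.Linear(z_dim, num_classes)
        # Body-action regression head (absolute end-effector pose parameters).
        self.reg_head = nn.Sequential(nn.Linear(z_dim, 256), nn.GELU(), nn.Linear(256, dof))

    def forward(self, z: torch.Tensor):
        return self.cls_head(z), self.reg_head(z)

@torch.no_grad()
def extract_z(encoder: nn.Module, frames: torch.Tensor) -> torch.Tensor:
    """Pool frozen-encoder features into one vector per clip (simple mean pooling)."""
    feats = encoder(frames)          # assumed shape: (batch, time, tokens, dim)
    return feats.mean(dim=(1, 2))    # -> (batch, dim)

def probe_step(encoder, probes, optimizer, frames, action_label, pose_target):
    """One training step: only the probe parameters receive gradients."""
    z = extract_z(encoder, frames)
    logits, pose_pred = probes(z)
    loss = nn.functional.cross_entropy(logits, action_label) \
         + nn.functional.mse_loss(pose_pred, pose_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the encoder is never updated, differences between models on these two probes can be attributed to the quality of z itself rather than to downstream fine‑tuning.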

Experimental Analysis

The authors evaluate four representative paradigms in embodied AI: specialized implicit action models (Embodied LAMs), generic semantic visual encoders, generic pixel‑level visual encoders, and General LAMs built on common backbones.

Macro performance: Without any explicit action supervision, generic visual encoders such as V‑JEPA 2 and DINOv3 achieve higher semantic‑action accuracy and lower control error than dedicated Embodied LAMs.

Regression error comparison: DINOv3 attains an average mean‑squared error (MSE) of 0.19, outperforming the pixel‑level VAE (Wan2.2) with MSE 0.30, indicating that implicit visual features align better with physical control than pixel‑wise generative models.

Long‑tail generalization: As action frequency decreases, the performance gap between strong and weak models widens, demonstrating that high‑quality visual pre‑training helps in data‑scarce scenarios.

Temporal stability (stride ablation): Increasing the prediction stride from 5 to 30 drives the MSE of the pixel‑level model FLUX.2‑dev up to 0.62, while implicit LAMs remain stable, confirming that implicit spaces encode continuous motion trajectories (see the evaluation sketch after this list).

Attention heatmaps: Cross‑attention visualizations show that V‑JEPA 2 and DINOv3 focus precisely on hand‑object interaction regions during pouring actions, whereas specialized embodied models exhibit diffuse attention and pixel models are distracted by lighting changes.
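
The regression and stride‑ablation numbers above boil down to a mean‑squared error between predicted and ground‑truth pose parameters, averaged per model and per stride. A minimal sketch of such an evaluation is shown below; `model_predict` is a hypothetical callable, and a shared pose normalization is assumed.

```python
import numpy as np

def pose_mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean-squared error over all pose dimensions; assumes pred and target share
    a (num_samples, dof) shape and the same normalization."""
    return float(np.mean((pred - target) ** 2))

def stride_ablation(model_predict, clips, strides=(5, 10, 20, 30)):
    """Illustrative stride sweep.

    `model_predict(clip, stride)` is a hypothetical callable that returns
    (predicted_pose, ground_truth_pose) arrays for predictions made `stride`
    frames ahead; the average MSE per stride is what the ablation compares.
    """
    results = {}
    for stride in strides:
        errors = [pose_mse(*model_predict(clip, stride)) for clip in clips]
        results[stride] = float(np.mean(errors))
    return results
```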

Hyper‑parameter Ablations

Key hyper‑parameters—codebook size, sequence length, latent dimension, and learning rate—are systematically varied. Findings include:

Vision backbone importance: Freezing a generic encoder and training a LAM on top yields significantly better performance than pixel‑reconstruction baselines.

Parameter balance: Within reasonable ranges, larger sequence lengths and latent dimensions improve feature expressiveness.

Codebook capacity limits: Increasing the codebook size from 64 to 256 reduces utilization to 89.5% and slightly degrades performance, indicating diminishing returns (a utilization sketch follows this list).

Data‑scale dependency: Observed performance fluctuations are tied to the current dataset size; larger data volumes are expected to further push the limits of implicit representations.
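
Codebook utilization in the ablation above refers to the fraction of codebook entries that the quantizer actually selects; when capacity grows faster than the data can populate it, utilization drops. A minimal computation sketch follows, assuming VQ‑style integer code assignments.

```python
import torch

def codebook_utilization(indices: torch.Tensor, codebook_size: int) -> float:
    """Fraction of codebook entries selected at least once.

    `indices` holds the nearest-entry index chosen for each token during
    VQ-style quantization; entries that are never selected lower the ratio.
    """
    used = torch.unique(indices).numel()
    return used / codebook_size

# Toy example: if a 256-entry codebook only ever maps tokens to 229 distinct
# entries, utilization is 229 / 256 ≈ 0.895, i.e. the ~89.5% reported above.
indices = torch.arange(229).repeat(50)   # synthetic assignments covering 229 entries
print(f"{codebook_utilization(indices, 256):.3f}")   # -> 0.895
```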

From Perception to Action

The results demonstrate that robust action priors can emerge from massive unlabeled internet videos. Future Vision‑Language‑Action (VLA) models should first learn stable action priors from such data and then align them with low‑level control, turning the scale advantage of visual data into actionable robot intelligence and paving the way toward a GPT‑like moment for embodied AI.

Resources:

Paper (arXiv 2604.11689): https://huggingface.co/papers/2604.11689

GitHub repository: https://github.com/meituan-longcat/LARYBench

Project homepage: https://meituan-longcat.github.io/LARYBench/

HuggingFace dataset: https://huggingface.co/datasets/meituan-longcat/LARYBench

ModelScope dataset: https://modelscope.cn/datasets/meituan-longcat/LARYBench

Tags: benchmark, embodied AI, Robotics, LARYBench, action representation, vision encoders
Written by

Machine Learning Algorithms & Natural Language Processing

Focused on frontier AI technologies, empowering AI researchers' progress.
