Beyond Visual Realism: WorldArena Benchmark Reveals the Capability Gap in Embodied World Models
WorldArena introduces a unified benchmark that evaluates generated videos not only for visual fidelity but also for embodied task functionality across six dimensions, exposing a stark gap between visual realism and practical usefulness and providing a composite EWMScore to compare models.
Six core evaluation dimensions
WorldArena defines a six‑dimensional metric suite to assess generated video quality beyond pixel‑level realism.
Visual quality: measures image clarity, aesthetic score, and JEPA representation similarity to evaluate whether frames follow the real data distribution.
Motion quality: uses optical-flow continuity, motion-intensity analysis, and smoothness to check temporal coherence and the physical plausibility of object trajectories.
Content consistency: tracks foreground and background stability over time, detecting the structural drift, identity swaps, and background discontinuities that undermine long-sequence tasks.
Physical compliance: evaluates whether interactions between robotic arms and objects obey basic dynamics, ensuring that motion is not only visually plausible but also physically correct.
3D accuracy: computes depth-estimation error and perspective consistency to verify that the model captures true 3-D geometry, a prerequisite for precise manipulation.
Controllability: tests whether the model follows semantic instructions and produces distinguishable outcomes under varying conditions.
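As a rough illustration, the six dimension scores can be folded into one composite number. The sketch below assumes normalized per-dimension scores and equal weights; the paper's actual EWMScore aggregation is not specified here, so both the weighting and the score values are hypothetical.

```python
# Hypothetical sketch: fold six per-dimension scores (assumed normalized to
# [0, 1]) into one composite number via a weighted mean. Equal weights are
# an assumption, not the paper's actual EWMScore formula.
def composite_score(dims, weights=None):
    """dims: mapping of dimension name -> score in [0, 1]."""
    names = sorted(dims)
    if weights is None:
        weights = {n: 1.0 / len(names) for n in names}  # equal weighting
    return sum(dims[n] * weights[n] for n in names)

# Illustrative scores only, not measured results.
scores = {
    "visual_quality": 0.91,
    "motion_quality": 0.84,
    "content_consistency": 0.78,
    "physical_compliance": 0.62,
    "3d_accuracy": 0.70,
    "controllability": 0.55,
}
print(round(composite_score(scores), 3))  # prints 0.733
```

A weighted mean keeps the composite interpretable: a model strong on visual quality but weak on controllability is pulled down proportionally rather than masked by a single high dimension.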
Functional evaluation of world models
WorldArena places world models into three downstream roles to measure practical utility.
Data‑generation engine: synthetic trajectories are used to train downstream policies (e.g., VLA). Experiments show that a few models yield modest performance gains, but synthetic data overall remains far inferior to real data, offering only limited, unreliable benefit for policy learning.
Strategy evaluator: world models simulate environments for policy assessment. The high-fidelity model CtrlWorld achieves a Pearson correlation of 0.986 with scores obtained in the real environment, while other models show weaker alignment that mirrors their visual-quality deficits.
Action planner: world models are integrated into a closed-loop control pipeline. Compared with a dedicated planner (Pi 0.5), world-model-based planners produce plausible short-term predictions but drift off course in long-horizon planning, resulting in noticeably lower task performance.
Visual realism vs. functional utility
Fourteen state‑of‑the‑art world models were benchmarked. The unified EWMScore aggregates the six dimensions into a single comparable number. EWMScore correlates strongly with human perceptual judgments, confirming its validity for visual assessment.
However, correlation with downstream embodied tasks is low: 0.600 for the data-generation role and 0.360 for the action-planning role. This shows that high visual fidelity does not guarantee functional usefulness for robots.
Paper title: WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models
ArXiv: http://arxiv.org/abs/2602.08971
Project homepage: http://world-arena.ai
Code repository: https://github.com/tsinghua-fib-lab/WorldArena