Beyond Visual Realism: WorldArena Benchmark Reveals the Capability Gap in Embodied World Models
WorldArena introduces a unified benchmark that evaluates generated videos not only for visual fidelity but also for embodied task functionality across six dimensions, exposing a stark gap between visual realism and practical usefulness and providing a composite EWMScore to compare models.
Six core evaluation dimensions
WorldArena defines a six‑dimensional metric suite to assess generated video quality beyond pixel‑level realism.
Visual quality: measures image clarity, aesthetic score, and JEPA representation similarity to evaluate whether frames follow the real data distribution.
Motion quality: uses optical-flow continuity, motion-intensity analysis, and smoothness to check temporal coherence and the physical plausibility of object trajectories.
Content consistency: tracks foreground and background stability over time, detecting the structural drift, identity swaps, and background discontinuities that undermine long-sequence tasks.
Physical compliance: evaluates whether interactions between robotic arms and objects obey basic dynamics, ensuring that motion is not only visually plausible but also physically correct.
3D accuracy: computes depth-estimation error and perspective consistency to verify that the model captures true 3-D geometry, a prerequisite for precise manipulation.
Controllability: tests whether the model follows semantic instructions and produces distinguishable outcomes under varying conditions.
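As a rough illustration, the six dimension scores can be folded into one composite number. The sketch below assumes normalized per-dimension scores and equal weights; the paper's actual EWMScore aggregation is not specified here, so both the weighting and the score values are hypothetical.

```python
# Hypothetical sketch: fold six per-dimension scores (assumed normalized to
# [0, 1]) into one composite number via a weighted mean. Equal weights are
# an assumption, not the paper's actual EWMScore formula.
def composite_score(dims, weights=None):
    """dims: mapping of dimension name -> score in [0, 1]."""
    names = sorted(dims)
    if weights is None:
        weights = {n: 1.0 / len(names) for n in names}  # equal weighting
    return sum(dims[n] * weights[n] for n in names)

# Illustrative scores only, not measured results.
scores = {
    "visual_quality": 0.91,
    "motion_quality": 0.84,
    "content_consistency": 0.78,
    "physical_compliance": 0.62,
    "3d_accuracy": 0.70,
    "controllability": 0.55,
}
print(round(composite_score(scores), 3))  # prints 0.733
```

A weighted mean keeps the composite interpretable: a model strong on visual quality but weak on controllability is pulled down proportionally rather than masked by a single high dimension.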
Functional evaluation of world models
WorldArena places world models into three downstream roles to measure practical utility.
Data‑generation engine: synthetic trajectories are used to train downstream policies (e.g., VLA). Experiments show that a few models yield modest performance gains, but synthetic data overall remains far inferior to real data, offering only limited, unreliable benefit for policy learning.
Strategy evaluator: world models simulate environments for policy assessment. The high-fidelity model CtrlWorld achieves a Pearson correlation of 0.986 with scores obtained in the real environment, while other models show weaker alignment that mirrors their visual-quality deficits.
Action planner: world models are integrated into a closed-loop control pipeline. Compared with a dedicated planner (Pi 0.5), world-model-based planners produce plausible short-term predictions but drift off course in long-horizon planning, resulting in noticeably lower task performance.
Visual realism vs. functional utility
Fourteen state‑of‑the‑art world models were benchmarked. The unified EWMScore aggregates the six dimensions into a single comparable number. EWMScore correlates strongly with human perceptual judgments, confirming its validity for visual assessment.
However, correlation with downstream embodied tasks is low: 0.600 for the data-generation role and 0.360 for the action-planning role. This shows that high visual fidelity does not guarantee functional usefulness for robots.
Paper title: WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models
ArXiv: http://arxiv.org/abs/2602.08971
Project homepage: http://world-arena.ai
Code repository: https://github.com/tsinghua-fib-lab/WorldArena