Can World Models Truly Understand Interaction? Inside the Omni-WorldBench

Omni-WorldBench introduces a comprehensive benchmark that shifts world‑model evaluation from visual fidelity to interactive response, detailing its two‑part suite, metric design, extensive prompt taxonomy, and experimental results that reveal current models' strengths and limitations in causal and temporal reasoning.

Amap Tech

Background

World models have achieved high‑fidelity video generation, but existing benchmarks evaluate only visual realism and text‑video alignment. They do not measure whether actions cause plausible state changes, respect causal logic, or maintain coherent scene evolution over time.

Omni‑WorldBench

Omni‑WorldBench is a benchmark that evaluates the interactive response capability of world models. It consists of two components: the Omni‑WorldSuite prompt collection and the Omni‑Metric evaluation protocol.

Omni‑WorldSuite

The suite contains 1,068 prompts organized along two axes:

Scene coverage: general everyday scenes and task-driven scenarios such as autonomous driving, embodied robotics, and games.

Interaction levels:

Level 1 – the action affects only the acting entity.

Level 2 – one object directly influences another object.

Level 3 – the action triggers multi‑object or broader environmental changes.
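The two-axis organization can be sketched as a small data structure. The field names and example prompts below are illustrative, not taken from the benchmark's release:

```python
from dataclasses import dataclass
from enum import IntEnum

class InteractionLevel(IntEnum):
    SELF_ONLY = 1         # action affects only the acting entity
    OBJECT_TO_OBJECT = 2  # one object directly influences another
    ENVIRONMENTAL = 3     # multi-object or broader environmental changes

@dataclass
class Prompt:
    text: str
    scene: str  # e.g. "general", "driving", "robotics", "games"
    level: InteractionLevel

# Hypothetical examples, one per level.
prompts = [
    Prompt("A cyclist raises one arm to signal a turn",
           "general", InteractionLevel.SELF_ONLY),
    Prompt("A robot arm pushes a cup off the table edge",
           "robotics", InteractionLevel.OBJECT_TO_OBJECT),
    Prompt("A car brakes hard, and the traffic behind it slows down",
           "driving", InteractionLevel.ENVIRONMENTAL),
]

# Group prompts by interaction level for per-level scoring.
by_level = {lvl: [p for p in prompts if p.level == lvl]
            for lvl in InteractionLevel}
```

Grouping by level makes it easy to report a model's score per interaction level, which is how a hierarchical suite like this is typically analyzed.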

Prompts are generated by two complementary pipelines:

Dataset-grounded generation: First-frame images and camera trajectories are extracted from open datasets (e.g., DriveLM for autonomous driving, InternData-A1 for embodied robotics, Sekai for games). A vision-language model describes each sequence, and human annotators verify the alignment.

Concept-driven generation: Interaction prototypes (object, action, scene) are fed to large language and vision models such as ChatGPT-5.2, Gemini, and DeepSeek-R1. The models produce textual prompts and trajectory specifications, which are refined by humans. Corresponding first-frame images are synthesized and filtered for physical plausibility.
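As a rough sketch of the concept-driven pipeline's first step, (object, action, scene) prototypes can be expanded combinatorially into prompt candidates before LLM refinement. The prototype sets and template below are invented for illustration; the benchmark's actual prototypes are curated by the authors:

```python
import itertools

# Illustrative interaction prototypes (not from the benchmark).
objects = ["a glass of water", "a stack of books"]
actions = ["is knocked over", "is lifted"]
scenes = ["on a kitchen table", "on an office desk"]

def draft_prompt(obj: str, act: str, scene: str) -> str:
    """Compose a textual prompt candidate from one (object, action, scene) triple."""
    return f"{obj} {scene} {act}."

candidates = [draft_prompt(o, a, s)
              for o, a, s in itertools.product(objects, actions, scenes)]
# In the real pipeline, candidates like these would be expanded by an LLM
# into full prompts and trajectory specifications, refined by human
# annotators, and filtered for physical plausibility.
```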

The suite is further categorized by modality (text, image, trajectory) and by capability dimensions (physical principles, commonsense, causality, closed‑loop consistency, spatial constraints).

Omni‑Metric

Omni-Metric quantifies three dimensions of model performance, plus an adaptive aggregate score:

Generated Video Quality: resolution, temporal flickering, motion smoothness, content-text alignment, and dynamism.

Camera-Object Controllability: consistency of objects under camera motion and detection of unnatural transitions.

Interaction Effect Fidelity: whether the generated effect matches the intended action and obeys physical and causal constraints. This dimension is decomposed into four sub-metrics:

InterStab-L – long-term consistency of the interacted region.

InterStab-N – stability of non-target regions.

InterCov – coverage of object-level interaction effects.

InterOrder – correctness of event order and causal logic.
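The post does not give formulas for these sub-metrics. One plausible instantiation of a region-stability score in the spirit of InterStab-L / InterStab-N is mean frame-to-frame similarity inside a region mask (interacted region for InterStab-L, non-target background for InterStab-N); the formulation below is an assumption, not the benchmark's definition:

```python
import numpy as np

def region_stability(frames: np.ndarray, mask: np.ndarray) -> float:
    """Mean frame-to-frame similarity of a masked region, in [0, 1].

    frames: (T, H, W) grayscale video with values in [0, 1]
    mask:   (H, W) boolean mask selecting the region of interest
    """
    region = frames[:, mask]                 # (T, n_pixels) masked region
    diffs = np.abs(np.diff(region, axis=0))  # per-pixel change between frames
    return float(1.0 - diffs.mean())

# Toy example: a perfectly static region scores 1.0,
# while a region flickering between black and white scores 0.0.
T, H, W = 8, 4, 4
mask = np.ones((H, W), dtype=bool)
static = np.zeros((T, H, W))
flicker = np.zeros((T, H, W))
flicker[::2] = 1.0
```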

AgenticScore: an adaptive aggregation of the three dimensions. A multimodal large language model (MLLM) assigns weights based on prompt semantics rather than using a uniform average, producing a single scalar score.
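A minimal sketch of the aggregation step, assuming the MLLM's judgment reduces to a set of per-dimension weights. The dimension names and weight values here are hypothetical:

```python
def agentic_score(dim_scores: dict[str, float],
                  weights: dict[str, float]) -> float:
    """Aggregate per-dimension scores with prompt-dependent weights.

    In Omni-Metric the weights would come from an MLLM that reads the
    prompt; here they are supplied directly for illustration.
    """
    total = sum(weights.values())
    return sum(dim_scores[k] * weights[k] for k in dim_scores) / total

scores = {"quality": 0.82, "controllability": 0.74, "interaction": 0.61}

# A physics-heavy prompt might weight interaction fidelity most strongly...
physics_weights = {"quality": 0.2, "controllability": 0.2, "interaction": 0.6}
# ...while a camera-motion prompt emphasizes controllability instead.
camera_weights = {"quality": 0.2, "controllability": 0.6, "interaction": 0.2}
```

The same raw dimension scores yield different aggregate scores depending on what the prompt actually tests, which is the point of adaptive weighting over a uniform average.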

Experimental Protocol

We evaluated 18 representative world models covering Text‑to‑Video (T2V), Image‑to‑Video (I2V), and camera‑controlled settings. The evaluation used 410 text prompts, 120 trajectory‑based prompts, and 15 metric components.

Key results:

Cosmos achieved the highest overall AgenticScore, 75.92% (the second-best model scored 75.42%).

Among pure T2V models, HunyuanVideo scored 73.96 %.

For camera‑control models, HunyuanWorld (74.36 %) and WonderWorld (74.02 %) were top performers.

Across all models, traditional video quality metrics (resolution, flicker, motion smoothness) were strong, but performance dropped markedly on interaction fidelity, long‑term state evolution, and joint camera‑object control.

Significance

Omni‑WorldBench shifts evaluation from pure visual fidelity to interaction fidelity, providing a hierarchical prompt suite, a structured multi‑dimensional metric system, and an adaptive aggregation mechanism. The benchmark quantifies current limitations of world models and highlights the next research frontier: generating trustworthy, causally consistent world changes rather than merely aesthetically pleasing videos.

Omni‑WorldBench overview diagram
Tags: AI, benchmark, World Models, interaction evaluation, Omni-WorldBench
Written by Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.