Unveiling Omni-WorldBench: How 18 AI Video Models Stack Up on 4D Interaction Tests

The Omni-WorldBench framework introduces a comprehensive 4D evaluation suite with 1,068 test cases and three interaction levels, applying novel metrics to assess video quality, controllability, and physical interaction fidelity across 18 state‑of‑the‑art AI video models, revealing strengths, weaknesses, and future research directions.


Overview

Omni-WorldBench is a newly released open‑source benchmark that evaluates the physical realism and interactive capabilities of AI video generation models. It exposes the gap between visually impressive AI videos and their adherence to real‑world physics, providing a systematic way to measure 4D interaction performance.

Evaluation Perspective Shift

The core goal of a world model is to predict how an environment evolves under specific actions, supporting counterfactual reasoning, planning, and decision‑making. While video generation has advanced rapidly, existing benchmarks focus largely on visual fidelity (e.g., FID, FVD) and ignore temporal dynamics and interaction logic.

Omni-WorldBench Framework

The framework consists of two main components:

Omni-WorldSuite: a systematic prompt library covering multiple interaction levels and scenario types.

Omni-Metric: an agent‑based quantitative evaluation pipeline that measures interaction fidelity, video quality, and controllability, and fuses them into a comprehensive score with a multimodal model.
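Conceptually, evaluating one model therefore means rolling it out on every case in Omni-WorldSuite and passing each generated video through Omni-Metric. The loop below is only a minimal sketch of that flow under assumed interfaces; generate_video and the per-dimension metric callables are placeholders, not the benchmark's actual API.

```python
from typing import Callable, Dict, List

# Placeholder signatures; Omni-WorldBench's real interfaces are not reproduced here.
GenerateVideo = Callable[[dict], object]           # test case -> generated video
DimensionMetric = Callable[[object, dict], float]  # (video, test case) -> score in [0, 1]


def evaluate_model(
    test_cases: List[dict],
    generate_video: GenerateVideo,
    metrics: Dict[str, DimensionMetric],
) -> Dict[str, float]:
    """Run a model over the prompt suite and average each evaluation dimension."""
    totals = {name: 0.0 for name in metrics}
    for case in test_cases:
        video = generate_video(case)            # model rollout under the case's prompt/image
        for name, metric in metrics.items():    # e.g. video quality, controllability, fidelity
            totals[name] += metric(video, case)
    return {name: total / len(test_cases) for name, total in totals.items()}
```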

Test Suite Construction

Omni-WorldSuite contains 1,068 test cases organized into three interaction tiers:

Tier 1: actions affect only the initiating object (e.g., looking through a crystal ball).

Tier 2: actions cause local effects on another object (e.g., heating a metal rod in a fire).

Tier 3: actions produce global changes involving multiple objects (e.g., breaking spaghetti, rearranging a room).

The suite spans everyday physics, autonomous driving, embodied robotics, and game simulation. Test cases are generated via two complementary strategies:

Real‑data‑driven generation using DriveLM (autonomous driving), InternData‑A1 (embodied robot tasks), and Sekai (game environments). Prompts are refined with Qwen‑VL and then verified by humans.

Concept‑driven generation that builds a prototype concept library, samples attribute combinations, and uses ChatGPT‑5.2, Gemini, and DeepSeek‑R1 for text, trajectory, and image synthesis, followed by human review.

High‑quality images are produced with FLUX.1‑dev (three candidates per prompt) and filtered for physical plausibility, instruction compliance, and visual quality; some are further refined by Qwen‑Image. All selected images have a resolution of at least 1024 × 1024.
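To make the structure of the suite concrete, the snippet below sketches what a single test-case record might look like, together with the stated 1024 × 1024 resolution floor for reference images. The field names and the InteractionTier enum are illustrative assumptions, not the benchmark's published schema.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional

from PIL import Image  # pip install pillow


class InteractionTier(IntEnum):
    """The three interaction levels described by Omni-WorldSuite."""
    SELF_ONLY = 1      # action affects only the initiating object
    LOCAL_EFFECT = 2   # action causes a local effect on another object
    GLOBAL_CHANGE = 3  # action produces global, multi-object changes


@dataclass
class TestCase:
    """Hypothetical record for one Omni-WorldSuite prompt (field names are illustrative)."""
    case_id: str
    tier: InteractionTier
    domain: str                                # e.g. "everyday", "driving", "embodied", "game"
    text_prompt: str
    camera_trajectory: Optional[list] = None   # per-frame camera poses, if the case specifies them
    reference_image: Optional[str] = None      # path to the conditioning image (e.g. from FLUX.1-dev)


def meets_resolution_floor(image_path: str, floor: int = 1024) -> bool:
    """Check the stated requirement that reference images are at least 1024 x 1024."""
    with Image.open(image_path) as img:
        width, height = img.size
    return width >= floor and height >= floor
```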

Metric Design

Omni‑Metric defines three independent but complementary evaluation dimensions:

Video Quality: uses established objective measures (image fidelity, temporal flicker, motion smoothness, dynamic degree, content alignment) to assess visual continuity.

Camera & Object Controllability: quantifies scene coherence and object stability without external intervention. Camera metrics evaluate rotation/translation errors; object metrics turn consistency checking into a visual‑question‑answering task, avoiding synonym‑based misinterpretations.

Interaction Fidelity: the core dimension, composed of four quantitative sub‑metrics:

InterStab‑L: measures long‑term temporal coherence using SSIM and CLIP embeddings, with dynamic gating to penalize static videos.

InterStab‑N: evaluates the stability of non‑target regions via optical‑flow energy after masking out target objects.

InterCov: checks the causal realism of affected objects using a multimodal LLM as a semantic verifier.

InterOrder: verifies that event sequences follow the correct physical order.
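As an illustration of the InterStab‑N idea, the sketch below measures optical‑flow energy outside the target‑object masks and converts it into a stability score. The benchmark's exact flow estimator, normalization, and score mapping are not reproduced here, so Farneback flow and the final mapping are assumptions.

```python
import cv2          # pip install opencv-python
import numpy as np


def non_target_flow_energy(frames: list, target_masks: list) -> float:
    """Sketch of an InterStab-N-style check: optical-flow energy outside target-object masks.

    `frames` are BGR images (np.ndarray); `target_masks` are boolean arrays marking the
    interacting object(s) per frame. Lower residual motion in non-target regions is taken
    as higher background stability.
    """
    energies = []
    for i in range(len(frames) - 1):
        prev = cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY)
        nxt = cv2.cvtColor(frames[i + 1], cv2.COLOR_BGR2GRAY)
        # Dense optical flow between consecutive frames (Farneback is an assumed choice).
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude = np.linalg.norm(flow, axis=-1)
        background = ~(target_masks[i] | target_masks[i + 1])  # keep only non-target pixels
        if background.any():
            energies.append(float(magnitude[background].mean()))
    mean_energy = float(np.mean(energies)) if energies else 0.0
    # Lower background motion => higher stability; this mapping is illustrative only.
    return 1.0 / (1.0 + mean_energy)
```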

Scores from each dimension are aggregated using an agentic scoring mechanism called AgenticScore. Each metric acts as an independent agent; a multimodal LLM then dynamically weights video quality, controllability, and fidelity to produce a final comprehensive score.
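A minimal sketch of this aggregation pattern appears below: each dimension reports a score, and a weighting agent (standing in for the multimodal LLM) proposes weights before the weighted average is taken. The function names and the example weights are hypothetical, not the published AgenticScore implementation.

```python
from typing import Callable, Dict


def agentic_score(
    dimension_scores: Dict[str, float],
    propose_weights: Callable[[Dict[str, float]], Dict[str, float]],
) -> float:
    """Sketch of an AgenticScore-style aggregation (weighting logic is an assumption).

    Each metric acts as an independent agent reporting a score in [0, 1]; a multimodal LLM
    (abstracted here as `propose_weights`) dynamically assigns weights to the dimensions
    before they are combined into one comprehensive score.
    """
    weights = propose_weights(dimension_scores)
    total = sum(weights.values())
    return sum(weights[name] * dimension_scores[name] for name in dimension_scores) / total


# Usage with a fixed-weight stand-in for the LLM weighting agent:
scores = {"video_quality": 0.96, "controllability": 0.71, "interaction_fidelity": 0.58}
final = agentic_score(scores, lambda s: {"video_quality": 0.2,
                                         "controllability": 0.3,
                                         "interaction_fidelity": 0.5})
print(f"AgenticScore: {final:.3f}")
```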

Model Evaluation Results

Eighteen leading video models were benchmarked on an H20 compute cluster using each model's official parameters: Director3D, OpenSoraPlan, T2V‑Turbo, HunyuanVideo, Matrix‑Game2.0, Wan2.1/2.2, CogVideo, OpenSora, Cosmos, LargeVideoPlanner, HunyuanWorld, HunyuanGameCraft, ViewCrafter, Gen3c, Lingbot, FantasyWorld, and WonderWorld.

Key findings:

Models that accept rich image inputs (e.g., Wan2.2, Cosmos) achieved the highest overall AgenticScore (~75%).

Pure text‑to‑video models performed well on logical consistency (e.g., HunyuanVideo at 73.96%).

Camera‑controlled models such as HunyuanWorld and WonderWorld led in camera‑related metrics.

Most models excel in visual quality, with temporal‑flicker and motion‑smoothness scores often exceeding 95%.

Interaction fidelity exposed the biggest gaps: Wan2.2 topped InterStab‑L at 84.96%, but many models suffered dramatic drops on InterStab‑N, with WonderWorld falling to 24.89%.

Overall, current video models achieve high static scores but still struggle to maintain physical and causal consistency during complex interactions.

The benchmark demonstrates that building truly interactive 4D world models—capable of controlled evolution, causal reasoning, and flexible camera manipulation—remains an open research challenge.

Tags: benchmark, evaluation, Omni-WorldBench, 4D interaction
Written by SuanNi

A community for AI developers that aggregates large-model development services, models, and compute power.
