From Moonwalks to Cyber Cities: How WBench Maps the Limits of World Models

WBench, the first systematic multi‑turn benchmark for interactive video world models, evaluates 20 cutting‑edge models across navigation, actions, editing and view‑switching. It finds that no single model excels at every task, that navigation ability is largely independent of visual quality, and that multi‑turn interaction cuts the average navigation score by 33 points between round 1 and round 4.

Meituan Technology Team

WBench benchmark

WBench is a systematic multi‑turn benchmark for interactive video world models. It consists of four components: World Definition, Instruction Set, Unified Interaction Interface, and Evaluation Suite.

Dataset: 289 test cases and 1,058 interaction rounds (about 3.7 rounds per case on average), covering navigation, primary actions, event editing, and view‑switching.

Metrics

NavScore – measures navigation accuracy.

Gated Spatial Consistency – measures consistency across frames.

Additional dimensions: video quality (Qual), setting adherence, interaction adherence, physical realism.

Automatic scores correlate with human judgments (Spearman ρ ≥ 0.94 over 400 annotators).
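To make the agreement check concrete, here is a minimal sketch of Spearman's rank correlation between an automatic metric and human ratings. The score lists are invented for illustration and are not WBench data, and the helper names (average_ranks, pearson, spearman_rho) are hypothetical; the benchmark's own evaluation code may compute this differently.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(xs, ys):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Invented example values (assumption): one automatic score and one
# mean human rating per model.
auto_scores = [62.0, 48.5, 71.2, 55.0, 39.9]
human_scores = [3.9, 3.1, 4.4, 3.0, 2.5]

rho = spearman_rho(auto_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")
```

Because only ranks matter, the automatic metric and the human ratings can live on different scales; a value near 1 means both orderings of the models largely agree.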

Evaluation results

No universal model: text‑driven models excel at scene understanding; dedicated world models excel at interactive control.

Navigation ability is largely independent of video quality.

All models suffer a steep drop in navigation performance during continuous interaction; average NavScore decreases by 33 points from round 1 to round 4.
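The round‑to‑round drop is a simple difference of cross‑model averages. The sketch below shows that arithmetic on invented numbers for two placeholder models; the values, the model names, and the functions mean_per_round and round_drop are assumptions for illustration, not WBench results or tooling.

```python
from statistics import mean

# Invented per-round NavScore values for two placeholder models
# (assumption: not taken from the benchmark).
nav_by_round = {
    "model_a": [70.0, 55.0, 45.0, 38.0],
    "model_b": [60.0, 48.0, 36.0, 28.0],
}

def mean_per_round(scores):
    """Average each round's score across all models."""
    return [mean(round_scores) for round_scores in zip(*scores.values())]

def round_drop(scores, first_round=1, last_round=4):
    """Points lost between two 1-indexed rounds, using the cross-model mean."""
    per_round = mean_per_round(scores)
    return per_round[first_round - 1] - per_round[last_round - 1]

print(mean_per_round(nav_by_round))
print(round_drop(nav_by_round))
```

Reporting the drop on the cross‑model mean, rather than per model, hides how unevenly individual models degrade, so a per‑model breakdown is worth printing alongside it.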

Open‑source models can outperform closed‑source ones on specific abilities; HY‑World 1.5 leads in navigation.

For precise intent understanding, Kling 3.0 and Wan 2.7 rank highest; for smooth camera control, HY‑World 1.5 and Genie 3 are superior.

Consistency is best handled by LingBot‑World; physical realism and causal reasoning are strongest in Wan 2.7.

View‑switching is the hardest interaction type, with an average score of 30.7.

Correlation analysis

The correlation matrix shows that navigation scores have near‑zero correlation with the other dimensions (Qual, Consistency, etc.), which suggests that navigation relies on a separate spatial‑state representation.
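A correlation matrix of this kind can be sketched as pairwise Pearson correlations between per‑model scores on each dimension. The table below is invented so that the navigation column is uncorrelated with the others; it is not WBench data, and the function names are hypothetical. Whether the paper uses Pearson or a rank correlation for this matrix is not stated in this summary.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def correlation_matrix(columns):
    """Nested dict of pairwise correlations between named score columns."""
    return {
        a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns
    }

# Invented scores (assumption): one entry per model, one list per dimension.
scores = {
    "NavScore": [80, 40, 60, 20, 50],
    "Qual": [71, 68, 68, 71, 72],
    "Consistency": [62, 57, 56, 62, 63],
}

matrix = correlation_matrix(scores)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

In this toy table Qual and Consistency move together while NavScore does not track either, which is the pattern the text describes.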

World‑setting impact

First‑person viewpoint (z = +1.0) simplifies navigation but makes setting consistency harder; animal subjects (z = ‑1.9) present the greatest difficulty.
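This summary does not say how the z values are computed. One plausible reading, shown here purely as an assumption, is a standard z‑score: each world‑setting category's mean score is compared with the mean and standard deviation across all categories. The category names and numbers below are invented for illustration.

```python
from statistics import mean, pstdev

# Invented mean navigation scores per world-setting category
# (assumption: not WBench data).
category_means = {
    "first_person": 58.0,
    "third_person": 50.0,
    "human_subject": 52.0,
    "animal_subject": 40.0,
}

def z_scores(values):
    """Standardize each category mean against the spread of all categories."""
    data = list(values.values())
    mu = mean(data)
    sigma = pstdev(data)  # population standard deviation over the categories
    return {name: (value - mu) / sigma for name, value in values.items()}

z = z_scores(category_means)
for name, value in z.items():
    print(f"{name}: z = {value:+.2f}")
```

Under this reading, a positive z marks a setting that is easier than average for the measured dimension and a negative z marks a harder one.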

Resources

Paper: https://huggingface.co/papers/2605.25874

GitHub: https://github.com/meituan-longcat/WBench

Homepage: https://meituan-longcat.github.io/WBench/

Dataset: https://huggingface.co/datasets/meituan-longcat/WBench
