From Moonwalks to Cyber Cities: How WBench Maps the Limits of World Models
WBench, the first systematic multi-turn benchmark for interactive video world models, evaluates 20 cutting-edge models on navigation, primary actions, event editing, and view-switching. It finds that no single model excels at every task, that navigation ability is largely independent of video quality, and that average NavScore falls by 33 points between round 1 and round 4 of continuous interaction.
WBench benchmark
WBench is a systematic multi‑turn benchmark for interactive video world models. It consists of four components: World Definition, Instruction Set, Unified Interaction Interface, and Evaluation Suite.
Dataset: 289 test cases and 1,058 interaction rounds, covering navigation, primary actions, event editing, and view-switching.
Metrics
NavScore – measures navigation accuracy.
Gated Spatial Consistency – measures consistency across frames.
Additional dimensions: video quality (Qual), setting adherence, interaction adherence, physical realism.
Automatic scores correlate with human judgments (Spearman ρ ≥ 0.94 over 400 annotators).
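The agreement figure above is a Spearman rank correlation between automatic scores and human ratings. The sketch below shows how such a coefficient can be computed with the standard library only. It is not WBench's evaluation code: the function names and the sample numbers are made up for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Illustrative values only (not taken from WBench): one automatic score
# and one mean human rating per model.
auto_scores = [62.0, 48.5, 71.2, 30.7, 55.0]
human_ratings = [3.9, 3.1, 4.4, 2.0, 3.0]
print(round(spearman_rho(auto_scores, human_ratings), 3))
```

Because the coefficient depends only on rank order, it rewards a metric for ordering models the way annotators do, regardless of the scale either side uses.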
Evaluation results
No universal model: text‑driven models excel at scene understanding; dedicated world models excel at interactive control.
Navigation ability is largely independent of video quality.
All models suffer a steep drop in navigation performance during continuous interaction; average NavScore decreases by 33 points from round 1 to round 4.
Open‑source models can outperform closed‑source ones on specific abilities; HY‑World 1.5 leads in navigation.
For precise intent understanding, Kling 3.0 and Wan 2.7 rank highest; for smooth camera control, HY‑World 1.5 and Genie 3 are superior.
Consistency is best handled by LingBot‑World; physical realism and causal reasoning are strongest in Wan 2.7.
View-switching is the hardest interaction type, with an average score of 30.7.
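The multi-turn degradation reported above is a difference between per-round averages. A minimal sketch of that arithmetic follows, assuming results are stored as `(model, round_index, nav_score)` tuples. That record layout, the function names, and every number in the example are hypothetical and do not come from the WBench release.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_round(records):
    """Average the score over all models for each interaction round."""
    by_round = defaultdict(list)
    for _model, round_index, score in records:
        by_round[round_index].append(score)
    return {r: mean(scores) for r, scores in sorted(by_round.items())}


def drop_between(round_means, first, last):
    """Points lost between two rounds (positive means performance fell)."""
    return round_means[first] - round_means[last]


# Placeholder scores for two made-up models over four rounds.
records = [
    ("model_a", 1, 80.0), ("model_a", 2, 70.0),
    ("model_a", 3, 60.0), ("model_a", 4, 50.0),
    ("model_b", 1, 60.0), ("model_b", 2, 50.0),
    ("model_b", 3, 40.0), ("model_b", 4, 30.0),
]

round_means = mean_score_by_round(records)
print(round_means)
print(drop_between(round_means, first=1, last=4))
```

Averaging within each round before taking the difference keeps the comparison fair when some models have more test cases than others in a given round.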
Correlation analysis
The correlation matrix shows navigation scores have near‑zero correlation with other dimensions (Qual, Consistency, etc.), indicating navigation relies on a separate spatial‑state representation.
World‑setting impact
First‑person viewpoint (z = +1.0) simplifies navigation but makes setting consistency harder; animal subjects (z = ‑1.9) present the greatest difficulty.
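The summary does not say how these z values are computed. One common reading, taken here purely as an assumption, is a standard score: each setting group's mean score expressed in standard deviations from the mean across all groups. The sketch below illustrates that reading with made-up group names and numbers.

```python
from statistics import mean, pstdev


def z_scores(group_means):
    """Standardize each group's mean against the spread across all groups.

    Assumption: population standard deviation over the group means.
    The WBench paper may define its z values differently.
    """
    values = list(group_means.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all groups have the same mean; z is undefined")
    return {name: (value - mu) / sigma for name, value in group_means.items()}


# Hypothetical mean scores per world setting (not WBench data).
setting_means = {"setting_a": 40.0, "setting_b": 50.0, "setting_c": 60.0}
for name, z in z_scores(setting_means).items():
    print(f"{name}: z = {z:+.2f}")
```

Under this reading, a positive z marks a setting where models score above the cross-setting average and a negative z marks one where they score below it.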
Resources
Paper: https://huggingface.co/papers/2605.25874
GitHub: https://github.com/meituan-longcat/WBench
Homepage: https://meituan-longcat.github.io/WBench/
Dataset: https://huggingface.co/datasets/meituan-longcat/WBench
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
