WBench: 20 Cutting‑Edge World Models Face a Comprehensive Interactive Benchmark

WBench, a new benchmark from Meituan LongCat and Fudan University, evaluates 20 state-of-the-art video and world-model systems across 289 test cases and 1,058 interaction rounds. It measures video quality, setting adherence, interaction fidelity, consistency, and physical compliance, and finds that no model yet excels in all five dimensions.

Machine Learning Algorithms & Natural Language Processing

May 29, 2026

From Generation to Interaction

Recent video models have progressed from generating short clips to simulating interactive worlds, but the challenge now is to keep the world consistent while continuously receiving user actions.

WBench Design

WBench is built around four components—world definition, instruction set, a unified interaction interface, and an evaluation suite—that answer the questions “what is the world”, “what does the user want to do”, “how to feed different models fairly”, and “how to quantify the results”.

Dataset Composition

The benchmark contains 289 test cases and 1,058 interaction rounds, covering first‑person and third‑person views, four interaction types (navigation, subject motion, event editing, view switching), and a wide variety of scenes, styles, subjects and camera angles.
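
To make the composition concrete, here is a minimal sketch of how one test case with its interaction rounds might be represented. The class and field names (WorldCase, InteractionRound, count_rounds) are hypothetical and chosen for illustration; the article does not describe WBench's actual data schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class View(Enum):
    FIRST_PERSON = "first_person"
    THIRD_PERSON = "third_person"


class InteractionType(Enum):
    # The four interaction types named in the article.
    NAVIGATION = "navigation"
    SUBJECT_MOTION = "subject_motion"
    EVENT_EDITING = "event_editing"
    VIEW_SWITCHING = "view_switching"


@dataclass
class InteractionRound:
    # Hypothetical record for a single user action within a test case.
    interaction_type: InteractionType
    instruction: str


@dataclass
class WorldCase:
    # Hypothetical record for one test case: a world plus its ordered rounds.
    case_id: str
    view: View
    world_description: str
    rounds: list[InteractionRound] = field(default_factory=list)


def count_rounds(cases):
    """Total number of interaction rounds across a collection of cases."""
    return sum(len(case.rounds) for case in cases)
```

At the reported scale, 1,058 rounds over 289 cases works out to roughly 3.7 rounds per case on average.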

Evaluation Protocol

Each navigation task is expressed in three aligned forms—text description, camera pose, and discrete action—so that models with different native interfaces can be compared on the same spatial control requirement. An adaptive reference‑trajectory mechanism scales the reference path to the model’s predicted motion, reducing bias from differing motion scales. Human validation with 400 crowd workers shows Spearman correlation ≥ 0.94 for all automatic scores, confirming the reliability of the metrics.
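
The idea behind the adaptive reference trajectory can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: it assumes 2D positions, equal numbers of samples in both trajectories, and a scale factor given by the ratio of path lengths. The function names are hypothetical.

```python
import math


def path_length(points):
    """Sum of straight-line distances between consecutive 2D points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def rescale_reference(reference, predicted):
    """Scale the reference path about its first point so that its total
    length matches the predicted path's length (an assumed scaling rule)."""
    ref_len = path_length(reference)
    if ref_len == 0:
        return list(reference)
    scale = path_length(predicted) / ref_len
    x0, y0 = reference[0]
    return [(x0 + scale * (x - x0), y0 + scale * (y - y0)) for x, y in reference]


def mean_position_error(reference, predicted):
    """Average point-wise distance after rescaling the reference.
    Assumes both trajectories have the same number of samples."""
    scaled = rescale_reference(reference, predicted)
    return sum(math.dist(r, p) for r, p in zip(scaled, predicted)) / len(predicted)
```

Under this sketch, a model that moves in the right direction at half the reference speed incurs no error, while a model that moves in the wrong direction is still penalized.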

Benchmark Results

Twenty cutting‑edge models (9 text‑driven, 5 camera‑control, 6 action‑conditioned) were evaluated. No model dominates all five dimensions (video quality, setting adherence, interaction adherence, consistency, physical compliance). The highest navigation score (87.5) belongs to HY‑World 1.5, while LingBot‑World achieves the top consistency score (89.9). Text‑driven models tend to excel at setting adherence and semantic interaction, whereas camera‑control models lead navigation but lag in view‑consistency. View‑switching remains the hardest semantic task, with an average score of only 30.7.

Multi‑Round Degradation

Performance deteriorates sharply with more interaction rounds. Navigation drops by 33 points from round 1 to round 4+, event editing by 13 points, and subject motion by 9 points. Errors accumulate because pose deviations from earlier rounds propagate, causing trajectory drift or direction errors.
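
A toy simulation illustrates why early pose deviations compound. It assumes a simple 2D agent that moves a fixed step per round and whose heading error, once introduced, persists into later rounds. The model and function names are illustrative assumptions and do not reproduce WBench's measurements.

```python
import math


def rollout(step_length, heading_errors_deg):
    """Positions after each round; each round's heading error is added to
    the running heading and therefore affects all later rounds."""
    x = y = 0.0
    heading = 0.0
    positions = []
    for err in heading_errors_deg:
        heading += math.radians(err)
        x += step_length * math.cos(heading)
        y += step_length * math.sin(heading)
        positions.append((x, y))
    return positions


def drift_per_round(step_length, heading_errors_deg):
    """Distance between the ideal (error-free) and actual position per round."""
    ideal = rollout(step_length, [0.0] * len(heading_errors_deg))
    actual = rollout(step_length, heading_errors_deg)
    return [math.dist(a, b) for a, b in zip(ideal, actual)]


# A single 5-degree heading error in round 1, with no new errors afterwards.
drift = drift_per_round(1.0, [5.0, 0.0, 0.0, 0.0])
```

Even though only the first round introduces an error, the positional drift keeps growing in every subsequent round, which mirrors the propagation effect described above.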

Correlation and Difficulty Analysis

Physical compliance correlates strongly with video quality (r = 0.84) but weakly and negatively with navigation (r = -0.15). First-person navigation is easier than third-person navigation, and animal subjects increase difficulty. WBench can therefore pinpoint which world settings cause specific failures.
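
For readers who want to run this kind of analysis on their own per-model scores, the sketch below computes a correlation coefficient between two score lists. It assumes r denotes the Pearson coefficient, which the article does not state, and the function name is hypothetical. Inputs should be per-model scores on two dimensions; no WBench data is included here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)
```

A value near 1 means models that score well on one dimension tend to score well on the other, while a value near 0, such as the reported -0.15, means the two dimensions are largely independent.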

Conclusion

WBench separates rendering, setting, interaction, memory and physical causality into quantifiable metrics, providing a diagnostic tool for research iteration and model selection. While video models now generate plausible worlds, maintaining a stable, controllable interactive environment remains an open challenge.
