WBench: 20 Cutting‑Edge World Models Face a Comprehensive Interactive Benchmark
WBench, a new benchmark created by Meituan LongCat and Fudan University, evaluates 20 state‑of‑the‑art video and world‑model systems across 289 test cases and 1,058 interaction rounds. It measures five dimensions (video quality, setting adherence, interaction adherence, consistency, and physical compliance) and finds that no model yet excels in all five.
From Generation to Interaction
Recent video models have progressed from generating short clips to simulating interactive worlds, but the challenge now is to keep the world consistent while continuously receiving user actions.
WBench Design
WBench is built around four components, each answering one question: the world definition ("what is the world?"), the instruction set ("what does the user want to do?"), a unified interaction interface ("how can different models be fed inputs fairly?"), and an evaluation suite ("how are the results quantified?").
Dataset Composition
The benchmark contains 289 test cases and 1,058 interaction rounds, covering first‑person and third‑person views, four interaction types (navigation, subject motion, event editing, view switching), and a wide variety of scenes, styles, subjects and camera angles.
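As a rough illustration of how such a test case could be represented, the sketch below models one world with a sequence of interaction rounds. The class and field names (WorldTestCase, InteractionRound, world_description and so on) are assumptions made for this example, not the benchmark's actual schema; only the two views, the four interaction types, and the totals 289 and 1,058 come from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum


class View(Enum):
    FIRST_PERSON = "first_person"
    THIRD_PERSON = "third_person"


class InteractionType(Enum):
    NAVIGATION = "navigation"
    SUBJECT_MOTION = "subject_motion"
    EVENT_EDITING = "event_editing"
    VIEW_SWITCHING = "view_switching"


@dataclass
class InteractionRound:
    # One user action applied to the world (hypothetical structure).
    kind: InteractionType
    instruction: str


@dataclass
class WorldTestCase:
    # One world definition plus its ordered interaction rounds (hypothetical structure).
    case_id: str
    view: View
    world_description: str
    rounds: list[InteractionRound] = field(default_factory=list)


def mean_rounds_per_case(total_rounds: int, total_cases: int) -> float:
    """Average number of interaction rounds per test case."""
    return total_rounds / total_cases


# Illustrative case; the content is invented for the example.
example = WorldTestCase(
    case_id="demo-001",
    view=View.FIRST_PERSON,
    world_description="A narrow street at dusk",
    rounds=[
        InteractionRound(InteractionType.NAVIGATION, "walk forward"),
        InteractionRound(InteractionType.EVENT_EDITING, "it starts to rain"),
    ],
)

# Using the totals reported above: 1,058 rounds over 289 cases.
print(round(mean_rounds_per_case(1058, 289), 2))  # 3.66
```

With the reported totals, a test case averages a little under four rounds, so the multi-round behavior of a model matters for most of the benchmark.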
Evaluation Protocol
Each navigation task is expressed in three aligned forms—text description, camera pose, and discrete action—so that models with different native interfaces can be compared on the same spatial control requirement. An adaptive reference‑trajectory mechanism scales the reference path to the model’s predicted motion, reducing bias from differing motion scales. Human validation with 400 crowd workers shows Spearman correlation ≥ 0.94 for all automatic scores, confirming the reliability of the metrics.
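The summary does not say how the reference path is scaled, so the following is only a minimal sketch of one plausible reading: the reference trajectory is stretched or shrunk about its starting point so that its total path length matches the predicted one. The function names and the path-length rule are assumptions for illustration, not the WBench implementation.

```python
import math


def path_length(points):
    """Total length of a polyline given as a list of (x, y, z) tuples."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def scale_reference(reference, predicted):
    """Scale the reference path about its first point to the predicted path length.

    Assumption: matching total path length is the scaling rule.
    """
    ref_len = path_length(reference)
    if ref_len == 0:
        return list(reference)  # a stationary reference cannot be scaled
    s = path_length(predicted) / ref_len
    ox, oy, oz = reference[0]
    return [(ox + s * (x - ox), oy + s * (y - oy), oz + s * (z - oz))
            for x, y, z in reference]


def mean_position_error(reference, predicted):
    """Mean point-wise distance; assumes both paths have the same number of points."""
    return sum(math.dist(r, p) for r, p in zip(reference, predicted)) / len(reference)


# The model moves in the right direction but at half the reference scale.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]

raw_error = mean_position_error(reference, predicted)
scaled_error = mean_position_error(scale_reference(reference, predicted), predicted)
print(raw_error, scaled_error)  # 0.5 0.0
```

In this toy case the unscaled comparison penalizes the model for moving at a smaller scale, while the scaled comparison reports no error because the shape and direction of the path are correct.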
Benchmark Results
Twenty cutting‑edge models (9 text‑driven, 5 camera‑control, 6 action‑conditioned) were evaluated. No model dominates all five dimensions (video quality, setting adherence, interaction adherence, consistency, physical compliance). The highest navigation score (87.5) belongs to HY‑World 1.5, while LingBot‑World achieves the top consistency score (89.9). Text‑driven models tend to excel at setting adherence and semantic interaction, whereas camera‑control models lead navigation but lag in view‑consistency. View‑switching remains the hardest semantic task, with an average score of only 30.7.
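The "no model dominates" finding amounts to checking whether a single system leads every dimension. The sketch below shows that check on invented scores for three placeholder models; the numbers and model names are illustrative assumptions and are not WBench results.

```python
DIMENSIONS = (
    "video_quality",
    "setting_adherence",
    "interaction_adherence",
    "consistency",
    "physical_compliance",
)


def leaders_by_dimension(scores):
    """Map each dimension to the model with the highest score on it."""
    return {dim: max(scores, key=lambda m: scores[m][dim]) for dim in DIMENSIONS}


def dominant_model(scores):
    """Return the model that leads every dimension, or None if there is none."""
    leaders = set(leaders_by_dimension(scores).values())
    return leaders.pop() if len(leaders) == 1 else None


# Invented scores for placeholder models, for illustration only.
toy_scores = {
    "model_a": dict(zip(DIMENSIONS, (80, 85, 60, 70, 75))),
    "model_b": dict(zip(DIMENSIONS, (70, 65, 82, 88, 72))),
    "model_c": dict(zip(DIMENSIONS, (75, 70, 78, 74, 79))),
}

print(leaders_by_dimension(toy_scores))
print(dominant_model(toy_scores))  # None
```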
Multi‑Round Degradation
Performance deteriorates sharply with more interaction rounds. Navigation drops by 33 points from round 1 to round 4+, event editing by 13 points, and subject motion by 9 points. Errors accumulate because pose deviations from earlier rounds propagate, causing trajectory drift or direction errors.
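To make the accumulation argument concrete, here is a small toy simulation, assuming a model that is told to move straight ahead by one unit per round but adds a fixed heading error each round. The error model and the function name are assumptions for illustration; they are not taken from the benchmark.

```python
import math


def drift_per_round(heading_error_deg, rounds, step=1.0):
    """Distance between intended and actual position after each round.

    Intended motion: `step` units along the x-axis every round.
    Actual motion: the heading picks up `heading_error_deg` every round,
    so each round starts from the already-deviated pose of the previous one.
    """
    x = y = 0.0
    heading = 0.0
    drifts = []
    for k in range(1, rounds + 1):
        heading += math.radians(heading_error_deg)
        x += step * math.cos(heading)
        y += step * math.sin(heading)
        drifts.append(math.hypot(x - k * step, y))
    return drifts


for k, d in enumerate(drift_per_round(5.0, 4), start=1):
    print(f"round {k}: drift = {d:.3f}")
```

Because each round inherits the pose error of the previous one, the drift grows faster than linearly in this toy model, which is consistent with navigation being the interaction type that degrades most across rounds.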
Correlation and Difficulty Analysis
Physical compliance correlates strongly with video quality (r = 0.84) but weakly and negatively with navigation (r = -0.15). First‑person navigation is easier; animal subjects increase difficulty. By breaking scores down along these factors, WBench pinpoints which world settings cause specific failures.
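For readers who want to reproduce this kind of analysis on their own per-model scores, the r values above are presumably Pearson correlation coefficients (the summary does not name the coefficient). A minimal sketch of the computation follows; the input lists are made-up placeholders, not WBench data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder per-model scores on two dimensions, for illustration only.
dimension_a = [60.0, 70.0, 75.0, 82.0, 90.0]
dimension_b = [55.0, 68.0, 71.0, 80.0, 93.0]

print(round(pearson_r(dimension_a, dimension_b), 3))
```

A value near 1 means models that score well on one dimension tend to score well on the other, while a value near 0, such as the reported -0.15, means one score says little about the other.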
Conclusion
WBench separates rendering, setting, interaction, memory and physical causality into quantifiable metrics, providing a diagnostic tool for research iteration and model selection. While video models now generate plausible worlds, maintaining a stable, controllable interactive environment remains an open challenge.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
