VBench-2.0: A Next‑Generation Benchmark for Intrinsic Faithfulness in AI Video Generation
VBench-2.0 expands the original VBench suite with a new set of fine‑grained evaluation dimensions, grouped into five broad categories—Human Fidelity, Controllability, Creativity, Physics, and Commonsense—to evaluate not only the visual quality of generated videos but also their intrinsic faithfulness to physical laws, common sense, and narrative coherence, providing open‑source tools, prompts, and human‑aligned metrics for the research community.
In the past year, AI video generation has progressed rapidly, highlighted by the release of Sora in early 2024. While many closed‑source models (e.g., Kling, Gen, Pika) deliver impressive visual fidelity, open‑source projects such as HunyuanVideo and Wanx also rank highly on the original VBench leaderboard, demonstrating the community's potential to drive innovation.
From Superficial to Intrinsic Faithfulness
The first version of VBench focused on Superficial Faithfulness—frame‑level clarity, smooth transitions, and basic text‑video alignment. This metric answers the question "does the video look realistic?" and provides a unified scale for current models.
VBench‑2.0, jointly released by Nanyang Technological University's S‑Lab and Shanghai AI Lab, adds a new layer of evaluation called Intrinsic Faithfulness. It measures a model's understanding of world‑model aspects such as physics, common‑sense reasoning, human anatomy, and scene composition, which are essential for applications like AI‑assisted filmmaking and complex simulation.
Key Evaluation Dimensions
Human Fidelity: assesses whether human motions are anatomically plausible (e.g., gymnastics routines).
Controllability: checks if the model follows precise user instructions such as camera moves or character placement.
Creativity: evaluates imagination in scene composition and story extension.
Physics: verifies realistic handling of gravity, buoyancy, collisions, etc.
Commonsense: tests logical consistency in everyday scenarios (e.g., food actually entering the mouth).
VBench‑2.0 provides a large collection of fine‑grained test cases and automated scoring pipelines for each dimension. Human annotations were collected at scale to align automatic scores with human perception.
Methodology and Human Alignment
For every dimension, the authors compute the Pearson correlation between the automatic metric and human ratings, demonstrating high alignment across all categories. The correlation plots (horizontal axis: human scores, vertical axis: VBench‑2.0 scores) show that the benchmark reliably mirrors human judgment.
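As an illustration of this alignment check, the Pearson correlation between an automatic metric and human ratings takes only a few lines to compute. The per-video scores below are made-up placeholders, not VBench‑2.0 results:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-video scores: human ratings vs. an automatic metric.
human_scores = [0.62, 0.71, 0.55, 0.80, 0.45]
auto_scores  = [0.60, 0.75, 0.50, 0.82, 0.48]
print(f"Pearson r = {pearson(human_scores, auto_scores):.3f}")
```

A correlation near 1.0 on held-out videos is what "human-aligned" means in practice: the automatic metric ranks videos roughly the way annotators do.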
Empirical Findings
Radar charts (normalized to 0.3–0.8) compare open‑source and closed‑source models on VBench‑2.0. No clear dominance of closed‑source models emerges; many community projects perform comparably on intrinsic dimensions, indicating that progress does not rely solely on proprietary resources.
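For readers reproducing such charts, normalizing scores into a band like 0.3–0.8 is typically a min-max rescale. The exact scheme the authors use is not detailed here, so this is a sketch with invented raw scores:

```python
def rescale(scores, lo=0.3, hi=0.8):
    """Min-max rescale raw dimension scores into [lo, hi] for radar plotting."""
    smin, smax = min(scores), max(scores)
    if smax == smin:
        return [lo] * len(scores)  # degenerate case: all scores identical
    return [lo + (s - smin) * (hi - lo) / (smax - smin) for s in scores]

# Hypothetical raw scores for four dimensions.
raw = [0.12, 0.55, 0.91, 0.40]
print(rescale(raw))
```

Clamping the radar axes away from 0 and 1 keeps small per-model differences visible instead of collapsing all polygons toward the center or the rim.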
Specific model recommendations:
For highly creative, out‑of‑the‑box content: Sora.
For human‑centric motion and fine‑grained camera control: Kling 1.6 or HunyuanVideo.
For strict text‑to‑video adherence and physics compliance: CogVideoX‑1.5.
Current limitations include poor handling of simple position or attribute changes (likely due to insufficient caption‑style training data) and the inability to generate story‑level videos spanning tens of seconds to minutes, as most models are capped at 5–10 seconds per clip.
Prompt Refiner Trade‑offs
Using a prompt‑refiner improves visual quality and text alignment but can suppress diversity and creativity. Researchers are encouraged to toggle the refiner depending on whether quality or creative variance is the priority.
Resources
Paper: VBench‑2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness (https://arxiv.org/abs/2503.21755)
Code repository: https://github.com/Vchitect/VBench
Project page: https://github.com/Vchitect/VBench-2.0-project
Prompt list: https://github.com/Vchitect/VBench/tree/master/VBench-2.0/prompts
The authors invite researchers and developers to adopt both VBench‑1.0 (surface fidelity) and VBench‑2.0 (intrinsic fidelity) for a comprehensive assessment of video generation models, and to contribute to the open‑source ecosystem to push the field toward truly realistic, world‑aware video synthesis.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.