VBench-2.0: A Next‑Generation Benchmark for Intrinsic Faithfulness in AI Video Generation
VBench-2.0 expands the original VBench suite with a new set of fine‑grained evaluation dimensions, grouped into five broad categories—Human Fidelity, Controllability, Creativity, Physics, and Commonsense—to evaluate not only the visual quality of generated videos but also their intrinsic faithfulness to physical laws, common sense, and narrative coherence, providing open‑source tools, prompts, and human‑aligned metrics for the research community.
In the past year, AI video generation has progressed rapidly, highlighted by the release of Sora in early 2024. While many closed‑source models (e.g., Kling, Gen, Pika) deliver impressive visual fidelity, open‑source projects such as HunyuanVideo and Wanx also rank highly on the original VBench leaderboard, demonstrating the community's potential to drive innovation.
From Superficial to Intrinsic Faithfulness
The first version of VBench focused on Superficial Faithfulness—frame‑level clarity, smooth transitions, and basic text‑video alignment. This metric answers the question "does the video look realistic?" and provides a unified scale for current models.
VBench‑2.0, jointly released by Nanyang Technological University's S‑Lab and Shanghai AI Lab, adds a new layer of evaluation called Intrinsic Faithfulness. It measures a model's understanding of world‑model aspects such as physics, common‑sense reasoning, human anatomy, and scene composition, which are essential for applications like AI‑assisted filmmaking and complex simulation.
Key Evaluation Dimensions
Human Fidelity: assesses whether human motions are anatomically plausible (e.g., gymnastics routines).
Controllability: checks if the model follows precise user instructions such as camera moves or character placement.
Creativity: evaluates imagination in scene composition and story extension.
Physics: verifies realistic handling of gravity, buoyancy, collisions, etc.
Commonsense: tests logical consistency in everyday scenarios (e.g., food actually entering the mouth).
VBench‑2.0 provides a large collection of fine‑grained test cases and automated scoring pipelines for each dimension. Human annotations were collected at scale to align automatic scores with human perception.
Methodology and Human Alignment
For every dimension, the authors compute the Pearson correlation between the automatic metric and human ratings, demonstrating high alignment across all categories. The correlation plots (horizontal axis: human scores, vertical axis: VBench‑2.0 scores) show that the benchmark reliably mirrors human judgment.
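As an illustration of this alignment check, the Pearson correlation between an automatic metric and human ratings takes only a few lines to compute. The per-video scores below are made-up placeholders, not VBench‑2.0 results:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-video scores: human ratings vs. an automatic metric.
human_scores = [0.62, 0.71, 0.55, 0.80, 0.45]
auto_scores  = [0.60, 0.75, 0.50, 0.82, 0.48]
print(f"Pearson r = {pearson(human_scores, auto_scores):.3f}")
```

A correlation near 1.0 on held-out videos is what "human-aligned" means in practice: the automatic metric ranks videos roughly the way annotators do.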
Empirical Findings
Radar charts (normalized to 0.3–0.8) compare open‑source and closed‑source models on VBench‑2.0. No clear dominance of closed‑source models emerges; many community projects perform comparably on intrinsic dimensions, indicating that progress does not rely solely on proprietary resources.
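For readers reproducing such charts, normalizing scores into a band like 0.3–0.8 is typically a min-max rescale. The exact scheme the authors use is not detailed here, so this is a sketch with invented raw scores:

```python
def rescale(scores, lo=0.3, hi=0.8):
    """Min-max rescale raw dimension scores into [lo, hi] for radar plotting."""
    smin, smax = min(scores), max(scores)
    if smax == smin:
        return [lo] * len(scores)  # degenerate case: all scores identical
    return [lo + (s - smin) * (hi - lo) / (smax - smin) for s in scores]

# Hypothetical raw scores for four dimensions.
raw = [0.12, 0.55, 0.91, 0.40]
print(rescale(raw))
```

Clamping the radar axes away from 0 and 1 keeps small per-model differences visible instead of collapsing all polygons toward the center or the rim.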
Specific model recommendations:
For highly creative, out‑of‑the‑box content: Sora.
For human‑centric motion and fine‑grained camera control: Kling 1.6 or HunyuanVideo.
For strict text‑to‑video adherence and physics compliance: CogVideoX‑1.5.
Current limitations include poor handling of simple position or attribute changes (likely due to insufficient caption‑style training data) and the inability to generate story‑level videos spanning tens of seconds to minutes, as most models are capped at 5–10 seconds per clip.
Prompt Refiner Trade‑offs
Using a prompt‑refiner improves visual quality and text alignment but can suppress diversity and creativity. Researchers are encouraged to toggle the refiner depending on whether quality or creative variance is the priority.
Resources
Paper: VBench‑2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness (https://arxiv.org/abs/2503.21755)
Code repository: https://github.com/Vchitect/VBench
Project page: https://github.com/Vchitect/VBench-2.0-project
Prompt list: https://github.com/Vchitect/VBench/tree/master/VBench-2.0/prompts
The authors invite researchers and developers to adopt both VBench‑1.0 (surface fidelity) and VBench‑2.0 (intrinsic fidelity) for a comprehensive assessment of video generation models, and to contribute to the open‑source ecosystem to push the field toward truly realistic, world‑aware video synthesis.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.