Do Video Generation Models Really Reason? A 303‑Question Benchmark Exposes Their Reasoning Gaps
The article introduces the MME‑CoF‑Pro benchmark, which uses 303 carefully crafted video‑reasoning samples across 16 categories to evaluate seven leading video generation models. The results indicate that current models lack true reasoning ability, that hints in the prompt can raise reasoning scores while hurting coherence, and that the new Reasoning Score aligns well with human judgments.
Reasoning Coherence Definition
Reasoning Coherence (Reasoning Consistency) is defined as the ability of generated video events to maintain causal and believable evolution across frames.
Reasoning Score (RS)
RS is a process‑level metric. For each sample, a chain of human‑verified key reasoning steps is annotated. RS equals the proportion of steps correctly completed, judged automatically by a separate model (Gemini‑2.5‑Flash) that evaluates each step independently. RS therefore localizes failures within the reasoning chain.
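The RS definition above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual implementation: the `StepJudgment` type and the helper names are hypothetical, and the per-step verdicts are assumed to have already been produced by the judge model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepJudgment:
    """One annotated reasoning step plus the judge's verdict (hypothetical type)."""
    description: str
    completed: bool


def reasoning_score(judgments: List[StepJudgment]) -> float:
    """Proportion of annotated steps the judge marked as correctly completed."""
    if not judgments:
        raise ValueError("a sample needs at least one annotated step")
    return sum(j.completed for j in judgments) / len(judgments)


def first_failed_step(judgments: List[StepJudgment]) -> Optional[int]:
    """Index of the earliest failed step, or None if every step succeeded."""
    for index, judgment in enumerate(judgments):
        if not judgment.completed:
            return index
    return None


# Illustrative verdicts for a single made-up sample.
example = [
    StepJudgment("agent starts at the marked cell", True),
    StepJudgment("agent moves right without crossing a hole", True),
    StepJudgment("agent turns down at the corner", False),
    StepJudgment("agent reaches the goal cell", True),
]
print(reasoning_score(example))    # 0.75
print(first_failed_step(example))  # 2
```

Because every step is judged independently, the same list of verdicts yields both the score and the position where the chain first breaks.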
MME‑CoF‑Pro Benchmark
MME‑CoF‑Pro extends the earlier MME‑CoF (arXiv:2510.26802) from 12 to 16 reasoning categories, providing 303 image‑text‑video samples and 370 images. The 16 categories are grouped into four ability sets: Perceptual (visual detail, rotation, object counting), Spatial & Structural (trajectories, 2D/3D geometry), Physical & Causal (physics, 4D dynamics, natural science), and Task‑oriented (embodied manipulation, GUI interaction, medical imaging, charts, text/code, visual logic). Each sample undergoes three rounds of expert verification.
Each sample is evaluated under three hint settings:
No Hint: the model must infer solely from the task instruction.
Text Hint: the instruction is supplemented with textual descriptions of critical reasoning steps.
Visual Hint: for the eight most perceptually demanding categories, the input image is annotated with bounding boxes, arrows, or trajectories to guide the model.
Because only the hint varies while all other instructions remain identical, performance differences can be causally attributed to the hint type.
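One way to read this paired design is as a per-sample difference against the no-hint baseline. The sketch below assumes a hypothetical score layout (sample id mapped to a score per setting) and uses invented numbers purely for illustration. Samples without a given setting, such as those outside the eight Visual Hint categories, are skipped.

```python
from statistics import mean
from typing import Dict

# Hypothetical setting names; the benchmark's own identifiers may differ.
HINT_SETTINGS = ("no_hint", "text_hint", "visual_hint")


def paired_delta(
    scores: Dict[str, Dict[str, float]],
    setting: str,
    baseline: str = "no_hint",
) -> float:
    """Mean per-sample score change of `setting` relative to `baseline`.

    Only samples scored under both settings contribute, so the comparison
    stays paired even when a setting covers a subset of the samples.
    """
    diffs = [
        per_setting[setting] - per_setting[baseline]
        for per_setting in scores.values()
        if setting in per_setting and baseline in per_setting
    ]
    if not diffs:
        raise ValueError(f"no sample has both {setting!r} and {baseline!r}")
    return mean(diffs)


# Invented RS values on a 0-1 scale, for illustration only.
toy_scores = {
    "sample_1": {"no_hint": 0.25, "text_hint": 0.5, "visual_hint": 0.5},
    "sample_2": {"no_hint": 0.5, "text_hint": 0.75},  # no visual hint here
}
print(paired_delta(toy_scores, "text_hint"))    # 0.25
print(paired_delta(toy_scores, "visual_hint"))  # 0.25
```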
Evaluation Experiments
Seven state‑of‑the‑art video generation models were evaluated: Veo‑3.1, Veo‑3.1‑fast, Sora‑2, Seedance‑1.0‑pro, Seedance‑1.0‑fast, Kling‑v2.1, and Cosmos‑Predict2‑14B. Each model was run under the three hint settings and scored with RS and a consistency score (CS) that measures frame‑wise coherence.
Finding 1 – Reasoning ability is weak and decoupled from visual quality
The best RS achieved by any model is 56 (Veo‑3.1). Sora‑2 scores 50. Models with high visual quality can still have low RS; for example, Kling‑v2.1 attains an average visual quality of 65.1 but an RS of only 13.8, indicating that visual fidelity does not imply reasoning competence.
Finding 2 – Text hints improve RS but often degrade consistency
Adding Text Hints raises RS for most models (e.g., Veo‑3.1 +4.5, Sora‑2 +7.6, Cosmos +6.7). However, CS drops across all seven models, especially on 4D dynamics tasks, where CS declines by 1.2 to 15.6 points. Models tend to follow the literal text without deeper understanding, sometimes generating spurious objects to satisfy motion instructions.
Finding 3 – Visual hints help spatial/structural tasks but hurt fine‑grained perception
Visual Hints improve performance on spatial/structural categories (Embodied, GUI) but reduce scores on fine‑grained perception tasks such as visual detail and object counting (e.g., Veo‑3.1: RS down 13.0, CS down 14.4). Models frequently render the visual hint itself as content (e.g., arrows become objects), suggesting a training‑data bias where annotated cues are associated with target objects.
Case Study: Scaling Prompts on Frozen Lake
A scaling experiment with Sora‑2 on the Frozen Lake task adds hints incrementally. Both Text and Visual Hints raise RS above the no‑hint baseline (0.23), but the RS curves fluctuate heavily and show no monotonic increase. This demonstrates that simply stacking more prompts does not guarantee stable reasoning improvement.
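The "no monotonic increase" observation can be checked mechanically on any RS curve. The sketch below is an assumption about how one might do that: the helper names are hypothetical, and apart from the 0.23 no-hint baseline the curve values are invented for illustration, not taken from the paper.

```python
from typing import Sequence


def is_monotonic_non_decreasing(values: Sequence[float]) -> bool:
    """True if the score never drops as more hints are added."""
    return all(later >= earlier for earlier, later in zip(values, values[1:]))


def largest_drop(values: Sequence[float]) -> float:
    """Largest decrease between consecutive points (0.0 if there is none)."""
    drops = [earlier - later for earlier, later in zip(values, values[1:])]
    return max([0.0] + drops)


# Index 0 is the no-hint baseline from the text; the rest is made up.
toy_curve = [0.23, 0.5, 0.25, 0.75]
print(is_monotonic_non_decreasing(toy_curve))  # False
print(largest_drop(toy_curve))                 # 0.25
```

A curve can end above its baseline and still fail this check, which is the pattern the case study describes.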
Human Study: Reliability of Reasoning Score
Ten annotators scored randomly sampled videos according to the RS steps. RS achieved a Spearman correlation of 0.61 with human scores, far higher than Instruction Alignment (0.17); Pass@5 last‑frame correctness was negatively correlated (‑0.41). The result supports RS as an effective, model‑agnostic indicator of reasoning consistency.
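For readers who want to reproduce this kind of agreement check on their own annotations, Spearman correlation is the Pearson correlation of the ranks. The following is a dependency-free sketch with average ranks for ties; the function names are illustrative and the example scores are invented, not the study's data.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation between a metric and human scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores for four videos, for illustration only.
metric_scores = [0.2, 0.4, 0.6, 0.8]
human_scores = [1.0, 2.0, 3.0, 4.0]
print(spearman(metric_scores, human_scores))  # 1.0
```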
Conclusion
The systematic evaluation shows that current video generation models mainly follow prompts without truly understanding world dynamics. Advancing visual alignment, instruction comprehension, and hallucination mitigation remains essential for building genuine world‑model reasoning in video generation.
Resources
Paper: https://arxiv.org/abs/2603.20194v1
Project homepage: https://video-reasoning-coherence.github.io/
Dataset on HuggingFace: https://huggingface.co/datasets/yqi19/mme-cof-pro
GitHub repository: https://github.com/yqi19/MME-CoF-Pro
Source: 机器之心 (Jiqizhixin)
Video models do not understand reasoning, and they understand prompts even less.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.