Do Video Generation Models Really Reason? A 303‑Question Benchmark Exposes Their Reasoning Gaps

This article introduces MME‑CoF‑Pro, a benchmark of 303 carefully crafted video‑reasoning samples across 16 categories, and uses it to evaluate seven leading video generation models. The results suggest that current models lack true reasoning ability, that prompting can both help and hurt coherence, and that the new Reasoning Score aligns well with human judgments.

Data Party THU

Jun 30, 2026

Reasoning Coherence Definition

Reasoning Coherence (also called Reasoning Consistency) is defined as the ability of a generated video to keep its events evolving causally and believably across frames.

Reasoning Score (RS)

RS is a process‑level metric. For each sample, a chain of human‑verified key reasoning steps is annotated. RS equals the proportion of steps correctly completed, judged automatically by a separate model (Gemini‑2.5‑Flash) that evaluates each step independently. RS therefore localizes failures within the reasoning chain.
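The definition above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: the `StepJudgment` type and both function names are hypothetical, and the per-step verdicts are passed in as booleans rather than obtained from the Gemini‑2.5‑Flash judge. The function returns a fraction in [0, 1]; multiply by 100 for a percentage-style score.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StepJudgment:
    """One annotated reasoning step and the judge's verdict (hypothetical type)."""
    step: str
    passed: bool


def reasoning_score(judgments: List[StepJudgment]) -> float:
    """Fraction of key reasoning steps judged as correctly completed."""
    if not judgments:
        raise ValueError("a sample needs at least one annotated step")
    return sum(j.passed for j in judgments) / len(judgments)


def first_failure(judgments: List[StepJudgment]) -> Optional[int]:
    """Index of the first failed step, or None if every step passed."""
    for index, judgment in enumerate(judgments):
        if not judgment.passed:
            return index
    return None


# Made-up example chain, for illustration only.
chain = [
    StepJudgment("locate the start cell", True),
    StepJudgment("move right along the path", True),
    StepJudgment("avoid the hole", False),
    StepJudgment("reach the goal", True),
]
score = reasoning_score(chain)
failed_at = first_failure(chain)
```

Because each step is judged independently, a late step can pass even when an earlier one fails, which is why the sketch reports the failure position separately from the score.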

MME‑CoF‑Pro Benchmark

MME‑CoF‑Pro extends the earlier MME‑CoF (arXiv:2510.26802) from 12 to 16 reasoning categories, providing 303 image‑text‑video samples and 370 images. The 16 categories are grouped into four ability sets: Perceptual (visual detail, rotation, object counting), Spatial & Structural (trajectories, 2D/3D geometry), Physical & Causal (physics, 4D dynamics, natural science), and Task‑oriented (embodied manipulation, GUI interaction, medical imaging, charts, text/code, visual logic). Each sample undergoes three rounds of expert verification.

Each sample is evaluated under three hint settings:

No Hint: the model must infer solely from the task instruction.

Text Hint: the instruction is supplemented with textual descriptions of critical reasoning steps.

Visual Hint: for the eight most perceptually demanding categories, the input image is annotated with bounding boxes, arrows, or trajectories to guide the model.

Because only the hint varies while all other instructions remain identical, performance differences can be causally attributed to the hint type.
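One way to picture the three settings is as a function that changes exactly one input at a time. The sketch below is an assumption about how such a harness could look; the `Sample` fields, the setting names, and `build_inputs` are hypothetical and are not taken from the MME‑CoF‑Pro repository.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical sample layout; field names are illustrative."""
    instruction: str
    image: str                              # path to the original input image
    text_hint: str                          # description of critical reasoning steps
    annotated_image: Optional[str] = None   # present only where visual hints exist


def build_inputs(sample: Sample, setting: str) -> Optional[Tuple[str, str]]:
    """Return (prompt, image) for a setting, or None if the setting does not apply."""
    if setting == "no_hint":
        return sample.instruction, sample.image
    if setting == "text_hint":
        # Only the prompt changes; the image is untouched.
        return f"{sample.instruction} {sample.text_hint}", sample.image
    if setting == "visual_hint":
        if sample.annotated_image is None:
            return None  # category has no visual-hint variant
        # Only the image changes; the instruction is untouched.
        return sample.instruction, sample.annotated_image
    raise ValueError(f"unknown hint setting: {setting}")


demo = Sample(
    instruction="Move the agent to the goal.",
    image="maze.png",
    text_hint="Go right twice, then down.",
)
```

Keeping the instruction and image fixed outside the varied input is what makes a score difference between two settings attributable to the hint.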

Evaluation Experiments

Seven state‑of‑the‑art video generation models were evaluated: Veo‑3.1, Veo‑3.1‑fast, Sora‑2, Seedance‑1.0‑pro, Seedance‑1.0‑fast, Kling‑v2.1, and Cosmos‑Predict2‑14B. Each model was run under the three hint settings and scored with RS and a consistency score (CS) that measures frame‑wise coherence.

Finding 1 – Reasoning ability is weak and decoupled from visual quality

The best RS achieved by any model is 56 (Veo‑3.1); Sora‑2 scores 50. Models with high visual quality can still have low RS: Kling‑v2.1, for example, attains an average visual quality of 65.1 but an RS of only 13.8, indicating that visual fidelity does not imply reasoning competence.

Finding 2 – Text hints improve RS but often degrade consistency

Adding Text Hints raises RS for most models (e.g., Veo‑3.1 +4.5, Sora‑2 +7.6, Cosmos +6.7). However, CS drops for all seven models, especially on 4D dynamics tasks, where the decline ranges from 1.2 to 15.6 points. Models tend to follow the text literally without deeper understanding, sometimes generating spurious objects to satisfy motion instructions.

Finding 3 – Visual hints help spatial/structural tasks but hurt fine‑grained perception

Visual Hints improve performance on spatial/structural categories (Embodied, GUI) but reduce scores on fine‑grained perception tasks such as visual detail and object counting (e.g., Veo‑3.1: RS −13.0, CS −14.4). Models frequently render the visual hint itself as content (e.g., arrows become objects), which suggests a training‑data bias in which annotated cues are associated with target objects.

Case Study: Scaling Prompts on Frozen Lake

A scaling experiment with Sora‑2 on the Frozen Lake task adds hints incrementally. Both Text and Visual Hints raise RS above the no‑hint baseline (0.23), but the RS curves fluctuate heavily and show no monotonic increase. This demonstrates that simply stacking more prompts does not guarantee stable reasoning improvement.
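The "above baseline but not monotonic" pattern can be checked mechanically. The helpers below are a small sketch with hypothetical names, and the curve is synthetic: its values are invented for illustration and are not results from the paper.

```python
from typing import List


def is_non_decreasing(curve: List[float]) -> bool:
    """True if the score never drops as more hints are added."""
    return all(later >= earlier for earlier, later in zip(curve, curve[1:]))


def count_drops(curve: List[float]) -> int:
    """Number of steps at which adding a hint lowered the score."""
    return sum(1 for earlier, later in zip(curve, curve[1:]) if later < earlier)


def all_above_baseline(curve: List[float]) -> bool:
    """True if every hinted score exceeds the first (no-hint) score."""
    return all(value > curve[0] for value in curve[1:])


# Synthetic curve: index 0 is the no-hint score, later entries add one hint each.
synthetic_curve = [0.2, 0.5, 0.3, 0.6]
```

On this synthetic curve every hinted score beats the baseline, yet one step lowers the score, so the curve is not monotonic.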

Human Study: Reliability of Reasoning Score

Ten annotators scored randomly sampled videos against the annotated RS steps. RS reached a Spearman correlation of 0.61 with the human scores, far higher than Instruction Alignment (0.17) and Pass@5 last‑frame correctness, which correlated negatively (‑0.41). The result supports RS as an effective, model‑agnostic indicator of reasoning consistency.
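For readers unfamiliar with the statistic, Spearman correlation is the Pearson correlation of the ranks of two score lists. The following is a dependency-free sketch with tied values given their average rank; the function names are illustrative, and the study's own analysis code may differ.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(metric: Sequence[float], human: Sequence[float]) -> float:
    """Spearman rank correlation between a metric and human scores."""
    if len(metric) != len(human) or len(metric) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(metric), average_ranks(human))
```

A rank-based measure fits this comparison because it only asks whether the metric orders videos the way annotators do, not whether the two scales match numerically.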

Conclusion

The systematic evaluation shows that current video generation models mainly follow prompts without truly understanding world dynamics. Advancing visual alignment, instruction comprehension, and hallucination mitigation remains essential for building genuine world‑model reasoning in video generation.

Resources

Paper: https://arxiv.org/abs/2603.20194v1

Project homepage: https://video-reasoning-coherence.github.io/

Dataset on HuggingFace: https://huggingface.co/datasets/yqi19/mme-cof-pro

GitHub repository: https://github.com/yqi19/MME-CoF-Pro

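Source: 机器之心

The original article is about 2,500 characters, with a suggested reading time of 5 minutes.

Video models cannot reason, and they are even worse at understanding prompts.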

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
