Do Video Generation Models Really Reason? A 303‑Question Benchmark Exposes Their Reasoning Gaps
The paper introduces the concept of Reasoning Coherence, the Reasoning Score metric, and the MME‑CoF‑Pro benchmark (303 image‑text‑video samples across 16 reasoning categories), and uses them to evaluate seven leading video generation models. It finds that reasoning ability is largely independent of visual quality, that textual hints often induce hallucinations, and that the new Reasoning Score aligns well with human judgments.
Background
Video generative models such as Sora, Veo, Kling and Seedance have demonstrated impressive visual fidelity, leading many to assume they have learned an implicit world model. The authors argue that a critical, previously ignored question is whether these models perform coherent frame‑by‑frame reasoning, i.e., maintain causal consistency across generated frames.
Reasoning Coherence
The authors formally define Reasoning Coherence as the ability of a generated video to preserve a causally consistent and believable evolution of events from frame to frame.
MME‑CoF‑Pro Benchmark
Building on their earlier MME‑CoF work (arXiv:2510.26802, CVPR 2026 Findings), the ECCV 2026‑accepted MME‑CoF‑Pro expands the benchmark to 16 reasoning categories organized into four capability groups (Perceptual, Spatial & Structural, Physical & Causal, Task‑oriented). It contains 303 carefully crafted image‑text‑video samples and 370 images, each vetted through three rounds of expert annotation.
Each sample is evaluated under three hint conditions:
No Hint: the model receives only the task instruction.
Text Hint: the instruction is supplemented with a textual description of the required reasoning steps.
Visual Hint: for the eight most perception‑demanding categories (MME‑CoF‑Pro‑mini), a visual cue (bounding box, arrow, trajectory) is added in addition to the text hint.
Because only the hint varies while all other inputs remain identical, any performance differences can be causally attributed to the hint itself.
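The paired design above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the `Result` record, the hint labels, the sample IDs, and the toy scores are all hypothetical, and RS is assumed to be stored as a fraction in [0, 1].

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Result:
    """One scored generation (hypothetical record layout)."""
    sample_id: str
    hint: str   # "no_hint", "text_hint" or "visual_hint" (assumed labels)
    rs: float   # Reasoning Score as a fraction in [0, 1]


def hint_effect(results, baseline="no_hint", treatment="text_hint"):
    """Mean per-sample RS change from `baseline` to `treatment`.

    Only samples scored under both conditions are compared, so the
    difference reflects the hint rather than a change in the sample mix.
    """
    by_key = {(r.sample_id, r.hint): r.rs for r in results}
    sample_ids = sorted({r.sample_id for r in results})
    deltas = [
        by_key[(sid, treatment)] - by_key[(sid, baseline)]
        for sid in sample_ids
        if (sid, baseline) in by_key and (sid, treatment) in by_key
    ]
    if not deltas:
        raise ValueError("no sample was scored under both conditions")
    return mean(deltas)


# Toy data, invented for illustration only.
results = [
    Result("maze_01", "no_hint", 0.25),
    Result("maze_01", "text_hint", 0.50),
    Result("count_07", "no_hint", 0.50),
    Result("count_07", "text_hint", 0.75),
    Result("count_07", "visual_hint", 0.25),
]

text_delta = hint_effect(results)                             # both samples
visual_delta = hint_effect(results, treatment="visual_hint")  # count_07 only
```

Pairing by sample ID is the design choice that matters here: a sample missing one of the two conditions is dropped from the comparison instead of shifting the average.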
Reasoning Score (RS)
Traditional video metrics assess only final‑frame quality. The authors propose a process‑level metric, the Reasoning Score, which is defined over a sequence of manually annotated checkpoints for each sample. RS is the proportion of checkpoints that the generated video gets right, as judged automatically by a separate discriminator model (Gemini‑2.5‑Flash). RS thus pinpoints where a model’s reasoning chain collapses and enables reliable cross‑model comparison.
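As a rough sketch of the arithmetic described above, RS for one sample is the number of checkpoints judged correct divided by the total number of checkpoints. The function names and the toy verdicts below are hypothetical, and the judge model is not modeled: the code assumes its per-checkpoint verdicts are already available as booleans.

```python
def reasoning_score(checkpoint_passed):
    """Fraction of checkpoints judged correct for one generated video.

    `checkpoint_passed` is a list of booleans in checkpoint order
    (assumed to come from an external judge, not computed here).
    """
    if not checkpoint_passed:
        raise ValueError("a sample needs at least one checkpoint")
    return sum(checkpoint_passed) / len(checkpoint_passed)


def first_failure(checkpoint_passed):
    """Index of the first failed checkpoint, or None if all passed."""
    for index, passed in enumerate(checkpoint_passed):
        if not passed:
            return index
    return None


# Toy verdicts for a four-checkpoint sample, invented for illustration.
verdicts = [True, True, False, False]
score = reasoning_score(verdicts)   # 2 of 4 checkpoints passed
break_at = first_failure(verdicts)  # the chain breaks at checkpoint index 2
```

Keeping the verdicts in checkpoint order is what makes the metric process‑level: the same list yields both the score and the position where the chain first breaks.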
Experimental Evaluation
The study evaluates seven state‑of‑the‑art closed‑source and open‑source models (Veo‑3.1, Veo‑3.1‑fast, Sora‑2, Seedance‑1.0‑pro, Seedance‑1.0‑fast, Kling‑v2.1, Cosmos‑Predict2‑14B) under all three hint settings.
Finding 1: Video models generally lack strong reasoning ability, and reasoning performance is almost completely decoupled from visual quality. The best model, Veo‑3.1, scores 56 RS points, and Sora‑2 scores 50. Kling, the model with the highest visual quality, achieves an average visual score of 65.1 but an RS of only 13.8, illustrating that high fidelity does not imply reasoning.
Finding 2: Textual hints act as a double‑edged sword. Most models improve RS with text hints (e.g., Veo‑3.1 +4.5, Sora‑2 +7.6, Cosmos +6.7), but their Consistency Score (CS) drops sharply, especially on 4D dynamics, where all seven models lose between 1.2 and 15.6 CS points. Models tend to follow instructions literally, sometimes generating spurious objects to satisfy motion commands.
Finding 3: Visual hints are not universally beneficial. They help on spatially guided tasks (Embodied, GUI) but degrade performance on fine‑grained perception tasks such as object counting (e.g., Veo‑3.1 RS −13.0, CS −14.4). Models often render the visual cue itself in the output (e.g., drawing arrows as objects), likely because of a training‑data bias in which highlighted arrows co‑occur with synthetic content.
Case Study: Scaling Prompts
Using Sora‑2 on a Frozen Lake task, the authors incrementally add prompts. While both text and visual hints raise RS above the no‑hint baseline (≈0.23), the performance curves fluctuate wildly with no clear upward trend, indicating that simply stacking prompts does not guarantee stable reasoning improvements.
Human Validation of Reasoning Score
Ten annotators scored randomly sampled videos according to the annotated checkpoints. RS achieved a Spearman correlation of 0.61 with human scores, substantially higher than Instruction Alignment (0.17), while Pass@5 last‑frame correctness correlated negatively with human scores (‑0.41). This supports RS as a reliable indicator of human‑perceived reasoning.
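For readers unfamiliar with the statistic, Spearman correlation is the Pearson correlation of the two rank vectors, so it measures whether a metric orders videos the same way humans do. The sketch below is a standard‑library illustration with invented scores; it is not the authors' analysis script, and the variable names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores for four videos, for illustration only.
metric_scores = [0.10, 0.40, 0.50, 0.90]
human_scores = [1, 2, 3, 4]
rho = spearman(metric_scores, human_scores)
```

Because only ranks matter, the metric and the human scores can live on different scales, which suits a comparison between a fractional score and annotator ratings.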
Conclusion
The systematic evaluation shows that current video generation models mainly follow prompts rather than truly understand and apply world dynamics. Advancing towards genuine world‑model reasoning will require stronger visual alignment, better instruction comprehension, and robust hallucination mitigation.
