Do Video Generation Models Really Reason? A 303‑Question Benchmark Exposes Their Reasoning Gaps
The paper introduces the Reasoning Coherence metric and the MME‑CoF‑Pro benchmark, which comprises 303 image‑text‑video samples across 16 reasoning categories, and uses them to evaluate seven leading video generation models. It finds that reasoning ability is largely independent of visual quality, that textual prompts often induce hallucinations, and that the new Reasoning Score aligns well with human judgments.
