Physion-Eval Reveals Why Visually Realistic AI Videos Still Miss Physical Reality
Physion-Eval, a new benchmark with nearly 11,000 expert‑annotated video clips, shows that most current AI‑generated videos look realistic but frequently violate basic physics, and that even top multimodal models fail to reliably detect these physical errors.
Recent advances in video‑generation models have dramatically improved visual fidelity, stability, and naturalness, yet judging a model solely by how "real" it looks ignores whether the generated scenes obey physical laws. The authors argue that visual realism is only half the story; true world‑modeling requires consistent physics.
To address this gap, the paper introduces Physion‑Eval, a benchmark designed to evaluate the physical realism of AI‑generated videos. It contains 10,990 expert‑reasoning trajectories covering 22 fine‑grained physical phenomena across both first‑person (derived from EPIC‑KITCHENS) and third‑person (derived from WISA‑80K) scenarios. Ninety STEM‑trained experts annotated the data under a double‑blind protocol with senior‑expert arbitration, recording an error timestamp, error category, and textual explanation for each sample.
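To make the annotation format concrete, here is a minimal sketch of what one such record might look like, assuming a simple flat schema. The field names, the agreement rule, and the 1‑second tolerance are illustrative assumptions, not the benchmark's actual data format or arbitration protocol.

```python
from dataclasses import dataclass


@dataclass
class PhysicsAnnotation:
    """One expert annotation for a generated clip (illustrative schema)."""
    clip_id: str         # identifier of the generated video clip
    viewpoint: str       # "first_person" (EPIC-KITCHENS) or "third_person" (WISA-80K)
    phenomenon: str      # one of the 22 fine-grained physical phenomena
    error_category: str  # e.g. "object_disappearance", "spurious_contact"
    start_s: float       # timestamp where the violation begins (seconds)
    end_s: float         # timestamp where the violation ends (seconds)
    explanation: str     # the expert's free-text reasoning about the error


def agrees(a: PhysicsAnnotation, b: PhysicsAnnotation, tol_s: float = 1.0) -> bool:
    """Check whether two independent (double-blind) annotations match closely
    enough to skip senior-expert arbitration. The category-plus-timestamp
    rule and the 1 s tolerance here are assumptions, not the paper's rule."""
    return a.error_category == b.error_category and abs(a.start_s - b.start_s) <= tol_s
```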
Analysis of the benchmark reveals that 83.3% of third‑person videos and 93.5% of first‑person videos contain at least one physical error that humans can clearly identify. Typical failures include missing or spurious contacts, objects disappearing or appearing out of thin air, broken temporal continuity, incorrect causal order, implausible material or state changes, and impossible geometric collisions.
Concrete examples illustrate these errors: a knife materializes on a desk, liquid defies gravity by flowing upward, water passes through a pot's bottom, and a pot is lifted by two fingers in an impossible grip. These mistakes go beyond low‑level rendering flaws; they directly violate principles of conservation, gravity, impenetrability, and stable contact.
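For reference, the failure modes described above can be collected into a single illustrative taxonomy. The identifiers below paraphrase the prose; they are not the paper's official category labels.

```python
from enum import Enum


class PhysicalError(Enum):
    """Illustrative taxonomy of the failure modes described above."""
    CONTACT = "missing or spurious contact"
    PERSISTENCE = "object disappears or appears out of thin air"
    CONTINUITY = "broken temporal continuity"
    CAUSALITY = "incorrect causal order"
    STATE_CHANGE = "implausible material or state change"
    PENETRATION = "impossible geometric collision"
```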
The authors also evaluate ten open‑source and proprietary multimodal large language model (MLLM) critics on Physion‑Eval. Even the strongest, Gemini 3.0 Pro, misses 74.4% of third‑person errors and 90.1% of first‑person errors, often mislocalizing errors in time and fabricating nonexistent causes. This gap indicates that current critics cannot yet replace human judgment in assessing physical consistency.
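To make the critic evaluation concrete, here is a minimal sketch of how a miss rate could be computed against the expert annotations. The matching rule (same category within a fixed time window) and the 2‑second window are assumptions, not the paper's exact scoring protocol.

```python
def critic_miss_rate(expert_errors: list[dict],
                     critic_detections: dict[str, list[dict]],
                     window_s: float = 2.0) -> float:
    """Fraction of expert-annotated errors the critic fails to flag.

    expert_errors: [{"clip_id", "category", "time_s"}, ...]
    critic_detections: clip_id -> [{"category", "time_s"}, ...]
    A detection counts as a hit only if its category matches and its
    timestamp falls within `window_s` of the expert's (an assumption).
    """
    missed = 0
    for err in expert_errors:
        candidates = critic_detections.get(err["clip_id"], [])
        hit = any(d["category"] == err["category"]
                  and abs(d["time_s"] - err["time_s"]) <= window_s
                  for d in candidates)
        missed += not hit
    return missed / len(expert_errors)


# Example: the critic flags the right category but too far from the
# expert's timestamp, so the error still counts as missed.
errors = [{"clip_id": "c1", "category": "penetration", "time_s": 3.2}]
detections = {"c1": [{"category": "penetration", "time_s": 9.0}]}
print(critic_miss_rate(errors, detections))  # 1.0
```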
Further analysis along the dimensions of physical intensity and dynamics shows that highly dynamic, high‑intensity scenes expose model weaknesses more sharply than static ones. Both video generators and MLLM critics perform better on obvious errors but still lag far behind human evaluators.
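A stratified view like the one the authors describe could be computed as below, assuming each expert‑annotated error carries intensity and dynamics labels plus a boolean indicating whether the critic caught it. The column names and bucket scheme are placeholders for whatever the released benchmark data provides.

```python
import pandas as pd


def stratified_detection_rate(df: pd.DataFrame) -> pd.Series:
    """Mean critic detection rate per (intensity, dynamics) bucket.

    Expects one row per expert-annotated error with columns
    "intensity", "dynamics", and a boolean "detected". Column names
    and bucketing are assumptions about the released data.
    """
    return df.groupby(["intensity", "dynamics"])["detected"].mean()
```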
In conclusion, Physion‑Eval demonstrates that while AI video generators are becoming visually convincing, they have not yet learned the underlying physics of the world. For applications such as world modeling, robotics, embodied AI, and simulation, researchers must shift focus from pure visual quality to solving core problems of object persistence, contact reasoning, state transitions, temporal consistency, and causal structure.