Can AI Really Understand Dynamic First‑Person Scenes? Inside the New EOC‑Bench
This article introduces EOC‑Bench, a pioneering benchmark that evaluates multimodal large language models on dynamic first‑person visual tasks across past, present, and future time dimensions. It walks through the benchmark's 3,277 questions, a novel multi‑scale temporal accuracy metric, extensive model comparisons, and a detailed error analysis that reveals current models' limitations in temporal perception and memory.
Introduction
The authors, a team at Alibaba DAMO Academy, present EOC‑Bench, the first benchmark focused on evaluating multimodal large language models (MLLMs) in dynamic, first‑person visual scenarios where temporal understanding is essential.
Why Existing Models Struggle
Current vision‑language models are trained on large static image‑text corpora and lack the ability to perceive and reason about dynamic worlds. Consequently, they fail to answer questions such as whether a dish has been cooked long enough or whether a pot is already turned off.
EOC‑Bench Overview
EOC‑Bench contains 3,277 question‑answer pairs covering 11 categories and four answer types (true/false, single‑choice, multiple‑choice, open‑ended). It provides a project homepage, code repository, and a HuggingFace dataset.
Temporal Question Dimensions
Past: Requires recalling historical states (e.g., "How long has the water been boiling?") and includes sub‑tasks such as object‑state retrospection, object‑location retrospection, object‑relationship evolution, and absolute time perception.
Present: Involves resisting visual deception and immediate state recognition, purpose inference, object‑relationship identification, and anomaly detection.
Future: Predicts dynamic risks, state changes, relationship evolution, and trajectory motion.
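To make the taxonomy above concrete, a benchmark item plausibly carries a dimension, a category, an answer type, and the question text. This is a minimal illustrative sketch; the field names are assumptions, not the dataset's actual schema.

```python
# Hedged sketch of what one EOC-Bench item might look like.
# All keys and values here are illustrative, not the real schema.
sample = {
    "video_id": "kitchen_0042",              # hypothetical clip identifier
    "dimension": "Past",                     # one of: Past, Present, Future
    "category": "Absolute Time Perception",  # one of the 11 categories
    "answer_type": "open-ended",             # true/false, single-choice,
                                             # multiple-choice, or open-ended
    "question": "How long has the water been boiling?",
    "answer": "about 3 minutes",
}

# A simple sanity check one might run over the full question set:
assert sample["dimension"] in {"Past", "Present", "Future"}
print(sample["category"])
```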
Evaluation Metrics
The benchmark introduces a Multi‑Scale Temporal Accuracy (MSTA) metric that measures temporal precision with adjustable thresholds, balancing strictness and flexibility.
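One plausible reading of MSTA is that a predicted duration counts as correct at a given scale if its relative error against the ground truth falls within that scale's threshold, and the final score averages accuracy across several thresholds. The sketch below follows that reading; the specific thresholds and the relative-error criterion are assumptions for illustration, not the paper's exact definition.

```python
# Hedged sketch of a Multi-Scale Temporal Accuracy (MSTA) style metric.
# Assumption: a prediction is correct at scale rho when
# |pred - gt| / gt <= rho; the thresholds below are illustrative.

def msta(preds, gts, thresholds=(0.1, 0.2, 0.3)):
    """Average accuracy over multiple relative-error thresholds."""
    def accuracy_at(rho):
        hits = sum(
            1 for p, g in zip(preds, gts)
            if g > 0 and abs(p - g) / g <= rho
        )
        return hits / len(gts)
    return sum(accuracy_at(rho) for rho in thresholds) / len(thresholds)

# Two of three predictions are within every threshold, so MSTA = 2/3.
score = msta([10.0, 31.0, 5.0], [10.0, 30.0, 8.0])
print(round(score, 4))  # → 0.6667
```

Averaging over thresholds is what balances strictness and flexibility: a tight threshold rewards precise answers, while looser ones still give partial credit to roughly correct ones.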
Model Evaluation
More than 20 open‑source and closed‑source models were tested, including GPT‑4o, GPT‑4o‑mini, Gemini‑2.0‑flash, Qwen series, InternVL series, VideoLLaMA series, LLaVA series, VideoRefer, Osprey, SPHINX‑V, and ViP‑LLaVA.
Key findings:
Most models perform poorly on object‑relationship evolution (ORE) and absolute time perception (ATP), often below random guessing.
Adding timestamps improves GPT‑4o and Gemini‑2.0‑flash performance dramatically, especially on past‑oriented tasks (+49.2% and +60.2%).
Larger models (e.g., 72B parameters) handle future prediction better than smaller ones.
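The timestamp finding above can be illustrated with a simple prompt-construction pattern: annotate each sampled frame with its capture time before handing the sequence to the model, so temporal questions can be grounded in explicit times rather than inferred frame order. The function and placeholder format here are hypothetical, not the paper's actual prompting code.

```python
# Hedged sketch of timestamp-augmented prompting. The "<frame i>" token
# stands in for an image input; names and formats are illustrative.

def build_prompt(frame_times, question):
    """Interleave per-frame timestamps with frame placeholders."""
    lines = [f"[t={t:.1f}s] <frame {i}>" for i, t in enumerate(frame_times)]
    lines.append(f"Question: {question}")
    return "\n".join(lines)

prompt = build_prompt([0.0, 2.5, 5.0], "How long has the water been boiling?")
print(prompt)
```

With explicit times in the input, a past-oriented question like duration recall reduces partly to arithmetic over the annotations, which is consistent with the large gains reported for GPT‑4o and Gemini‑2.0‑flash on past tasks.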
Error Analysis
Using GPT‑4o as a case study, errors are categorized into perception errors, memory errors, relationship‑reasoning errors, and knowledge errors.
Perception errors: Frame‑level visual confusion, counting mistakes, and intra‑frame interference.
Memory errors: Failure to recall previous frames, dominating 93% of past‑category mistakes.
Relationship‑reasoning errors: Difficulty inferring simple object relations.
Knowledge errors: Mistakes in commonsense, calculation, or factual reasoning.
Temporal Accuracy Distribution
Density analysis shows human answers cluster tightly with low error, while top models exhibit broader, more random distributions, indicating limited temporal perception.
Conclusion
EOC‑Bench provides a comprehensive evaluation of MLLMs’ object‑level cognition in dynamic, ego‑centric scenes across past, present, and future dimensions. The benchmark’s diverse question types and multi‑scale temporal metric expose significant gaps in current models, especially in memory and absolute time perception, and aim to drive future research toward more robust embodied AI.