Can AI Really Understand Dynamic First‑Person Scenes? Inside the New EOC‑Bench

The article introduces EOC‑Bench, a pioneering benchmark that evaluates multimodal large language models on dynamic first‑person visual tasks across past, present, and future time dimensions. It presents the benchmark's 3,277 questions, a novel multi‑scale temporal accuracy metric, extensive model comparisons, and a detailed error analysis revealing current models' limitations in temporal perception and memory.

AI Frontier Lectures

Introduction

The authors, led by an intern at Alibaba DAMO Academy, present EOC‑Bench, the first benchmark focused on evaluating multimodal large language models (MLLMs) in dynamic, first‑person visual scenarios where temporal understanding is essential.

Why Existing Models Struggle

Current vision‑language models are trained on large static image‑text corpora and therefore struggle to perceive and reason about a dynamic world. As a result, they fail at questions such as whether a dish has been cooked long enough or whether the heat under a pot has already been turned off.

EOC‑Bench Overview

EOC‑Bench contains 3,277 question‑answer pairs spanning 11 categories and four answer formats: true/false, single‑choice, multiple‑choice, and open‑ended. The authors also provide a project homepage, a code repository, and a Hugging Face dataset.
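To make the benchmark's structure concrete, here is a minimal sketch of what one item might look like as a record. The field names are illustrative assumptions, not the dataset's actual schema; only the four answer formats and the three temporal dimensions come from the article.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical layout for a single EOC-Bench item (field names assumed).
@dataclass
class EOCBenchItem:
    question: str
    answer_type: str          # "true_false" | "single_choice" | "multiple_choice" | "open_ended"
    category: str             # one of the benchmark's 11 task categories
    temporal_dim: str         # "past" | "present" | "future"
    options: Optional[List[str]] = None  # present for choice-based formats
    answer: str = ""

item = EOCBenchItem(
    question="How long has the water been boiling?",
    answer_type="open_ended",
    category="absolute_time_perception",
    temporal_dim="past",
)
```

A record like this maps naturally onto the benchmark's evaluation loop: route each item to a scorer by `answer_type`, then aggregate per category and per temporal dimension.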

EOC‑Bench illustration

Temporal Question Dimensions

Past: Requires recalling historical states (e.g., "How long has the water been boiling?") and includes sub‑tasks such as object‑state retrospection, object‑location retrospection, object‑relationship evolution, and absolute time perception.

Present: Tests immediate state recognition, purpose inference, object‑relationship identification, and anomaly detection, all while resisting visual deception.

Future: Covers prediction of dynamic risks, state changes, relationship evolution, and motion trajectories.

Temporal dimensions diagram

Evaluation Metrics

The benchmark introduces a Multi‑Scale Temporal Accuracy (MSTA) metric that measures temporal precision with adjustable thresholds, balancing strictness and flexibility.

MSTA formula
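The article does not reproduce the exact formula, but a metric with adjustable thresholds can be sketched as follows: score a time estimate at several relative‑error tolerances and average the hits. The specific scale values and the relative‑error form are assumptions for illustration, not the paper's definition.

```python
# Assumed-form sketch of a Multi-Scale Temporal Accuracy (MSTA) metric:
# a prediction counts as correct at scale s if its error is within s * ground truth,
# and the final score averages correctness across all scales.
def msta(pred_seconds: float, gt_seconds: float,
         scales=(0.1, 0.2, 0.3)) -> float:
    hits = [abs(pred_seconds - gt_seconds) <= s * gt_seconds for s in scales]
    return sum(hits) / len(scales)

# A perfect estimate passes every tolerance; a 20% error passes only the looser ones.
exact = msta(10.0, 10.0)    # 1.0
rough = msta(12.0, 10.0)    # 2/3: fails the 10% band, passes 20% and 30%
```

Averaging over multiple tolerances is what gives the metric its "strictness vs. flexibility" balance: a near‑miss still earns partial credit instead of scoring zero.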

Model Evaluation

More than 20 open‑source and closed‑source models were tested, including GPT‑4o, GPT‑4o‑mini, Gemini‑2.0‑flash, Qwen series, InternVL series, VideoLLaMA series, LLaVA series, VideoRefer, Osprey, SPHINX‑V, and ViP‑LLaVA.

Model performance table

Key findings:

Most models perform poorly on object‑relationship evolution (ORE) and absolute time perception (ATP), often scoring below random guessing.

Adding explicit timestamps dramatically improves GPT‑4o and Gemini‑2.0‑flash, especially on past‑oriented tasks (+49.2% and +60.2%, respectively).

Larger models (e.g., 72B parameters) handle future prediction better than smaller ones.
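The timestamp finding above can be illustrated with a small prompt‑construction sketch: prepend each sampled frame's capture time so the model can anchor events on a timeline. The function name, frame placeholders, and prompt format are all illustrative assumptions, not the authors' actual protocol.

```python
# Sketch of timestamp augmentation for a video-QA prompt (format assumed).
# `frames` is a list of (timestamp_seconds, frame_reference) pairs, where the
# frame reference stands in for an image token or caption.
def build_timestamped_prompt(frames, question: str) -> str:
    lines = [f"[t={ts:.1f}s] {ref}" for ts, ref in frames]
    return "\n".join(lines) + "\nQuestion: " + question

prompt = build_timestamped_prompt(
    [(0.0, "<frame_0>"), (2.5, "<frame_1>"), (5.0, "<frame_2>")],
    "How long has the water been boiling?",
)
```

The intuition is that without timestamps the model must infer durations from frame order alone, whereas explicit times turn past‑oriented questions into simple arithmetic over the prompt.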

Error Analysis

Using GPT‑4o as a case study, errors are categorized into perception errors, memory errors, relationship‑reasoning errors, and knowledge errors.

Perception errors: Frame‑level visual confusion, counting mistakes, and intra‑frame interference.

Memory errors: Failure to recall information from previous frames, accounting for 93% of mistakes in the past category.

Relationship‑reasoning errors: Difficulty inferring even simple relations between objects.

Knowledge errors: Mistakes in commonsense, calculation, or factual reasoning.

Error type distribution

Temporal Accuracy Distribution

Density analysis shows human answers cluster tightly with low error, while top models exhibit broader, more random distributions, indicating limited temporal perception.

Temporal accuracy histogram

Conclusion

EOC‑Bench provides a comprehensive evaluation of MLLMs’ object‑level cognition in dynamic, ego‑centric scenes across past, present, and future dimensions. The benchmark’s diverse question types and multi‑scale temporal metric expose significant gaps in current models, especially in memory and absolute time perception, and aim to drive future research toward more robust embodied AI.

Tags: multimodal AI, temporal reasoning, dynamic perception, MLLM evaluation