Join the ACM MM 2026 EgoLink Challenge to Advance Egocentric Reasoning
The ACM MM 2026 EgoLink Grand Challenge invites researchers to tackle egocentric video understanding by evaluating social reasoning, causal inference, intent prediction, and multimodal interaction, offering two tracks that test perception‑reasoning‑action loops on real‑world first‑person datasets.
Motivation
First‑person (egocentric) perception is a prerequisite for embodied AI that must understand and act in real‑world social environments. Two concrete gaps motivate the EgoLink challenge:
Bridging video description and social cognition. Current multimodal models excel at scene description but struggle with long‑range causal reasoning, fine‑grained emotion understanding, and the dynamics of interpersonal interaction from egocentric video.
Closing the perception‑to‑action gap. Traditional benchmarks focus on static visual QA and ignore the closed loop of perception, reasoning, decision making, and execution that embodied agents need in real social scenarios.
Innovation
Comprehensive integrated intelligence evaluation: measures emotion perception, causal understanding, intent prediction, tool use, and autonomous planning.
Real‑world scenarios: built on authentic first‑person recordings of social life, including data captured by AI glasses.
Perception‑reasoning‑action coupling: requires tight end‑to‑end integration of perception, reasoning, and decision making in unstructured environments.
Track 1 – Social Reasoning in Egocentric Video
This track evaluates models on social reasoning rather than basic navigation or object detection. It uses the E3 (Exploring Embodied Emotion) dataset [1] to pose multiple‑choice questions that test emotional perception, causal inference, behavioral intent prediction, and egocentric semantic summarization.
Core evaluation dimensions:
Emotional perception and localization : identify emotion categories, temporal boundaries, intensity, and trajectories in egocentric video streams.
Social causal reasoning : analyze causal triggers behind observed social emotions and reactions, link eye‑gaze, body language, and speech to emotions, and infer deeper motivations.
Behavioral intent prediction : infer possible future intents and goals from multimodal context, predict actions based on current behavior and emotional state, and understand underlying social strategies.
Egocentric semantic summarization : generate high‑level summaries that capture the social plot, relationship types, and emotional tone from a first‑person perspective.
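Because Track 1 is posed as multiple‑choice questions, a local sanity check can be as simple as accuracy broken down by evaluation dimension. The sketch below is only an illustration: the field names (`id`, `dimension`, `answer`), the option letters, and the use of plain accuracy are assumptions, not the official submission format or metric, which this announcement does not specify.

```python
from collections import defaultdict


def per_dimension_accuracy(questions, predictions):
    """Compute accuracy per evaluation dimension.

    questions: list of dicts with hypothetical keys 'id', 'dimension', 'answer'.
    predictions: dict mapping question id -> chosen option letter.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        dim = q["dimension"]
        total[dim] += 1
        # A missing prediction is counted as a wrong answer.
        if predictions.get(q["id"]) == q["answer"]:
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy data; the dimension labels are made up for illustration.
questions = [
    {"id": "q1", "dimension": "emotional_perception", "answer": "B"},
    {"id": "q2", "dimension": "emotional_perception", "answer": "A"},
    {"id": "q3", "dimension": "causal_reasoning", "answer": "D"},
]
predictions = {"q1": "B", "q2": "C", "q3": "D"}

scores = per_dimension_accuracy(questions, predictions)
print(scores)
```

Reporting scores per dimension rather than as a single number makes it easier to see whether a model is weak on, say, causal reasoning while doing well on emotion recognition.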
Track 2 – Interactive Agent in Social Life Scenarios
This track evaluates an interactive agent that must solve real‑world tasks in dynamic social environments through multimodal dialogue and tool use. The agent receives egocentric video streams (e.g., shopping, ordering), natural‑language instructions, and a set of external tools. It must engage in multi‑turn dialogue, construct correct tool inputs from visual evidence, interpret tool outputs, and close the execution loop.
Core evaluation dimensions:
Fine‑grained egocentric visual understanding: temporal and spatial perception of object states, and attribute recognition (color, shape, texture, brand) from continuous video or image sequences.
Dynamic tool invocation and execution : decide when and which tool to invoke, build correct input parameters from visual evidence (e.g., price on screen, menu option), interpret tool output and choose the next action (complete, retry, or switch strategy).
Multi‑hop logical reasoning and complex decision making : handle conditional reasoning (“if … then …”) and multi‑reference resolution in long dialogues and multi‑object scenes (e.g., “the red one” or “the one we saw earlier”).
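The "complete, retry, or switch strategy" decision described for tool invocation can be pictured as a small control loop. The following is a minimal sketch under stated assumptions: `ToolResult`, `run_tool_loop`, the tool names `scan_barcode` and `lookup_price`, and their arguments are all hypothetical and are not taken from the Track 2 guide. In a real agent the arguments would be built from visual evidence rather than hard‑coded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ToolResult:
    ok: bool       # whether the tool call succeeded
    payload: str   # tool output, or an error message


def run_tool_loop(
    tools: Dict[str, Callable[..., ToolResult]],
    plan: List[Tuple[str, dict]],
    max_retries: int = 1,
):
    """Try each (tool_name, args) strategy in order.

    Each strategy is attempted up to max_retries + 1 times (retry);
    if it still fails, the loop moves to the next one (switch strategy).
    Returns (status, log), where log records every attempt.
    """
    log = []
    for name, args in plan:
        for attempt in range(max_retries + 1):
            result = tools[name](**args)
            log.append((name, attempt, result.ok))
            if result.ok:
                return "complete", log
    return "failed", log


# Hypothetical tools: the barcode scan always fails, the price lookup succeeds.
def scan_barcode(frame_id: int) -> ToolResult:
    return ToolResult(ok=False, payload="barcode not visible")


def lookup_price(item: str) -> ToolResult:
    return ToolResult(ok=True, payload=f"price found for {item}")


tools = {"scan_barcode": scan_barcode, "lookup_price": lookup_price}
plan = [
    ("scan_barcode", {"frame_id": 12}),
    ("lookup_price", {"item": "oat milk"}),
]

status, log = run_tool_loop(tools, plan)
print(status, log)
```

Keeping the attempt log separate from the final status makes it possible to inspect afterwards which tool calls were retried and where the agent switched strategy.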
Schedule
2026‑04‑15: E3 video dataset open for pre‑download.
2026‑04‑25: Registration opens.
2026‑06‑08: Test set released.
2026‑06‑22 to 06‑25: Final answer and report submission window for Track 1.
2026‑06‑15 to 06‑25: Final answer and report submission window for Track 2.
Resources
Dataset download (Track 1 training videos): https://github.com/Exploring-Embodied-Emotion-official/E3/blob/main/dataset/README.MD
Track 2 participation guide (PDF): https://github.com/ego-link/egolink2026/blob/main/doc/track2/Track2_Evaluation_Submission_Guide_ZH.pdf
Challenge homepage: https://ego-link.github.io/challenge2026
Reference
[1] Wang Lin, Yueying Feng, Wenkang Han, Tao Jin, Zhou Zhao, Fei Wu, Chang Yao, and Jingyuan Chen. E3: Exploring Embodied Emotion Through a Large‑Scale Egocentric Video Dataset. NeurIPS Datasets and Benchmarks Track, 2024.
