Join the ACM MM 2026 EgoLink Challenge to Advance Egocentric Reasoning

The ACM MM 2026 EgoLink Grand Challenge invites researchers to tackle egocentric video understanding. Its two tracks evaluate social reasoning, causal inference, intent prediction, and multimodal interaction, testing perception‑reasoning‑action loops on real‑world first‑person datasets.

AntTech

May 8, 2026

Motivation

First‑person (egocentric) perception is a prerequisite for embodied AI to perceive and act in real‑world social environments. Two concrete gaps motivate the EgoLink challenge:

Bridging video description and social cognition. Current multimodal models excel at scene description but struggle with long‑range causal reasoning, fine‑grained emotion understanding, and the dynamics of interpersonal interaction from egocentric video.

Closing the perception‑to‑action gap. Traditional benchmarks focus on static visual QA and ignore the closed loop of perception, reasoning, decision‑making, and execution that embodied agents need in real social scenarios.

Innovation

Comprehensive integrated intelligence evaluation: measures emotion perception, causal understanding, intent prediction, tool use, and autonomous planning.

Real‑world scenarios: built on authentic first‑person recordings of social life, including data captured by AI glasses.

Perception‑reasoning‑action coupling: requires tight end‑to‑end integration of perception, reasoning, and decision‑making in unstructured environments.

Track 1 – Social Reasoning in Egocentric Video

This track evaluates models on social reasoning rather than basic navigation or object detection. It uses the E3 (Exploring Embodied Emotion) dataset [1] to pose multiple‑choice questions that test emotional perception, causal inference, behavioral intent prediction, and egocentric semantic summarization.

Core evaluation dimensions:

Emotional perception and localization : identify emotion categories, temporal boundaries, intensity, and trajectories in egocentric video streams.

Social causal reasoning : analyze causal triggers behind observed social emotions and reactions, link eye‑gaze, body language, and speech to emotions, and infer deeper motivations.

Behavioral intent prediction : infer possible future intents and goals from multimodal context, predict actions based on current behavior and emotional state, and understand underlying social strategies.

Egocentric semantic summarization : generate high‑level summaries that capture the social plot, relationship types, and emotional tone from a first‑person perspective.
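
Because Track 1 poses multiple‑choice questions across these four dimensions, one natural way to inspect a model during development is per‑dimension accuracy. The sketch below is only an illustration under stated assumptions: the official metric and submission format are not given in this article, and the question IDs, dimension labels, and option letters are hypothetical.

```python
from collections import defaultdict


def per_dimension_accuracy(gold, predictions):
    """Compute accuracy per evaluation dimension.

    gold:        {question_id: (dimension, correct_option)}  (hypothetical format)
    predictions: {question_id: chosen_option}                (hypothetical format)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qid, (dimension, answer) in gold.items():
        total[dimension] += 1
        # A missing prediction is counted as wrong; comparison ignores case.
        if predictions.get(qid, "").strip().upper() == answer.upper():
            correct[dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy data, invented for illustration only.
gold = {
    "q1": ("emotion", "B"),
    "q2": ("emotion", "A"),
    "q3": ("causal", "D"),
    "q4": ("intent", "C"),
}
predictions = {"q1": "b", "q2": "C", "q3": "D"}  # q4 left unanswered

scores = per_dimension_accuracy(gold, predictions)
```

Reporting each dimension separately, rather than one pooled number, makes it easier to see whether a model is weak at, say, intent prediction while doing well on emotion recognition.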

Track 2 – Interactive Agent in Social Life Scenarios

This track evaluates an interactive agent that must solve real‑world tasks in dynamic social environments through multimodal dialogue and tool use. The agent receives egocentric video streams (e.g., shopping, ordering), natural‑language instructions, and a set of external tools. It must engage in multi‑turn dialogue, construct correct tool inputs from visual evidence, interpret tool outputs, and close the execution loop.

Core evaluation dimensions:

Fine‑grained egocentric visual understanding: temporal and spatial perception of object states and attribute recognition (color, shape, texture, brand) from continuous video or image sequences.

Dynamic tool invocation and execution: decide when and which tool to invoke, build correct input parameters from visual evidence (e.g., price on screen, menu option), interpret the tool output, and choose the next action (complete, retry, or switch strategy).

Multi‑hop logical reasoning and complex decision making: handle conditional reasoning (“if … then …”) and resolve multiple references (e.g., “the red one” or “the one we saw earlier”) in long dialogues and multi‑object scenes.
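
The tool‑invocation loop described above (build inputs from visual evidence, call a tool, interpret the result, then complete, retry, or switch strategy) can be sketched as follows. This is a minimal illustration, not the challenge's actual interface: the `place_order` tool, the `visual_evidence` keys, and the retry policy are all hypothetical stand‑ins.

```python
from dataclasses import dataclass


@dataclass
class ToolResult:
    ok: bool
    message: str


def place_order(item: str, price: float) -> ToolResult:
    """Hypothetical tool: rejects orders whose price is not positive."""
    if price <= 0:
        return ToolResult(False, "invalid price")
    return ToolResult(True, f"ordered {item} at {price:.2f}")


# Hypothetical tool registry; a real agent would receive its tools from the task.
TOOLS = {"place_order": place_order}


def run_agent(visual_evidence: dict, max_attempts: int = 2) -> list[str]:
    """Call the tool with inputs read from the video, then complete, retry, or switch."""
    log = []
    # Build tool inputs from (hypothetical) fields extracted from the video.
    args = {
        "item": visual_evidence["menu_option"],
        "price": visual_evidence["price_on_screen"],
    }
    for attempt in range(1, max_attempts + 1):
        result = TOOLS["place_order"](**args)
        log.append(f"attempt {attempt}: {result.message}")
        if result.ok:
            log.append("complete")
            return log
        # Retry with a re-read value, standing in for a second look at the frames.
        args["price"] = visual_evidence.get("price_reread", args["price"])
    # All attempts failed: fall back to dialogue instead of calling the tool again.
    log.append("switch strategy: ask the user")
    return log


recovered = run_agent({"menu_option": "latte", "price_on_screen": 0.0, "price_reread": 4.5})
gave_up = run_agent({"menu_option": "latte", "price_on_screen": -1.0})
```

In the first call the agent misreads the price, the tool rejects it, and a retry with the re‑read value completes the order; in the second, no better reading is available, so the agent stops retrying and switches to asking the user.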

Schedule

2026‑04‑15: E3 video dataset open for pre‑download.

2026‑04‑25: Registration opens.

2026‑06‑08: Test set released.

2026‑06‑22 to 06‑25: Final answer and report submission window for Track 1.

2026‑06‑15 to 06‑25: Final answer and report submission window for Track 2.

Resources

Dataset download (Track 1 training videos): https://github.com/Exploring-Embodied-Emotion-official/E3/blob/main/dataset/README.MD

Track 2 participation guide (PDF): https://github.com/ego-link/egolink2026/blob/main/doc/track2/Track2_Evaluation_Submission_Guide_ZH.pdf

Challenge homepage: https://ego-link.github.io/challenge2026

Reference

[1] Wang Lin, Yueying Feng, Wenkang Han, Tao Jin, Zhou Zhao, Fei Wu, Chang Yao, and Jingyuan Chen. E3: Exploring Embodied Emotion Through a Large‑Scale Egocentric Video Dataset. NeurIPS Datasets and Benchmarks Track, 2024.
