Can Multimodal LLMs Truly Understand Emotions? Inside the MME-Emotion Benchmark

The MME-Emotion benchmark, introduced by researchers from CUHK and Alibaba Tongyi and accepted at ICLR 2026, provides a large‑scale, multimodal evaluation of emotional intelligence in large language models, revealing current models’ limited emotion recognition and reasoning abilities across diverse real‑world scenarios.

Data Party THU

Background

Multimodal large language models (MLLMs) have progressed from image understanding to video analysis and speech dialogue. A key open question is whether these models can truly comprehend human emotions, which are expressed through a combination of facial expressions, vocal tone, and language.

MME-Emotion Benchmark

The Chinese University of Hong Kong and Alibaba Tongyi Lab introduced MME-Emotion, a large‑scale evaluation suite for emotional intelligence in MLLMs. The benchmark contains roughly 6,500 video clips with question‑answer pairs, covering 27 real‑world scenarios and eight distinct emotion‑related tasks.

Key resources:

Paper: "MME-Emotion: A Holistic Evaluation Benchmark For Emotional Intelligence in Multimodal Large Language Models"

Project homepage: https://mme-emotion.github.io

Code repository: https://github.com/FunAudioLLM/MME-Emotion

Dataset: https://huggingface.co/datasets/Karl28/MME-Emotion
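As a quick, hedged illustration of how the dataset above could be pulled down for local inspection (the split name and record fields below are assumptions; the dataset card documents the actual layout):

```python
# Minimal sketch: load the MME-Emotion dataset from the Hugging Face Hub.
# The split name ("test") and the printed fields are assumptions for
# illustration; check the dataset card for the real schema.
from datasets import load_dataset

ds = load_dataset("Karl28/MME-Emotion", split="test")
print(len(ds))   # roughly 6,500 question-answer samples over video clips
print(ds[0])     # inspect one record: video reference, question, answer
```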

Figure: Benchmark overview

Benchmark Design

The benchmark defines eight tasks, including laboratory‑environment emotion recognition, real‑world emotion recognition, noisy‑condition recognition, fine‑grained emotion classification, multi‑label emotion detection, sentiment polarity analysis, fine‑grained sentiment analysis, and intent recognition. Data distribution is balanced across tasks to ensure stable evaluation.

Unlike prior datasets that only measure label accuracy, MME-Emotion jointly evaluates emotion recognition and emotional reasoning. For example, when a video shows fear, the model must output the label "fear" and also cite supporting cues such as facial tension, voice tremor, or speech rate changes.
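To make the joint requirement concrete, the sketch below shows the kind of structured answer such scoring implicitly rewards; the field names and wording are illustrative assumptions, not the benchmark's official answer format.

```python
# Illustrative only: an emotion label paired with the multimodal cues that
# justify it. Field names are assumptions, not MME-Emotion's official schema.
example_answer = {
    "emotion": "fear",
    "evidence": [
        "facial tension around the eyes and brows",  # visual cue
        "trembling, higher-pitched voice",           # audio cue
        "faster speech rate with hesitations",       # linguistic cue
    ],
}
```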

Evaluation Methodology

Three unified metrics are used:

Recognition Score: accuracy of predicted emotion labels.

Reasoning Score: quality and relevance of the model’s explanatory steps.

Chain‑of‑Thought Score: combined assessment of recognition and reasoning.
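A minimal sketch of how these metrics relate, assuming exact label matching for recognition and a simple average for the combined score; the benchmark's actual formulas are defined in the paper and repository.

```python
# Assumption-laden sketch of the three metrics; MME-Emotion's exact scoring
# rules (label matching, reasoning rubric, weighting) may differ.
def recognition_score(pred_label: str, gold_label: str) -> float:
    """1.0 if the predicted emotion matches the ground-truth label, else 0.0."""
    return float(pred_label.strip().lower() == gold_label.strip().lower())

def chain_of_thought_score(recognition: float, reasoning: float) -> float:
    """Combine recognition and reasoning into one score (simple average assumed)."""
    return (recognition + reasoning) / 2.0

# Example: correct label, moderately grounded reasoning.
rec = recognition_score("Fear", "fear")
print(chain_of_thought_score(rec, reasoning=0.7))  # -> 0.85
```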

The evaluation pipeline employs a multi‑agent system that automatically:

Collects model responses to benchmark questions.

Extracts key reasoning steps from the textual output.

Aligns extracted steps with visual frames and audio cues from the video.

Computes the three scores without extensive manual annotation.
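The sketch below mirrors these four stages as plain functions; the interfaces are invented for illustration and are not the benchmark's actual multi‑agent implementation.

```python
# Schematic sketch of the four automated stages (interfaces are assumptions).
from typing import Callable, List, Tuple

def evaluate_sample(
    answer_fn: Callable[[str, str], str],         # 1. model answers (video_path, question)
    extract_fn: Callable[[str], List[str]],       # 2. pull key reasoning steps from the text
    align_fn: Callable[[List[str], str], float],  # 3. check steps against video/audio cues
    label_fn: Callable[[str, str], float],        # 4. score the predicted label vs. gold
    video_path: str,
    question: str,
    gold_label: str,
) -> Tuple[float, float, float]:
    response = answer_fn(video_path, question)
    steps = extract_fn(response)
    reasoning = align_fn(steps, video_path)
    recognition = label_fn(response, gold_label)
    cot = (recognition + reasoning) / 2.0  # combined score; simple averaging assumed
    return recognition, reasoning, cot
```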

Human experts reviewed a subset of samples; automatic scores showed high consistency with human judgments, confirming reliability.

Figure: Evaluation pipeline

Results

Twenty state‑of‑the‑art multimodal models were evaluated, including open‑source and proprietary systems such as GPT‑4o, Gemini series, and Qwen series. The best model achieved less than 40% on the Recognition Score and about 56% on the Chain‑of‑Thought Score, indicating substantial gaps in emotional understanding.

Common failure patterns were identified:

Insufficient fine‑grained visual understanding: models often confuse similar emotions (e.g., fear vs. surprise) due to limited perception of subtle facial cues.

Limited multimodal fusion: performance drops when visual, auditory, and textual signals must be integrated simultaneously.

Reasoning‑recognition correlation: models that provide richer, more coherent reasoning tend to obtain higher overall emotional intelligence scores.

Figure: Result analysis

Challenges

Current MLLMs excel at visual and linguistic tasks but struggle with the nuanced, multimodal cues required for accurate emotion detection and explanation. Specific challenges include:

Capturing subtle facial dynamics and micro‑expressions.

Robust audio‑visual fusion under noisy or real‑world conditions.

Reasoning mechanisms that can explicitly link observed cues to emotional states.

Future Directions

Advancing multimodal emotional intelligence is likely to require:

Higher‑resolution visual modeling to capture fine‑grained facial movements.

More effective fusion architectures that jointly process video, audio, and text.

Explicit reasoning modules that generate traceable explanations for emotion predictions.

Improvements in these areas could enable applications in education, human‑computer interaction, and medical assistance where understanding user emotions is critical.

Conclusion

MME-Emotion provides the first large‑scale, holistic benchmark for assessing emotional intelligence in multimodal large models, establishing a clear baseline and a roadmap for future research.

Code example
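The end‑to‑end sketch below ties the pieces above together: iterate over benchmark samples, query a multimodal model, and aggregate the three scores. The sample fields and the model/judge interfaces are hypothetical placeholders; the official evaluation code lives in the GitHub repository linked above.

```python
# Hedged end-to-end sketch: aggregate Recognition, Reasoning, and
# Chain-of-Thought scores over a list of samples. The `model` and `judge`
# objects and the sample field names are assumptions for illustration.
from statistics import mean

def run_benchmark(samples, model, judge):
    recognition, reasoning = [], []
    for s in samples:
        response = model.answer(s["video"], s["question"])             # query the MLLM
        recognition.append(judge.score_label(response, s["answer"]))   # label accuracy
        reasoning.append(judge.score_reasoning(response, s["video"]))  # grounded reasoning
    rec, rea = mean(recognition), mean(reasoning)
    return {
        "recognition_score": rec,
        "reasoning_score": rea,
        "cot_score": (rec + rea) / 2.0,  # combined score; averaging is an assumption
    }
```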
