Can Multimodal LLMs Truly Understand Human Emotions? Introducing the MME-Emotion Benchmark

This article presents MME-Emotion, a large‑scale multimodal benchmark that evaluates both emotion recognition and reasoning abilities of multimodal large language models across 27 real‑world scenarios, revealing current models’ significant gaps in emotional intelligence and outlining future research directions.

Background

Multimodal large language models (MLLMs) have progressed from image understanding to video analysis and speech dialogue, but it remains unclear whether they can truly comprehend human emotions. Emotions are expressed through facial expressions, vocal tone, and linguistic cues, requiring simultaneous processing of visual, auditory, and textual signals.

MME‑Emotion Benchmark

The Chinese University of Hong Kong and Alibaba Tongyi Lab introduced MME‑Emotion, a comprehensive benchmark for evaluating emotional intelligence in multimodal LLMs (accepted at ICLR 2026). The benchmark comprises:

~6,500 video clips with question‑answer pairs.

27 real‑world scenarios.

Eight emotion‑related tasks: lab‑environment recognition, real‑scene recognition, noisy‑condition recognition, fine‑grained recognition, multi‑label recognition, sentiment analysis, fine‑grained sentiment analysis, and intent recognition.

Resources:

Project homepage: https://mme-emotion.github.io

Code repository: https://github.com/FunAudioLLM/MME-Emotion

Dataset: https://huggingface.co/datasets/Karl28/MME-Emotion
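
For a first look at the data, the dataset can be pulled from the Hub with the standard `datasets` API. This is a minimal sketch only: the split name and field layout are assumptions, so check the dataset card for the actual schema before relying on them.

```python
# Minimal sketch: loading MME-Emotion from the Hugging Face Hub.
# The split name ("test") and any field names are assumptions;
# consult the dataset card for the real schema.
from datasets import load_dataset

ds = load_dataset("Karl28/MME-Emotion", split="test")

sample = ds[0]
print(sample.keys())  # inspect the actual fields first
print(len(ds))        # expect roughly 6,500 QA pairs per the paper
```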
Figure: MME-Emotion overview

Evaluation Metrics

Recognition Score: accuracy of emotion label prediction.

Reasoning Score: quality of the model’s inferred reasoning steps.

Chain‑of‑Thought Score: a combined measure of recognition and reasoning.
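
As a rough illustration of how these metrics fit together, here is a sketch under stated assumptions: recognition is per-sample label accuracy, reasoning is a 0–1 judge rating, and the Chain‑of‑Thought score is taken as their plain average. The paper's exact aggregation may differ.

```python
# Sketch of the three MME-Emotion scores. The simple average used for
# the Chain-of-Thought score is an assumption, not the paper's formula.
from statistics import mean

def recognition_score(preds: list[str], golds: list[str]) -> float:
    """Accuracy of predicted emotion labels."""
    return mean(p == g for p, g in zip(preds, golds))

def cot_score(recognition: float, reasoning: float) -> float:
    """Assumed combination: the mean of recognition and reasoning."""
    return (recognition + reasoning) / 2

rec = recognition_score(["fear", "joy"], ["surprise", "joy"])  # 0.5
rea = 0.62  # e.g., mean judge rating of reasoning-step quality
print(cot_score(rec, rea))  # 0.56
```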

Automated Multi‑Agent Evaluation Pipeline

The pipeline automatically collects model responses, extracts reasoning steps, and integrates video‑frame and audio cues to compute the three scores, greatly reducing the need for manual annotation. Human expert evaluation on a subset showed high agreement with the automated scores.
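
Structurally, such a pipeline can be pictured as judge "agents" chained together: one extracts reasoning steps, another rates them against cues recovered from the clip. The sketch below is illustrative only; `ask`, the prompts, and the cue field names are hypothetical stand-ins, not the paper's actual code.

```python
# Illustrative multi-agent evaluation loop. Everything here is a
# hypothetical stand-in for the paper's pipeline: `ask` is a placeholder
# judge-model call, and the cue field names are assumptions.

def ask(prompt: str) -> str:
    """Placeholder for a chat-completion call to the judge MLLM."""
    raise NotImplementedError("plug in your judge model here")

def extract_steps(response: str) -> list[str]:
    """Agent 1: split a model response into discrete reasoning steps."""
    prompt = "List the reasoning steps in this answer, one per line:\n" + response
    return ask(prompt).splitlines()

def score_steps(steps: list[str], visual_cues: str, audio_cues: str) -> float:
    """Agent 2: rate the steps (0-1) against cues recovered from the clip."""
    prompt = (
        f"Visual cues: {visual_cues}\nAudio cues: {audio_cues}\n"
        "Reasoning steps:\n" + "\n".join(steps) +
        "\nReturn a single correctness score between 0 and 1."
    )
    return float(ask(prompt))

def evaluate(sample: dict, response: str) -> tuple[float, float, float]:
    """Compute (recognition, reasoning, chain-of-thought) for one clip."""
    predicted = response.strip().split()[-1].strip(".").lower()  # naive label pick
    rec = float(predicted == sample["answer"])
    rea = score_steps(extract_steps(response),
                      sample["visual_cues"], sample["audio_cues"])
    return rec, rea, (rec + rea) / 2  # same assumed combination as above
```

Keeping extraction and scoring as separate agents keeps each judge prompt simple and leaves the intermediate reasoning steps auditable, which is what allows a human expert to spot-check agreement on a subset.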

Figure: Evaluation pipeline

Experimental Findings

Twenty state‑of‑the‑art multimodal models (including GPT‑4o, the Gemini series, and the Qwen series) were evaluated. The best model achieved a recognition score below 40% and a chain‑of‑thought score of around 56%, indicating substantial room for improvement. Three findings stand out:

Insufficient fine‑grained visual understanding: models often confuse similar emotions such as fear and surprise due to limited perception of subtle facial cues.

Limited multimodal fusion: performance drops when visual and auditory information must be combined, revealing challenges in integrating emotional signals across modalities.

Correlation between reasoning and recognition: models that provide more complete reasoning tend to achieve higher overall emotional‑intelligence scores, suggesting that stronger reasoning mechanisms could boost emotion understanding.

Figure: Model performance

Future Research Directions

Higher‑precision visual detail modeling.

More effective audio‑visual fusion techniques.

Reasoning mechanisms that can explicitly explain the causes of emotions.

Advancements in emotional intelligence are expected to benefit applications such as education, human‑computer interaction, and medical assistance, where understanding user emotions is crucial.
