Can a Tree‑Reasoned Model Master Video Emotion Understanding?
The paper introduces VidEmo, a multimodal video foundation model that uses a two‑stage emotion‑clue‑guided reasoning framework and a large emotion‑centric dataset (Emo‑CFG) to achieve state‑of‑the‑art performance on facial attribute, expression, and fine‑grained emotion tasks, surpassing Gemini 2.0.
Research Background
Understanding and predicting human emotions from dynamic videos is essential for human‑computer interaction, surveillance, and healthcare. Existing methods perform well on basic emotion classification but struggle with the dynamic, context‑dependent nature of emotions. Current video foundation models lack high‑level emotional reasoning.
Problem Statement
State‑of‑the‑art video models such as Gemini 2.0 achieve only 26.3% accuracy on fine‑grained emotion analysis, indicating a large performance gap for modeling the relationship between facial cues and complex emotional states.
Proposed Method: VidEmo Framework
VidEmo introduces an emotion‑clue‑guided reasoning framework that unifies three core components in a staged manner (a minimal inference sketch follows the list):
Basic attribute perception
Facial expression analysis
High‑level emotion understanding
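As a rough illustration of this staging, the sketch below chains the three components so that each stage conditions on the clues produced by the previous one. The `model.generate` interface, task names, and output types are assumptions for illustration, not the VidEmo API.

```python
from dataclasses import dataclass

@dataclass
class EmotionClues:
    attributes: dict   # e.g. {"age": "adult", "gender": "female"}
    expressions: list  # e.g. ["smile", "raised eyebrows"]
    emotion: str       # e.g. "pleasant surprise"

def reason_over_video(model, video_frames) -> EmotionClues:
    # Stage 1: perceive basic facial attributes as low-level clues.
    attributes = model.generate(video_frames, task="attribute_perception")
    # Stage 2: analyze facial expressions, conditioned on the attributes.
    expressions = model.generate(video_frames, task="expression_analysis",
                                 context=attributes)
    # Stage 3: infer the fine-grained emotion from the accumulated clues.
    emotion = model.generate(video_frames, task="emotion_understanding",
                             context={"attributes": attributes,
                                      "expressions": expressions})
    return EmotionClues(attributes, expressions, emotion)
```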
The training consists of two phases.
Pre‑training (Curriculum Emotion Learning)
The curriculum gradually injects emotional knowledge: the model first learns facial attributes, then facial expressions, and finally complex emotions. Ordering tasks from easy to hard balances task difficulty and reduces confusion between similar cues.
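A minimal sketch of that easy-to-hard schedule is below. The stage order (attributes, then expressions, then emotions) comes from the paper; the stage/dataset names and the `train_one_stage` helper are placeholders.

```python
# Three-stage curriculum: train on easier clue types first so later,
# harder stages build on already-learned cues instead of confusing them.
CURRICULUM = [
    {"stage": "attribute",  "data": "attribute_perception_qa"},
    {"stage": "expression", "data": "expression_analysis_qa"},
    {"stage": "emotion",    "data": "emotion_understanding_qa"},
]

def run_curriculum(model, datasets, train_one_stage):
    for stage in CURRICULUM:
        train_one_stage(model, datasets[stage["data"]])
```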
Post‑training (Emotion‑Tree Reinforcement Learning)
An emotion‑tree structure refines reasoning. The model samples outputs, evaluates them with multiple reward signals (group‑wise advantage, QA accuracy/F1, short‑description quality, and tree‑edit distance), and updates the policy via GRPO‑style reinforcement learning.
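To make the update signal concrete, here is a hedged sketch that assumes the combined reward is a weighted sum of the components named above and computes the group-relative advantages characteristic of GRPO. The weights and the way component scores are produced are assumptions, not the paper's exact formulation.

```python
import numpy as np

def combined_reward(qa_score, desc_score, tree_score,
                    w_qa=1.0, w_desc=0.5, w_tree=0.5):
    # Weighted sum of the reward components (weights are placeholders):
    # QA accuracy/F1, short-description quality, tree-edit distance score.
    return w_qa * qa_score + w_desc * desc_score + w_tree * tree_score

def group_advantages(rewards):
    # GRPO-style: normalize each sampled output's reward against its
    # group's mean and std, so no learned value function (critic) is needed.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four sampled outputs for one video, scored and normalized.
rewards = [combined_reward(1.0, 0.7, 0.8), combined_reward(0.0, 0.4, 0.3),
           combined_reward(1.0, 0.9, 0.6), combined_reward(0.0, 0.2, 0.5)]
adv = group_advantages(rewards)  # per-sample advantages within the group
```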
Emotion‑Centric Dataset: Emo‑CFG
Emo‑CFG is a fine‑grained video instruction dataset containing 2.1 million samples covering attribute perception, expression analysis, and emotion understanding. Data are aggregated from 17 public video datasets and retain metadata such as face bounding boxes, duration, resolution, and frame rate. Captions and QA pairs are generated with Gemini 2.0 and GPT‑4o, followed by a critic‑based voting mechanism to ensure label quality.
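Since the dataset is hosted on Hugging Face (see Resources below), it can presumably be loaded with the `datasets` library. The repository id comes from the linked page, but the split name and field names below are assumptions about the schema.

```python
from datasets import load_dataset

# Repository id from the resources section; split name is an assumption.
ds = load_dataset("KlingTeam/Emo-CFG", split="train")

sample = ds[0]
# Per-sample metadata described above (field names assumed):
#   sample["face_bbox"], sample["duration"], sample["resolution"],
#   sample["fps"], plus the instruction/QA text fields.
print(sample.keys())
```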
Training Procedure
During pre‑training, the model follows a three‑stage curriculum: (I) attribute adjustment, (II) expression adjustment, (III) emotion adjustment. In post‑training, the current policy samples a batch of outputs, which are scored by the reward components described above, and the policy is updated to maximize the combined objective.
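The post-training loop can be summarized as the sketch below, reusing `group_advantages` from the GRPO sketch above. The helpers `policy.sample`, `score`, and `grpo_update` are hypothetical stand-ins for the actual implementation.

```python
def post_train_step(policy, batch_of_videos, group_size=8):
    for video in batch_of_videos:
        # 1. Sample a group of candidate reasoning outputs per video.
        outputs = [policy.sample(video) for _ in range(group_size)]
        # 2. Score each output with the combined reward (QA accuracy/F1,
        #    short-description quality, tree-edit distance).
        rewards = [score(out) for out in outputs]
        # 3. Compute group-relative advantages and update the policy to
        #    maximize the combined objective (GRPO-style).
        grpo_update(policy, outputs, group_advantages(rewards))
```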
Experimental Results
VidEmo outperforms 18 mainstream video LLMs on 15 facial perception tasks, including attribute perception, expression analysis, and fine‑grained emotion understanding. It achieves the highest scores on public emotion classification benchmarks DFEW and MAFW and shows superior performance on the Emo‑CFG test set across all three task categories.
Resources
Paper: https://arxiv.org/html/2511.02712
Project page: https://zzcheng.top/VidEmo
Model checkpoints: https://huggingface.co/KlingTeam/VidEmo-3B, https://huggingface.co/KlingTeam/VidEmo-7B
Dataset: https://huggingface.co/datasets/KlingTeam/Emo-CFG
Code repository: https://github.com/KlingTeam/VidEmo
Git clone command:
git clone https://github.com/KlingTeam/VidEmo.git
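The checkpoint ids above suggest the models can be pulled from the Hub; the sketch below uses the generic `transformers` Auto classes, but whether VidEmo loads through these classes (and requires `trust_remote_code`) is an assumption, so check the repository README for the supported API.

```python
from transformers import AutoProcessor, AutoModelForCausalLM

# Checkpoint id from the links above; loading path is an assumption.
model_id = "KlingTeam/VidEmo-7B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```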
