Can a Tree‑Reasoned Model Master Video Emotion Understanding?
The paper introduces VidEmo, a multimodal video foundation model that uses a two‑stage emotion‑clue‑guided reasoning framework and a large emotion‑centric dataset (Emo‑CFG) to achieve state‑of‑the‑art performance on facial attribute, expression, and fine‑grained emotion tasks, surpassing Gemini 2.0.
Research Background
Understanding and predicting human emotions from dynamic videos is essential for human‑computer interaction, surveillance, and healthcare. Existing methods perform well on basic emotion classification but struggle with the dynamic, context‑dependent nature of emotions. Current video foundation models lack high‑level emotional reasoning.
Problem Statement
State‑of‑the‑art video models such as Gemini 2.0 achieve only 26.3% accuracy on fine‑grained emotion analysis, indicating a large performance gap for modeling the relationship between facial cues and complex emotional states.
Proposed Method: VidEmo Framework
VidEmo introduces an emotion‑clue‑guided reasoning framework that unifies three core components in a staged manner (a minimal inference sketch follows the list):
Basic attribute perception
Facial expression analysis
High‑level emotion understanding
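As a rough illustration of this staging, the sketch below chains the three components so that each stage conditions on the clues produced by the previous one. The `model.generate` interface, task names, and output types are assumptions for illustration, not the VidEmo API.

```python
from dataclasses import dataclass

@dataclass
class EmotionClues:
    attributes: dict   # e.g. {"age": "adult", "gender": "female"}
    expressions: list  # e.g. ["smile", "raised eyebrows"]
    emotion: str       # e.g. "pleasant surprise"

def reason_over_video(model, video_frames) -> EmotionClues:
    # Stage 1: perceive basic facial attributes as low-level clues.
    attributes = model.generate(video_frames, task="attribute_perception")
    # Stage 2: analyze facial expressions, conditioned on the attributes.
    expressions = model.generate(video_frames, task="expression_analysis",
                                 context=attributes)
    # Stage 3: infer the fine-grained emotion from the accumulated clues.
    emotion = model.generate(video_frames, task="emotion_understanding",
                             context={"attributes": attributes,
                                      "expressions": expressions})
    return EmotionClues(attributes, expressions, emotion)
```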
The training consists of two phases.
Pre‑training (Curriculum Emotion Learning)
The curriculum gradually injects emotional knowledge: the model first learns facial attributes, then facial expressions, and finally complex emotions. Ordering tasks from easy to hard balances task difficulty and reduces confusion between similar cues.
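A minimal sketch of that easy-to-hard schedule is below. The stage order (attributes, then expressions, then emotions) comes from the paper; the stage/dataset names and the `train_one_stage` helper are placeholders.

```python
# Three-stage curriculum: train on easier clue types first so later,
# harder stages build on already-learned cues instead of confusing them.
CURRICULUM = [
    {"stage": "attribute",  "data": "attribute_perception_qa"},
    {"stage": "expression", "data": "expression_analysis_qa"},
    {"stage": "emotion",    "data": "emotion_understanding_qa"},
]

def run_curriculum(model, datasets, train_one_stage):
    for stage in CURRICULUM:
        train_one_stage(model, datasets[stage["data"]])
```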
Post‑training (Emotion‑Tree Reinforcement Learning)
An emotion‑tree structure refines reasoning. The model samples outputs, evaluates them with multiple reward signals (group‑wise advantage, QA accuracy/F1, short‑description quality, and tree‑edit distance), and updates the policy via GRPO‑style reinforcement learning.
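To make the update signal concrete, here is a hedged sketch that assumes the combined reward is a weighted sum of the components named above and computes the group-relative advantages characteristic of GRPO. The weights and the way component scores are produced are assumptions, not the paper's exact formulation.

```python
import numpy as np

def combined_reward(qa_score, desc_score, tree_score,
                    w_qa=1.0, w_desc=0.5, w_tree=0.5):
    # Weighted sum of the reward components (weights are placeholders):
    # QA accuracy/F1, short-description quality, tree-edit distance score.
    return w_qa * qa_score + w_desc * desc_score + w_tree * tree_score

def group_advantages(rewards):
    # GRPO-style: normalize each sampled output's reward against its
    # group's mean and std, so no learned value function (critic) is needed.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four sampled outputs for one video, scored and normalized.
rewards = [combined_reward(1.0, 0.7, 0.8), combined_reward(0.0, 0.4, 0.3),
           combined_reward(1.0, 0.9, 0.6), combined_reward(0.0, 0.2, 0.5)]
adv = group_advantages(rewards)  # per-sample advantages within the group
```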
Emotion‑Centric Dataset: Emo‑CFG
Emo‑CFG is a fine‑grained video instruction dataset containing 2.1 million samples covering attribute perception, expression analysis, and emotion understanding. Data are aggregated from 17 public video datasets and retain metadata such as face bounding boxes, duration, resolution, and frame rate. Captions and QA pairs are generated with Gemini 2.0 and GPT‑4o, followed by a critic‑based voting mechanism to ensure label quality.
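Since the dataset is hosted on Hugging Face (see Resources below), it can presumably be loaded with the `datasets` library. The repository id comes from the linked page, but the split name and field names below are assumptions about the schema.

```python
from datasets import load_dataset

# Repository id from the resources section; split name is an assumption.
ds = load_dataset("KlingTeam/Emo-CFG", split="train")

sample = ds[0]
# Per-sample metadata described above (field names assumed):
#   sample["face_bbox"], sample["duration"], sample["resolution"],
#   sample["fps"], plus the instruction/QA text fields.
print(sample.keys())
```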
Training Procedure
During pre‑training, the model follows a three‑stage curriculum: (I) attribute adjustment, (II) expression adjustment, (III) emotion adjustment. In post‑training, the current policy samples a batch of outputs, which are scored by the reward components described above, and the policy is updated to maximize the combined objective.
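The post-training loop can be summarized as the sketch below, reusing `group_advantages` from the GRPO sketch above. The helpers `policy.sample`, `score`, and `grpo_update` are hypothetical stand-ins for the actual implementation.

```python
def post_train_step(policy, batch_of_videos, group_size=8):
    for video in batch_of_videos:
        # 1. Sample a group of candidate reasoning outputs per video.
        outputs = [policy.sample(video) for _ in range(group_size)]
        # 2. Score each output with the combined reward (QA accuracy/F1,
        #    short-description quality, tree-edit distance).
        rewards = [score(out) for out in outputs]
        # 3. Compute group-relative advantages and update the policy to
        #    maximize the combined objective (GRPO-style).
        grpo_update(policy, outputs, group_advantages(rewards))
```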
Experimental Results
VidEmo outperforms 18 mainstream video LLMs on 15 facial perception tasks, including attribute perception, expression analysis, and fine‑grained emotion understanding. It achieves the highest scores on public emotion classification benchmarks DFEW and MAFW and shows superior performance on the Emo‑CFG test set across all three task categories.
Resources
Paper: https://arxiv.org/html/2511.02712
Project page: https://zzcheng.top/VidEmo
Model checkpoints: https://huggingface.co/KlingTeam/VidEmo-3B, https://huggingface.co/KlingTeam/VidEmo-7B
Dataset: https://huggingface.co/datasets/KlingTeam/Emo-CFG
Code repository: https://github.com/KlingTeam/VidEmo
Git clone command:
git clone https://github.com/KlingTeam/VidEmo.git
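The checkpoint ids above suggest the models can be pulled from the Hub; the sketch below uses the generic `transformers` Auto classes, but whether VidEmo loads through these classes (and requires `trust_remote_code`) is an assumption, so check the repository README for the supported API.

```python
from transformers import AutoProcessor, AutoModelForCausalLM

# Checkpoint id from the links above; loading path is an assumption.
model_id = "KlingTeam/VidEmo-7B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```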
