How MODA’s Modular Duplex Attention Boosts Multimodal Emotion Understanding
The paper introduces MODA, a multimodal large model that tackles attention imbalance across modalities with a modular duplex attention mechanism. MODA achieves significant gains on perception, cognition, and emotion tasks across 21 benchmarks and shows strong potential for human‑machine interaction.
Research Background
Emotion‑aware artificial intelligence is a key step toward general AI: digital agents must accurately interpret multimodal interaction cues and infer human emotional states. Existing multimodal large models, however, suffer from a severe attention bias toward the textual modality, which limits fine‑grained emotional understanding.
Multimodal Attention Imbalance
Analysis shows that attention scores increasingly favor the text modality as layers deepen, with cross‑modal attention discrepancy reaching up to 63% and attention scores differing by up to tenfold between modalities. This imbalance leads to poor performance on tasks requiring fine‑grained visual perception and emotional inference.
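The imbalance described above can be made concrete by measuring how much attention mass queries place on each modality's tokens. The helper below is a hypothetical illustration (the paper's exact metric is not specified in this summary): given a row‑stochastic attention matrix and per‑key modality labels, it averages the attention each query spends on visual versus text keys.

```python
import numpy as np

def modality_attention_share(attn, modality_ids):
    """Average attention mass queries place on visual vs. text keys.

    attn: (num_queries, num_keys) row-stochastic attention matrix.
    modality_ids: length-num_keys labels, 0 = visual token, 1 = text token.
    (Hypothetical helper; the paper's exact discrepancy metric may differ.)
    """
    attn = np.asarray(attn, dtype=float)
    modality_ids = np.asarray(modality_ids)
    visual_share = attn[:, modality_ids == 0].sum(axis=1).mean()
    text_share = attn[:, modality_ids == 1].sum(axis=1).mean()
    return visual_share, text_share

# Toy example: 2 queries over 4 keys (first 2 visual, last 2 text).
attn = np.array([[0.05, 0.05, 0.45, 0.45],
                 [0.10, 0.10, 0.40, 0.40]])
v, t = modality_attention_share(attn, [0, 0, 1, 1])
# Here text keys absorb most of the attention mass, mirroring the
# text-dominant skew the analysis reports in deeper layers.
```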
Modular Duplex Attention
To address the imbalance, the authors propose a modular duplex attention paradigm that separates multimodal attention into a modality‑alignment component and a token‑focus correction component. The design introduces V‑Aligner and T‑Aligner modules to align visual and textual tokens via normalized Gram matrices, and a modular attention mask that controls token flow across transformer layers.
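The two ingredients above, token alignment via normalized Gram matrices and a mask that gates cross‑modal token flow, can be sketched as follows. This is a minimal, assumed reading of the design: the function names (`normalized_gram`, `modular_mask`) and the exact use of the Gram matrix as a reweighting operator are illustrative, not the paper's verified implementation.

```python
import numpy as np

def normalized_gram(tokens, eps=1e-8):
    """L2-normalize token embeddings row-wise, then take the Gram matrix.

    tokens: (n, d) array. Returns an (n, n) cosine-similarity matrix,
    one plausible reading of the aligners' "normalized Gram matrix".
    """
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + eps)
    return t @ t.T

def align_tokens(visual, text):
    """Hypothetical V-Aligner / T-Aligner step: reweight each modality's
    tokens by its own normalized Gram matrix before fused attention."""
    return normalized_gram(visual) @ visual, normalized_gram(text) @ text

def modular_mask(modality_ids, allow_cross=True):
    """Boolean attention mask (True = may attend). With allow_cross=False,
    tokens attend only within their own modality -- a hypothetical sketch
    of per-layer control over cross-modal token flow."""
    m = np.asarray(modality_ids)
    same = m[:, None] == m[None, :]
    return same | allow_cross
```

In this sketch, alternating `allow_cross` across transformer layers would let some layers fuse modalities while others refine each modality in isolation, which is one way a modular mask could counteract text‑dominant attention drift.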
Experimental Results
Extensive experiments on six task categories (general dialogue, knowledge QA, table OCR, visual perception, cognition, and emotion) across 21 benchmarks show that MODA consistently outperforms baseline models. The modular duplex attention reduces cross‑modal attention discrepancy from 56%/62% to 50%/41% and yields notable gains in content awareness, cognitive analysis, and emotional understanding.
Conclusion
The MODA model demonstrates that a modular duplex attention mechanism can effectively mitigate multimodal attention misalignment, enhance fine‑grained perception, and improve human‑machine dialogue quality, offering a versatile foundation for future multimodal AI applications.