How MODA’s Modular Duplex Attention Solves Multimodal Attention Imbalance and Boosts Emotion Understanding
The paper introduces MODA, a multimodal model built on modular duplex attention. It diagnoses the severe cross-modal attention imbalance in existing large multimodal models, proposes a new attention paradigm and masking scheme to correct it, and demonstrates significant gains across 21 benchmarks spanning perception, cognition, and emotion tasks. The work was accepted as a Spotlight paper at ICML 2025.
Introduction
Emotion‑aware artificial intelligence, often called "emotion‑intelligent AI," is a key step toward general AI. In human‑machine interaction, digital agents must accurately interpret multimodal cues and infer fine‑grained human emotions to sustain natural dialogue. However, existing large multimodal models suffer from severe cross‑modal attention imbalance, which limits their ability to capture subtle emotional signals.
Multimodal Attention Imbalance
Analysis of four fine‑grained tasks shows that current models allocate far more attention to the textual modality than to visual inputs: cross‑modal attention divergence reaches up to 63%, and attention scores can differ by an order of magnitude across layers. This imbalance stems from modality‑biased pre‑training and leads to poor performance on tasks that require detailed visual or emotional reasoning.
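To make this diagnosis concrete, here is one way such an imbalance could be measured from a model's attention maps. This is an illustrative sketch, not the paper's code: the function name, tensor shapes, and the index split between visual and textual tokens are all assumptions.

```python
import torch

def modality_attention_share(attn, visual_idx, text_idx):
    """attn: (layers, heads, queries, keys) attention weights.
    Returns per-layer attention mass placed on visual vs. textual keys."""
    per_key = attn.mean(dim=(1, 2))            # average over heads and queries
    vis = per_key[:, visual_idx].sum(dim=-1)   # mass on image tokens
    txt = per_key[:, text_idx].sum(dim=-1)     # mass on text tokens
    return vis, txt

# Toy example: 24 layers, 16 heads; softmax makes each row sum to 1,
# mimicking real attention. Assume keys 0-59 are visual, 60-99 textual.
attn = torch.rand(24, 16, 8, 100).softmax(dim=-1)
vis, txt = modality_attention_share(
    attn, torch.arange(0, 60), torch.arange(60, 100))
divergence = (txt - vis).abs() / (txt + vis)   # 0 = balanced, near 1 = one-sided
print(divergence)                              # one value per layer
```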
Modular Duplex Attention
To remedy this imbalance, the authors propose a modular duplex attention paradigm that splits multimodal attention into two components: a modality‑alignment branch and a token‑focus correction branch. The alignment branch uses normalized Gram matrices to extract basis vectors for each modality (a V‑Aligner for vision, a T‑Aligner for text) and aligns token representations across modalities. The correction branch applies a modular attention mask that stores unnecessary attention scores as pseudo‑scores, bounds the attention span of each query row, and injects modality‑specific positional priors.
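The sketch below renders both branches in code as one reading of the description above. It is not the authors' implementation: extracting the basis via an eigendecomposition of the normalized Gram matrix, using the row mean as the stored pseudo-score, and the fixed window length are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def modality_basis(tokens, rank=8):
    """Alignment branch: eigendecompose a normalized Gram matrix and
    return an orthonormal basis for the modality's dominant directions."""
    x = F.normalize(tokens, dim=-1)            # (n, d) unit-norm tokens
    gram = x @ x.T                             # normalized Gram matrix
    _, vecs = torch.linalg.eigh(gram)          # eigenvectors, ascending order
    coeffs = vecs[:, -rank:]                   # top-rank token combinations
    return F.normalize(coeffs.T @ x, dim=-1)   # (rank, d) basis vectors

def align_tokens(tokens, basis):
    """Project tokens onto the modality basis (the V-Aligner / T-Aligner role)."""
    return (tokens @ basis.T) @ basis

def duplex_attention(q, k, v, window=16, prior=None):
    """Correction branch: bound each query row's attention span; out-of-window
    scores are replaced by a pseudo-score (here the row mean) rather than -inf,
    and an optional modality positional prior is added to the logits."""
    scores = (q @ k.T) / q.shape[-1] ** 0.5    # (nq, nk) raw attention logits
    if prior is not None:
        scores = scores + prior                # modality-specific positional prior
    nq, nk = scores.shape
    offset = torch.arange(nk) - torch.arange(nq)[:, None]
    keep = offset.abs() < window               # bounded per-row attention span
    pseudo = scores.mean(dim=-1, keepdim=True) # stored pseudo-score per row
    scores = torch.where(keep, scores, pseudo.expand_as(scores))
    return scores.softmax(dim=-1) @ v

# Toy usage: align 60 visual and 40 textual tokens, then attend jointly.
vis, txt = torch.randn(60, 64), torch.randn(40, 64)
x = torch.cat([align_tokens(vis, modality_basis(vis)),    # V-Aligner
               align_tokens(txt, modality_basis(txt))],   # T-Aligner
              dim=0)
out = duplex_attention(x, x, x)                # (100, 64) fused output
```

Replacing the usual -inf mask with a stored pseudo-score keeps a small, uniform probability mass on out-of-window keys, so neither modality's tokens are ever cut off entirely, which is one plausible way such a mask could counteract the text-heavy bias described earlier.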
Experiments
Extensive experiments on six task families (general dialogue, knowledge QA, table OCR, visual perception, cognition, emotion) covering 21 benchmarks demonstrate that MODA consistently outperforms baseline multimodal models. The modular duplex attention reduces cross‑modal attention divergence from 56%/62% to 50%/41% and yields large gains in content‑aware, cognitive, and emotional understanding metrics.
Results and Applications
MODA achieves state‑of‑the‑art results on all evaluated datasets, with notable improvements in fine‑grained content perception, role cognition, and emotion understanding. The model has been deployed in Kuaishou's Keling data‑sensing project, enhancing fine‑grained sentiment detection and personalized recommendation, and it powers real‑time human‑machine dialogue scenarios, enabling nuanced intent and emotion recognition for applications such as mental‑health counseling and virtual‑idol interaction.
Conclusion
The modular duplex attention framework provides a general remedy for cross‑modal attention imbalance: it can replace the attention modules of existing large multimodal models and substantially strengthens multimodal fusion, leading to better perception, cognition, and emotion understanding.