Can One Model Master All Audio‑Visual Tasks? Introducing Crab’s Unified Approach
This article presents Crab, a unified audio‑visual scene understanding model built on a novel explicit‑cooperation learning paradigm. It introduces the AV‑UIE dataset, whose annotations include explicit reasoning steps, and reports strong results on temporal, spatial, pixel‑level, and spatio‑temporal tasks, backed by extensive experiments and ablations.
Overview
The paper proposes Crab, a unified audio‑visual scene understanding model that achieves explicit cooperation among multiple tasks by redesigning both the training data and the model architecture.
Motivation
Humans perceive the world through integrated visual and auditory cues, enabling a general understanding of complex scenes. Existing research typically tackles a single modality or task, limiting the ability to develop a model with human‑like general perception, which is crucial for progress toward AGI.
Current Paradigm and Its Limitations
Modern multimodal large language models rely on massive instruction‑tuning datasets and joint training on diverse tasks. This approach overlooks the heterogeneity of multimodal data and the intricate relationships between tasks, and it often causes negative interference between tasks, especially when the audio‑visual tasks are very different from one another.
Proposed Explicit‑Cooperation Learning Paradigm
Data side: The authors construct AV‑UIE, a novel instruction‑tuning dataset that enriches existing labels with explicit reasoning steps containing temporal and spatial information. Annotations are generated with Gemini 1.5 Pro and then verified by human experts, with the aim of producing high‑quality, reasoning‑aware labels.
Model side: An Interaction‑aware LoRA structure is introduced. It consists of a shared A matrix and multiple task‑specific B heads, each learning a different aspect of multimodal interaction. A router assigns task‑dependent weights to the heads, making inter‑task cooperation explicit.
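The structure described above (one shared A matrix, several B heads, and a router that weights the heads) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name, the dimensions, the zero initialization of the B heads (borrowed from standard LoRA), and the softmax router driven by the input features are all hypothetical choices made for the example.

```python
import numpy as np


class InteractionAwareLoRA:
    """Illustrative multi-head LoRA layer: shared A, several B heads, a router.

    All names, sizes, and initializations here are assumptions for the sketch.
    """

    def __init__(self, d_in, d_out, rank=4, num_heads=3, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen base weight of the backbone layer (random stand-in).
        self.W = rng.normal(0.0, 0.02, size=(d_out, d_in))
        # Shared down-projection A, used by every head.
        self.A = rng.normal(0.0, 0.02, size=(rank, d_in))
        # One up-projection B per head; zero init so the update starts at zero.
        self.B = np.zeros((num_heads, d_out, rank))
        # Router: a linear map from input features to one logit per head (assumption).
        self.router = rng.normal(0.0, 0.02, size=(num_heads, d_in))

    def head_weights(self, x):
        """Softmax weights over heads for each input row, shape (batch, heads)."""
        logits = x @ self.router.T
        logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
        exp = np.exp(logits)
        return exp / exp.sum(axis=-1, keepdims=True)

    def forward(self, x):
        """Base output plus the router-weighted sum of the per-head LoRA updates."""
        z = x @ self.A.T                                 # (batch, rank), shared
        per_head = np.einsum("hor,br->bho", self.B, z)   # (batch, heads, d_out)
        w = self.head_weights(x)                         # (batch, heads)
        delta = np.einsum("bh,bho->bo", w, per_head)     # (batch, d_out)
        return x @ self.W.T + delta


layer = InteractionAwareLoRA(d_in=8, d_out=6)
x = np.ones((2, 8))
weights = layer.head_weights(x)
y = layer.forward(x)
print(weights.shape, y.shape)
```

In this sketch the cooperation between tasks shows up in `weights`: two inputs that receive similar head weights are relying on the same B heads, while the shared A matrix carries what all heads have in common.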
Dataset Statistics
AV‑UIE covers nine tasks with a total of 200K training samples, grouped into four categories: temporal localization (6.8 %), spatial localization (25.8 %), pixel‑level understanding (41.6 %), and spatio‑temporal reasoning (25.8 %). The dataset provides detailed reasoning annotations that bridge tasks.
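To make the proportions concrete, the short calculation below converts the stated percentages into approximate sample counts. It assumes that "200 K" means exactly 200,000 samples, which the text does not confirm, so the counts should be read as rough estimates rather than figures from the paper.

```python
# Assumption: "200 K" is treated as exactly 200,000 samples.
TOTAL_SAMPLES = 200_000

# Category shares as stated in the text, in percent.
CATEGORY_SHARE_PERCENT = {
    "temporal localization": 6.8,
    "spatial localization": 25.8,
    "pixel-level understanding": 41.6,
    "spatio-temporal reasoning": 25.8,
}


def approximate_counts(total, shares):
    """Turn percentage shares into rounded sample counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


# Sanity check: the stated shares should add up to 100 %.
assert abs(sum(CATEGORY_SHARE_PERCENT.values()) - 100.0) < 1e-9

counts = approximate_counts(TOTAL_SAMPLES, CATEGORY_SHARE_PERCENT)
for name, n in counts.items():
    print(f"{name:28s} {n:>8,d}")
```

Under that assumption, pixel‑level understanding has roughly six times as many samples as temporal localization.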
Model Architecture
Crab features three unified multimodal interfaces for audio, visual, and segmentation‑mask inputs, feeding into a large backbone equipped with the Interaction‑aware LoRA. This design decouples capabilities into distinct heads while sharing core knowledge.
Experiments and Analysis
Crab is compared against generic multitask models and task‑specific expert models across all four task categories. It outperforms the baselines on most metrics, which the article presents as evidence of stronger generalization.
Ablation studies reveal that naïve multitask LoRA fine‑tuning can degrade performance, whereas the explicit‑cooperation paradigm mitigates interference and improves results on each task.
Visualization of LoRA head weights shows that similar tasks cluster around the same heads, confirming that the model learns distinct yet shared abilities for explicit cooperation.
Conclusion
By integrating reasoning‑rich data and a modular Interaction‑aware LoRA, Crab demonstrates effective explicit cooperation among tasks, advancing the state of unified audio‑visual scene understanding and offering a promising direction for future multimodal reasoning research.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
