Can One Model Master All Audio‑Visual Tasks? Introducing Crab’s Unified Approach
Researchers from RUC, Tsinghua, and Tencent present Crab, a unified audio‑visual scene understanding model that leverages explicit cooperation and a new AV‑UIE dataset with visible reasoning steps, achieving state‑of‑the‑art performance across temporal, spatial, pixel‑level, and spatio‑temporal tasks.
Motivation
Human perception integrates visual and auditory cues, yet most multimodal research tackles single tasks. Achieving a human‑like, general understanding of audio‑visual scenes is essential for progress toward artificial general intelligence (AGI). Current dominant paradigms—building large instruction‑tuned multimodal datasets and then jointly fine‑tuning a model—ignore the heterogeneity of multimodal data and the complex relationships among tasks, leading to interference when tasks are simply combined.
Proposed Paradigm
The authors introduce a new paradigm that explicitly encourages cooperation between tasks, addressing the problem from both data and model perspectives.
AV‑UIE: Instruction‑Tuned Dataset with Visible Reasoning
AV‑UIE augments existing audio‑visual datasets with step‑by‑step reasoning annotations that make temporal and spatial information explicit. The annotations are generated via in‑context learning with Gemini 1.5 Pro and then manually verified. The dataset spans nine tasks across four categories (temporal localization, spatial localization, pixel‑level understanding, and spatio‑temporal reasoning), totaling 200K training samples. By exposing explicit reasoning, the dataset guides the model to learn distinct capabilities that can be shared across tasks.
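To make the idea of "visible reasoning" concrete, here is a hypothetical illustration of what one annotated training sample might look like. The field names and contents are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical AV-UIE-style sample: a question paired with step-by-step
# reasoning that surfaces temporal and spatial evidence before the answer.
# All field names and values here are illustrative assumptions.
sample = {
    "task": "temporal_localization",
    "question": "When does the guitar start playing in the video?",
    "reasoning": [
        "Step 1: The audio track contains a guitar onset at around 3 seconds.",
        "Step 2: The visual frames show a person strumming from 3s onward.",
    ],
    "answer": "The guitar starts playing at about 3 seconds.",
}
```

Because the reasoning steps name both when (temporal) and where (spatial) the evidence appears, supervision of this form can push a single model to learn capabilities that transfer across the four task categories.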
Crab: Unified Audio‑Visual Scene Understanding Model
Crab implements a unified learning framework consisting of three multimodal interfaces (audio, visual, and segmentation‑mask) that feed into a large language model (LLM). The LLM is equipped with an Interaction‑aware LoRA (Low‑Rank Adaptation) module. This module contains a shared A matrix and multiple task‑specific B heads. A router dynamically assigns weights to the heads for each task, allowing the model to decouple abilities (e.g., temporal localization, spatial localization) and enable explicit inter‑task assistance.
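The shared-A / multi-B design described above can be sketched in a few lines. The following NumPy toy is a minimal sketch of the mechanism only; the shapes, initialization, and router form are assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class InteractionAwareLoRASketch:
    """Toy sketch of an interaction-aware LoRA layer: one shared low-rank
    A matrix, several task-specific B heads, and a router that dynamically
    weights the heads per input. Shapes and names are assumptions."""

    def __init__(self, d_model=16, rank=4, num_heads=3, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0, 0.02, (rank, d_model))            # shared down-projection
        self.B = rng.normal(0, 0.02, (num_heads, d_model, rank))  # per-head up-projections
        self.router = rng.normal(0, 0.02, (num_heads, d_model))   # router logits per head

    def forward(self, x):
        # x: (d_model,) token representation
        weights = softmax(self.router @ x)       # dynamic head weights, sum to 1
        low = self.A @ x                         # shared low-rank projection
        # Mix the task-specific up-projections according to the router.
        delta = sum(w * (B_h @ low) for w, B_h in zip(weights, self.B))
        return x + delta                         # residual LoRA-style update

lora = InteractionAwareLoRASketch()
y = lora.forward(np.ones(16))
```

Because every head reads from the same A projection, capabilities learned for one task (say, temporal localization) remain accessible to others, while the router lets each task lean on the B heads it finds most useful.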
Experiments and Analysis
Comprehensive experiments compare Crab with both multi‑task generalist models and single‑task specialist models across all four task categories. Crab consistently outperforms baselines, demonstrating superior generalization and task‑specific gains.
Table 1: Comparison with multi‑task generalist models (Crab achieves higher scores on all tasks).
Tables 2–5: Comparisons with specialist models for temporal localization, spatial localization, pixel‑level understanding, and spatio‑temporal reasoning respectively (Crab matches or exceeds specialist performance).
Table 6: Ablation study showing that naive multi‑task LoRA fine‑tuning harms performance, while the explicit cooperation paradigm mitigates interference and improves each task.
LoRA Head Visualization
Weight visualizations for three LoRA heads across tasks reveal clustering of similar tasks and distinct head preferences, confirming that each head captures a specific capability and that tasks can share heads to achieve mutual assistance.
Conclusion
The study demonstrates that a dataset with explicit reasoning and a model architecture featuring interaction‑aware LoRA can realize visible inter‑task cooperation, leading to a unified audio‑visual understanding system that surpasses both generalist and specialist baselines. Future work will explore new multimodal reasoning paradigms to further advance the field.
Paper: https://arxiv.org/abs/2503.13068
Project page: https://github.com/GeWu-Lab/Crab