Can One Model Master All Audio‑Visual Tasks? Introducing Crab’s Unified Approach

Researchers from RUC, Tsinghua, and Tencent present Crab, a unified audio‑visual scene understanding model that leverages explicit cooperation and a new AV‑UIE dataset with visible reasoning steps, achieving state‑of‑the‑art performance across temporal, spatial, pixel‑level, and spatio‑temporal tasks.


Motivation

Human perception integrates visual and auditory cues, yet most multimodal research tackles single tasks. Achieving a human‑like, general understanding of audio‑visual scenes is essential for progress toward artificial general intelligence (AGI). Current dominant paradigms—building large instruction‑tuned multimodal datasets and then jointly fine‑tuning a model—ignore the heterogeneity of multimodal data and the complex relationships among tasks, leading to interference when tasks are simply combined.

Proposed Paradigm

The authors introduce a new paradigm that explicitly encourages cooperation between tasks, addressing the problem from both data and model perspectives.

AV‑UIE: Instruction‑Tuned Dataset with Visible Reasoning

AV‑UIE augments existing audio‑visual datasets by adding step‑by‑step reasoning annotations that contain explicit temporal and spatial information. The annotations are generated via in‑context learning with Gemini 1.5 Pro and then manually verified. The dataset covers nine tasks spanning four categories—temporal localization, spatial localization, pixel‑level understanding, and spatio‑temporal reasoning—and totals 200K training samples. By making the reasoning explicit, the dataset guides the model to learn distinct capabilities that can be shared across tasks.
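
To make the annotation format concrete, the snippet below sketches what a single AV‑UIE‑style record with visible reasoning might look like. The field names and values are illustrative assumptions, not the dataset's actual schema.

```python
# Illustrative AV-UIE-style training record. Field names and values are
# assumptions for illustration, not the dataset's actual schema.
sample = {
    "task": "temporal_localization",
    "video": "clips/concert_0042.mp4",
    "audio": "clips/concert_0042.wav",
    "instruction": "When does the violin start playing?",
    "reasoning": [  # explicit step-by-step annotation with spatial/temporal cues
        "Step 1: The violin is visible on the left side of the frame.",
        "Step 2: Its sound first appears at around the 3-second mark.",
    ],
    "answer": "The violin starts playing at about 3 seconds.",
}
```

Because the intermediate steps name concrete spatial and temporal evidence, other tasks (e.g., spatial localization) can reuse the same kind of supervision signal.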

Figure: AV‑UIE construction pipeline and statistics

Crab: Unified Audio‑Visual Scene Understanding Model

Crab implements a unified learning framework in which three multimodal interfaces—audio, visual, and segmentation mask—feed into a large language model (LLM). The LLM is equipped with an interaction‑aware LoRA (Low‑Rank Adaptation) module that contains a shared A matrix and multiple task‑specific B heads; a router dynamically assigns weights to the heads for each input, allowing the model to decouple abilities (e.g., temporal localization, spatial localization) while enabling explicit inter‑task assistance.
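
The following PyTorch sketch illustrates the general idea of such an interaction‑aware LoRA layer: a shared low‑rank A projection, several B heads, and a router that mixes the heads. Dimensions, per‑token routing granularity, scaling, and initialization are assumptions made for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionAwareLoRA(nn.Module):
    """Minimal sketch: one shared low-rank A matrix, multiple B heads,
    and a router that mixes the heads per token. Hyperparameters and
    routing granularity are illustrative assumptions."""

    def __init__(self, d_in, d_out, rank=8, num_heads=3, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)  # frozen pretrained weight
        self.base.weight.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)       # shared down-projection
        self.B = nn.Parameter(torch.zeros(num_heads, d_out, rank))  # task-specific up-projections
        self.router = nn.Linear(d_in, num_heads)                    # per-token head weights
        self.scale = alpha / rank

    def forward(self, x):                                  # x: (batch, seq, d_in)
        h = F.linear(x, self.A)                            # (batch, seq, rank)
        heads = torch.einsum("bsr,hor->bsho", h, self.B)   # per-head low-rank updates
        w = F.softmax(self.router(x), dim=-1)              # (batch, seq, num_heads)
        delta = torch.einsum("bsh,bsho->bso", w, heads)    # router-weighted mixture
        return self.base(x) + self.scale * delta
```

Because the A matrix is shared while the B heads specialize, a head learned primarily on one task remains available to any other task the router directs toward it—the mechanism behind the explicit inter‑task assistance described above.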

Figure: Crab overall architecture

Experiments and Analysis

Comprehensive experiments compare Crab with both multi‑task generalist models and single‑task specialist models across all four task categories. Crab consistently outperforms baselines, demonstrating superior generalization and task‑specific gains.

Table 1: Comparison with multi‑task generalist models (Crab achieves higher scores on all tasks).

Tables 2–5: Comparisons with specialist models for temporal localization, spatial localization, pixel‑level understanding, and spatio‑temporal reasoning respectively (Crab matches or exceeds specialist performance).

Table 6: Ablation study showing that naive multi‑task LoRA fine‑tuning harms performance, while the explicit cooperation paradigm mitigates interference and improves each task.

Figure: results comparison with generalist models
Figure: temporal localization results
Figure: spatial localization results
Figure: pixel‑level understanding results
Figure: spatio‑temporal reasoning results
Figure: ablation study results

LoRA Head Visualization

Visualizing the router weights of the three LoRA heads across tasks shows that similar tasks cluster together and that different tasks prefer different heads, confirming that each head captures a specific capability and that tasks sharing heads can assist one another.

Figure: LoRA head weight visualization
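
As a rough illustration of how such a plot can be produced, the sketch below renders mean routing weight per task as a heatmap. The numbers are invented placeholders for illustration only, not the paper's measurements.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical averaged router weights (rows: tasks, columns: LoRA heads).
# These values are placeholders, NOT results from the paper.
tasks = ["temporal loc.", "spatial loc.", "pixel-level", "spatio-temporal"]
weights = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.7, 0.2],
    [0.3, 0.2, 0.5],
])

fig, ax = plt.subplots(figsize=(5, 3))
im = ax.imshow(weights, cmap="viridis", vmin=0, vmax=1)
ax.set_xticks(range(weights.shape[1]), [f"head {i}" for i in range(weights.shape[1])])
ax.set_yticks(range(len(tasks)), tasks)
ax.set_xlabel("LoRA head")
fig.colorbar(im, label="mean routing weight")
fig.tight_layout()
plt.show()
```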

Conclusion

The study demonstrates that a dataset with explicit reasoning and a model architecture featuring interaction‑aware LoRA can realize visible inter‑task cooperation, leading to a unified audio‑visual understanding system that surpasses both generalist and specialist baselines. Future work will explore new multimodal reasoning paradigms to further advance the field.

Paper: https://arxiv.org/abs/2503.13068

Project page: https://github.com/GeWu-Lab/Crab
