Can One Model Master All Audio‑Visual Tasks? Introducing Crab’s Unified Approach

This article presents Crab, a unified audio‑visual scene understanding model built on a novel explicit‑cooperation learning paradigm. It introduces the AV‑UIE dataset, which adds explicit reasoning steps to existing labels, and reports strong performance across temporal, spatial, pixel‑level, and spatio‑temporal tasks, backed by extensive experiments and ablations.

Data Party THU

Overview

The paper proposes Crab, a unified audio‑visual scene understanding model that achieves explicit cooperation among multiple tasks by redesigning both the training data and the model architecture.

Motivation

Humans perceive the world through integrated visual and auditory cues, enabling a general understanding of complex scenes. Existing research typically tackles a single modality or task, limiting the ability to develop a model with human‑like general perception, which is crucial for progress toward AGI.

Current Paradigm and Its Limitations

Modern multimodal large language models rely on massive instruction‑tuning datasets and joint training of diverse tasks. This approach overlooks the heterogeneity of multimodal data and the intricate relationships between tasks, often causing negative interference, especially for disparate audio‑visual tasks.

Proposed Explicit‑Cooperation Learning Paradigm

Data side: The authors construct AV‑UIE, a novel instruction‑tuning dataset that enriches existing labels with explicit reasoning steps containing temporal and spatial information. Annotation is performed with Gemini 1.5 Pro and subsequently verified by human experts, ensuring high‑quality, reasoning‑aware labels.

Model side: An Interaction‑aware LoRA structure is introduced. It consists of a shared A matrix and multiple task‑specific B heads, each learning a different aspect of multimodal interaction. A router assigns task‑dependent weights to the heads, making inter‑task cooperation explicit.
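
The structure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the class name `InteractionAwareLoRA`, the linear softmax router driven by the input features, the zero initialization of the B heads, and all dimensions are assumptions made for the example.

```python
import numpy as np


class InteractionAwareLoRA:
    """Toy layer: frozen weight W, one shared A matrix, several B heads, a router."""

    def __init__(self, d_in, d_out, rank=4, num_heads=3, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for a frozen pretrained projection of the backbone.
        self.W = rng.normal(scale=0.02, size=(d_out, d_in))
        # Shared down-projection A, used by every head.
        self.A = rng.normal(scale=0.02, size=(rank, d_in))
        # One up-projection B per head; zero init means the layer starts equal to W.
        self.B = np.zeros((num_heads, d_out, rank))
        # Assumed router: a linear map from the input to one logit per head.
        self.router = rng.normal(scale=0.02, size=(num_heads, d_in))

    def head_weights(self, x):
        """Softmax over head logits, one weight vector per input row."""
        logits = x @ self.router.T                       # (batch, num_heads)
        logits = logits - logits.max(axis=-1, keepdims=True)
        exp = np.exp(logits)
        return exp / exp.sum(axis=-1, keepdims=True)

    def forward(self, x):
        w = self.head_weights(x)                         # (batch, num_heads)
        z = x @ self.A.T                                 # (batch, rank), shared
        per_head = np.einsum("br,hor->bho", z, self.B)   # (batch, num_heads, d_out)
        delta = np.einsum("bh,bho->bo", w, per_head)     # router-weighted mix of heads
        return x @ self.W.T + delta


layer = InteractionAwareLoRA(d_in=8, d_out=6)
x = np.ones((2, 8))
print(layer.forward(x).shape)
print(layer.head_weights(x).sum(axis=-1))
```

Sharing A keeps the low-rank input projection common to all tasks, while the router's weights over the B heads are what make the division of labor between heads inspectable.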

Dataset Statistics

AV‑UIE covers nine tasks with a total of 200 K training samples: temporal localization (6.8 %), spatial localization (25.8 %), pixel‑level understanding (41.6 %), and spatio‑temporal reasoning (25.8 %). The dataset provides detailed reasoning annotations that bridge tasks.
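
As a quick sanity check on these figures, the snippet below converts the stated shares into approximate sample counts. It assumes "200 K" means a round 200,000 and that the percentages are rounded to one decimal place, so the resulting counts are estimates rather than numbers reported by the authors.

```python
TOTAL_SAMPLES = 200_000  # "200 K" from the text, treated as a round figure

# Shares of the training set per task category, as stated above.
shares_percent = {
    "temporal localization": 6.8,
    "spatial localization": 25.8,
    "pixel-level understanding": 41.6,
    "spatio-temporal reasoning": 25.8,
}


def approximate_counts(total, shares):
    """Turn percentage shares into rounded sample counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


counts = approximate_counts(TOTAL_SAMPLES, shares_percent)
for name, n in counts.items():
    print(f"{name}: ~{n:,} samples")
print(f"share total: {sum(shares_percent.values()):.1f}%")
```

The four shares add up to 100 %, and pixel‑level understanding alone accounts for roughly 83,200 of the estimated samples, more than the temporal and spatial localization categories combined.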

Model Architecture

Crab features three unified multimodal interfaces for audio, visual, and segmentation‑mask inputs, feeding into a large backbone equipped with the Interaction‑aware LoRA. This design decouples capabilities into distinct heads while sharing core knowledge.

Crab model overall architecture

Experiments and Analysis

Crab is compared against generic multitask models and task‑specific expert models across all four task categories. It outperforms the baselines on most metrics, which points to strong generalization.

Ablation studies show that naïve multitask LoRA fine‑tuning can degrade performance, whereas the explicit‑cooperation paradigm mitigates this interference and improves results on each task.

Visualization of LoRA head weights shows that similar tasks cluster around the same heads, confirming that the model learns distinct yet shared abilities for explicit cooperation.

Visualization of LoRA head weights
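
One simple way to quantify this kind of clustering is to average the router's head weights per task and compare the resulting profiles with cosine similarity. The sketch below is an assumption about how such an analysis could be done, not the authors' procedure; the helper names and the placeholder weights are hypothetical and carry no experimental meaning.

```python
from collections import defaultdict
from math import sqrt


def mean_head_weights(router_weights, task_labels):
    """Average the per-sample head weights within each task."""
    sums, counts = {}, defaultdict(int)
    for weights, task in zip(router_weights, task_labels):
        if task not in sums:
            sums[task] = [0.0] * len(weights)
        sums[task] = [s + w for s, w in zip(sums[task], weights)]
        counts[task] += 1
    return {task: [s / counts[task] for s in sums[task]] for task in sums}


def cosine(u, v):
    """Cosine similarity between two head-weight profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


# Placeholder router outputs for four samples over three heads (not real data).
router_weights = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.10, 0.20, 0.70],
    [0.65, 0.25, 0.10],
]
task_labels = ["task_a", "task_a", "task_b", "task_c"]

profiles = mean_head_weights(router_weights, task_labels)
print(cosine(profiles["task_a"], profiles["task_c"]))
print(cosine(profiles["task_a"], profiles["task_b"]))
```

Tasks whose profiles have a cosine similarity close to 1 rely on the same heads, which is the pattern the visualization is meant to show.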

Conclusion

By integrating reasoning‑rich data and a modular Interaction‑aware LoRA, Crab demonstrates effective explicit cooperation among tasks, advancing the state of unified audio‑visual scene understanding and offering a promising direction for future multimodal reasoning research.
