Omni2Sound Overcomes the Multi-Modal Audio ‘Generalist’ Dilemma via Data Alignment
Omni2Sound tackles the long-standing “generalist” dilemma of unified audio generation by constructing a high-quality video-text-audio (V-T-A) dataset (SoundAtlas), employing a three-stage progressive training pipeline, and using a simple Diffusion Transformer backbone, ultimately achieving state-of-the-art performance on T2A, V2A, and VT2A tasks and strong robustness in off-screen scenarios.
Background
Audio generation has rapidly progressed, shifting from single‑condition control to multimodal collaborative control. Researchers aim for a unified model that can handle text‑to‑audio (T2A), video‑to‑audio (V2A), and video‑text‑to‑audio (VT2A) within a single architecture.
The Generalist Dilemma
Unified models often underperform specialist models on individual sub‑tasks because multimodal audio generation is not a simple linear fusion of visual and textual features. The dilemma stems from two major challenges:
Asymmetric cross‑modal information and dynamic routing. In a typical scene, such as a quiet student studying while a mosquito buzzes, the mosquito occupies a single pixel visually but dominates the audio spectrum. A video-only model would miss the buzzing entirely, so the system needs strong dynamic routing that lets text dictate the audio content while video supplies only timing cues.
Semantic conflict and off‑screen reasoning. When visual and textual inputs contradict (e.g., a calm coffee‑drinking scene paired with the instruction “a huge explosion outside”), naïve fusion leads to incoherent audio. The model must recognize the off‑screen scenario, suppress irrelevant visual cues, and rely on the textual command.
Root Causes in Data and Training
Two fundamental problems cause the dilemma:
Data misalignment. Audio is inherently ambiguous; different visual events can share similar acoustic signatures (e.g., sizzling meat vs. heavy rain). Early audio-language datasets relied on automatically generated captions, leading to high hallucination rates and severe video-audio-text mismatches.
Task competition. Joint optimization of T2A and V2A creates cross‑task competition, where improving one task often degrades the other, and intra‑task modality bias pushes the model to over‑rely on either text or video.
Omni2Sound’s Breakthrough Strategy
Omni2Sound adopts a “Less is More” philosophy: instead of complex custom networks, it uses a vanilla Diffusion Transformer backbone and focuses on three pillars—high‑quality data, progressive multi‑task training, and comprehensive evaluation.
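To make the “vanilla backbone” point concrete, here is a minimal, self-contained sketch of a transformer denoiser that conditions on text and video tokens by simple sequence concatenation. Every module, dimension, and the concatenation scheme below are illustrative assumptions, not the paper’s actual architecture.

```python
# Minimal sketch (PyTorch): a plain Transformer denoising audio latents conditioned
# on text and video tokens. All names, dimensions, and the token-concatenation
# scheme are illustrative assumptions, not Omni2Sound's published architecture.
import torch
import torch.nn as nn

class TinyAudioDiT(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        self.time_emb = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.out = nn.Linear(dim, dim)

    def forward(self, audio_latents, text_tokens, video_tokens, t):
        # Condition by prepending a timestep token plus text/video tokens to the noisy latents.
        t_tok = self.time_emb(t.view(-1, 1, 1).float())            # (B, 1, dim)
        seq = torch.cat([t_tok, text_tokens, video_tokens, audio_latents], dim=1)
        seq = self.backbone(seq)
        # Decode only the positions that correspond to the audio latents.
        n_cond = 1 + text_tokens.size(1) + video_tokens.size(1)
        return self.out(seq[:, n_cond:, :])

# Smoke test with random tensors.
model = TinyAudioDiT()
eps_hat = model(torch.randn(2, 32, 256),   # noisy audio latents
                torch.randn(2, 8, 256),    # text condition tokens
                torch.randn(2, 16, 256),   # video condition tokens
                torch.tensor([0.3, 0.7]))  # diffusion timesteps
print(eps_hat.shape)  # torch.Size([2, 32, 256])
```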
1. High‑Quality V‑T‑A Data (SoundAtlas)
To resolve semantic misalignment, the team built SoundAtlas, a V-T-A dataset of roughly 470k pairs. They replaced raw video input with a vision-to-language compression step, using a visual model (e.g., Qwen-2.5-VL) to generate concise textual descriptions and dramatically reduce token cost. An agentic pipeline pairing a lightweight “junior” model with a stronger “senior” model performs multi-round annotation, cutting data-generation cost by roughly 5× while achieving alignment quality that surpasses human-expert annotation.
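A junior/senior agentic annotation loop could be organized along the following lines; the function names, acceptance logic, and round limit are hypothetical stand-ins, since the actual SoundAtlas prompts and models are not reproduced here.

```python
# Illustrative junior/senior agentic captioning loop for building V-T-A pairs.
# `junior_caption` and `senior_review` are hypothetical callables standing in for
# the lightweight and stronger models; the real SoundAtlas pipeline is not shown.
def annotate_clip(video_frames, audio_clip, junior_caption, senior_review, max_rounds=3):
    """Produce an audio-focused caption, escalating to the senior reviewer each round."""
    # The junior model first compresses vision + audio into a cheap textual draft.
    draft = junior_caption(video_frames, audio_clip, feedback="")
    for _ in range(max_rounds):
        accepted, feedback = senior_review(draft, audio_clip)
        if accepted:
            return draft
        # The junior model revises its caption using the senior model's critique.
        draft = junior_caption(video_frames, audio_clip, feedback=feedback)
    return draft  # fall back to the last revision if no round was accepted

# Dummy usage with stub models.
caption = annotate_clip(
    video_frames=None, audio_clip=None,
    junior_caption=lambda v, a, feedback="": "rain hammering a tin roof",
    senior_review=lambda draft, a: (True, ""),
)
```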
2. Three‑Stage Progressive Training
Stage 1 – Large‑scale T2A pre‑training. The model first learns robust audio generation priors from massive text‑audio data, preventing catastrophic forgetting when later tasks are introduced.
Stage 2 – Interleaved multi‑task training. Task‑balanced sampling avoids gradient conflicts between T2A and V2A. High‑quality VT2A data acts as a semantic bridge, aligning visual and textual features and mitigating cross‑task competition.
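As a rough sketch of what task-balanced sampling could look like in practice, the snippet below draws each training batch from one of the three tasks according to fixed weights; the weights are placeholders, not the paper’s actual ratios.

```python
# Illustrative task-balanced sampler for interleaved T2A / V2A / VT2A training.
# The weights are placeholder values, not Omni2Sound's actual sampling ratios.
import itertools
import random

TASK_WEIGHTS = {"t2a": 0.4, "v2a": 0.4, "vt2a": 0.2}

def next_training_batch(batch_iterators, rng=random):
    """Pick a task by weight, then draw the next batch from that task's iterator."""
    tasks, weights = zip(*TASK_WEIGHTS.items())
    task = rng.choices(tasks, weights=weights, k=1)[0]
    return task, next(batch_iterators[task])

# Dummy usage with infinite stub iterators standing in for real dataloaders.
iterators = {t: itertools.cycle([f"{t}-batch"]) for t in TASK_WEIGHTS}
print(next_training_batch(iterators))
```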
Stage 3 – Decoupled robustness training. Text dropout forces reliance on visual cues, while off‑screen synthesis introduces samples without visible sound sources, strengthening the model’s ability to follow textual instructions in challenging scenarios.
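One plausible way to realize the text dropout and off-screen synthesis of Stage 3 is a per-sample augmentation like the sketch below; the probabilities, the off-screen sample bank, and the naive mixing helper are all assumptions for illustration.

```python
# Illustrative Stage-3-style augmentation: randomly drop the text condition so the
# model must rely on video, and occasionally mix in an "off-screen" event whose
# source is not visible. Probabilities and the mixing helper are hypothetical.
import random

def mix_audio(a, b, gain=0.5):
    """Naive additive mix of two equal-length waveforms (plain lists of samples)."""
    return [x + gain * y for x, y in zip(a, b)]

def robustness_augment(sample, offscreen_bank=None, p_text_drop=0.2, p_offscreen=0.1,
                       rng=random):
    text, video, audio = sample["text"], sample["video"], sample["audio"]
    if rng.random() < p_text_drop:
        text = ""  # force the model to ground generation in the video alone
    if offscreen_bank and rng.random() < p_offscreen:
        extra = rng.choice(offscreen_bank)           # e.g. an explosion heard off-screen
        text = (text + " " + extra["text"]).strip()  # the instruction names the unseen event
        audio = mix_audio(audio, extra["audio"])     # target audio contains the unseen event
    return {"text": text, "video": video, "audio": audio}
```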
3. Comprehensive Benchmark (VGGSound‑Omni)
Existing benchmarks lack fine-grained multimodal annotations. Omni2Sound therefore introduces VGGSound-Omni, a panoramic benchmark covering T2A, V2A, and VT2A, with a dedicated off-screen track that includes videos with no visible sound source and a synthetic background-music subset. This enables objective evaluation of both audio quality and text fidelity under visual scarcity.
Experimental Results
Without any architectural tricks, Omni2Sound outperforms both specialist and prior unified models on all three tasks in the VGGSound‑Omni benchmark, achieving superior scores in distribution matching (KL/FD/FAD), audio quality (PQ/IS), and modality alignment (DS/IB/MS‑CLAP). Human blind tests also rank it highest. Moreover, on the third‑party Kling‑Audio‑Eval suite, the model maintains strong generalization across diverse video and subtitle styles.
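For readers unfamiliar with the distribution-matching metrics, FD/FAD-style scores are Fréchet distances between Gaussians fit to embeddings of real and generated audio. The snippet below shows that standard computation over precomputed embeddings; it is background context, not code released with Omni2Sound.

```python
# Standard Fréchet-distance computation over precomputed audio embeddings,
# as used by FAD/FD-style metrics. Background context only.
import numpy as np
from scipy import linalg

def frechet_distance(real_emb, gen_emb):
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real  # numerical noise can introduce tiny imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))

# Example with random embeddings standing in for real encoder features.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(200, 16)), rng.normal(size=(200, 16))))
```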
Conclusion
Omni2Sound demonstrates that the bottleneck for unified multimodal audio generation lies in data alignment and coarse task scheduling rather than model complexity. By supplying high‑quality aligned data and a carefully staged training regime, a simple Diffusion Transformer can break the “generalist” dilemma and set a new standard for future multimodal generative research.