Audio-Omni: A Unified Multimodal Model for Understanding, Generating, and Editing Audio Across Sound, Music, and Speech

Audio-Omni, a unified multimodal audio model presented at SIGGRAPH 2026, combines a frozen large multimodal language model with a trainable diffusion generator to achieve state‑of‑the‑art understanding, generation, and instruction‑based editing across general sounds, music, and speech, leveraging a million‑scale AudioEdit dataset and a hybrid conditioning architecture.


Audio-Omni is the first unified framework that simultaneously supports understanding, generation, and editing for three audio domains—general sounds, music, and speech. The system merges a frozen multimodal large language model (Qwen2.5‑Omni‑3B) with a trainable diffusion generator (DiT) to inherit world knowledge while providing high‑fidelity synthesis.

Unified Capabilities

In standard generation tasks, Audio‑Omni reaches state‑of‑the‑art performance on multiple benchmarks and supports diverse modality controls:

Text‑to‑Audio (T2A): Prompt "A telephone dials twice, followed by the sound of glass shattering." produces the described sequence.

Text‑to‑Music (T2M): Prompt "Compose a bright jazz swing instrumental with walking bass, brushed drums, and a lively horn melody." yields a coherent jazz piece.

Video‑to‑Audio (V2A) and Video‑to‑Music (V2M): The model automatically dubs and scores video clips.

Text‑to‑Speech (TTS): Prompt "The alchemist erased the circle in the sand, and the snake slithered away among the rocks." generates expressive speech.

Instruction‑Level Audio Editing

Audio‑Omni can edit audio with simple textual commands. Examples include:

Add: Prompt "Add the sound of 'skateboarding' to the input audio." seamlessly mixes a skateboard sound into the original scene.

Remove: Prompt "Remove the sound of 'female singing' from the input audio." isolates and eliminates the target source.

Extract: Prompt "Extract the sound of 'ambulance siren' from the input audio." pulls a specific source out of a mixture.

Style Transfer: Prompt "Change the sound of 'dog barking' to 'hammering'." transforms the timbre while preserving rhythm and pitch.
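
These commands follow a small set of recurring phrasings, so editing prompts can be templated when constructing batches. The snippet below is a hypothetical helper written for illustration; its names (`EDIT_TEMPLATES`, `build_edit_instruction`) are not part of any released interface.

```python
# Hypothetical templates mirroring the four editing operations shown above.
EDIT_TEMPLATES = {
    "add":     "Add the sound of '{source}' to the input audio.",
    "remove":  "Remove the sound of '{source}' from the input audio.",
    "extract": "Extract the sound of '{source}' from the input audio.",
    "style":   "Change the sound of '{source}' to '{target}'.",
}

def build_edit_instruction(op: str, source: str, target: str = "") -> str:
    """Render a natural-language editing command for one audio clip."""
    return EDIT_TEMPLATES[op].format(source=source, target=target)

print(build_edit_instruction("add", "skateboarding"))
print(build_edit_instruction("style", "dog barking", target="hammering"))
```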

Inherited Capabilities from the MLLM

Because the frozen MLLM retains extensive world knowledge, Audio‑Omni exhibits abilities that typical audio models lack:

Knowledge‑Augmented Generation: When prompted with "the instrument John Bonham of Led Zeppelin played", the model infers "drums" and synthesizes a characteristic drum pattern.

In‑Context Generation: Given a short piano recording and the instruction "generate a continuously tension‑building film score", the model extracts the piano timbre and composes a new melody.

Cross‑Lingual Control: Although trained primarily on English commands, the model responds equally well to Chinese, French, German, or Japanese prompts.

Hybrid Conditioning Architecture

The core design solves the challenge of handling heterogeneous control signals:

High‑Level Semantic Stream: Multimodal features and transcribed text from the frozen MLLM are injected into the diffusion model via cross‑attention, providing global semantic guidance.

Low‑Level Signal Stream: Mel‑spectrogram features and video‑synchronization cues are concatenated channel‑wise with the noise latent, delivering fine‑grained temporal alignment.

This "macro‑by‑attention, micro‑by‑concatenation" strategy enables the model to generate high‑fidelity audio while obeying precise editing constraints.

Data Construction – AudioEdit

To overcome the scarcity of large‑scale instruction‑based audio editing data, the authors built the AudioEdit dataset with over one million high‑quality pairs. Two parallel pipelines were used:

Real Data Branch: Video clips from VGGSound were processed with Gemini 2.5 Pro for sound‑source identification and SAM‑Audio for separation, followed by multi‑stage VAD and CLAP filtering to obtain clean "original‑edited" pairs.

Synthetic Data Branch: The Scaper toolkit randomly mixes foreground and background sounds with varied pitch, duration, and signal‑to‑noise ratios to generate precisely labeled synthetic pairs (a brief sketch follows below).

The combined dataset supplies the diverse instruction signals required for robust editing.
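
A minimal Scaper sketch of the randomized mixing used in the synthetic branch is given below; the folder paths, labels, and distribution parameters are illustrative placeholders, not the authors' actual configuration.

```python
import scaper

# Placeholder folders of foreground / background clips, organized by label.
fg_path, bg_path = "audioedit/foreground", "audioedit/background"

sc = scaper.Scaper(duration=10.0, fg_path=fg_path, bg_path=bg_path)
sc.ref_db = -20  # loudness reference for the background

# Background scene drawn at random from the available labels.
sc.add_background(label=("choose", []),
                  source_file=("choose", []),
                  source_time=("const", 0))

# Foreground event with randomized timing, duration, pitch, and SNR; the sampled
# parameters give precisely labeled "original vs. edited" pairs for free.
sc.add_event(label=("choose", []),
             source_file=("choose", []),
             source_time=("const", 0),
             event_time=("uniform", 0, 8),
             event_duration=("truncnorm", 3.0, 1.0, 0.5, 6.0),
             snr=("uniform", 0, 15),
             pitch_shift=("uniform", -2, 2),
             time_stretch=("uniform", 0.9, 1.1))

# Writes the mixture plus a JAMS annotation recording every sampled value.
sc.generate("mixture.wav", "mixture.jams",
            allow_repeated_label=True,
            allow_repeated_source=True)
```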

Ablation Insight – Penultimate Layer Features

Experiments comparing different feature extraction points from the MLLM revealed that using the penultimate layer (‑2) yields significantly better audio generation quality than the last layer (‑1) or complex query mechanisms. The authors attribute this to the last layer being overly specialized for next‑token prediction, discarding acoustic detail, whereas the penultimate layer retains both high‑level semantics and rich low‑level information.
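
Reproducing the extraction point is straightforward with the standard transformers API; the sketch below uses a placeholder causal‑LM checkpoint simply to contrast the last and penultimate hidden states.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any decoder-style (M)LLM exposes hidden states the same way.
name = "Qwen/Qwen2.5-0.5B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("A telephone dials twice, followed by glass shattering.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] holds the input embeddings; [-1] is the final layer's output.
last = out.hidden_states[-1]         # specialized toward next-token prediction
penultimate = out.hidden_states[-2]  # the layer the ablation favors for conditioning
print(last.shape, penultimate.shape)  # both: (1, seq_len, hidden_dim)
```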

Open‑Source Release and Impact

Audio‑Omni’s code and model weights are released on GitHub and Hugging Face. Since release, the model has consistently ranked in the Top 5 of the Hugging Face multimodal (Any‑to‑Any) leaderboard, attracting broad community attention.

Conclusion

Audio‑Omni demonstrates that a single unified framework can bridge the gaps between audio understanding, generation, and editing across diverse domains, offering strong reasoning, zero‑shot control, and cross‑lingual capabilities that point toward the future of universal generative audio intelligence.

MLLM · Audio-Omni · AudioEdit · Diffusion Generation · Multimodal Audio · Zero-shot Audio Editing
Written by Machine Heart, a professional AI media and industry service platform.
