NeurIPS 2025‑Selected Multi‑Stream Control Framework Achieves Precise Audio‑Visual Sync via Audio Demixing

The paper introduces a NeurIPS 2025‑selected multi‑stream video generation framework that demixes the input audio into speech, sound effects, and music, routes each track through a dedicated control stream, and applies a multi‑stage training strategy, achieving markedly better lip‑sync, event timing, and overall visual quality than prior methods.


Existing audio‑driven video generation methods treat the input audio as a single holistic condition, which blurs the correspondence between audio components and visual elements and makes precise lip‑sync, event timing, and global visual atmosphere control difficult.

Multi‑Stream Temporal Control Framework (MTV)

The framework first demixes the input audio into three separate tracks—speech, sound effects, and music. Each track drives a distinct visual generation sub‑task: speech controls lip movements, effects control event sequencing, and music controls overall visual mood.
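The routing step can be pictured as a small front end that separates the waveform and encodes each stem independently. The sketch below is illustrative only: `separate_tracks` stands in for whatever source‑separation model is used, and the encoder shapes are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

def separate_tracks(waveform: torch.Tensor) -> dict[str, torch.Tensor]:
    """Placeholder for a three-stem source-separation front end that returns
    {'speech': ..., 'effects': ..., 'music': ...} waveforms or spectrograms."""
    raise NotImplementedError("plug in a source-separation model here")

class TrackEncoders(nn.Module):
    """One lightweight encoder per demixed track; each track later conditions a
    different visual sub-task (lip motion, event timing, visual mood)."""
    def __init__(self, in_dim: int = 128, dim: int = 512):
        super().__init__()
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(in_dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for name in ("speech", "effects", "music")
        })

    def forward(self, track_feats: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # track_feats[name]: (batch, time, in_dim), e.g. mel-spectrogram frames
        return {name: enc(track_feats[name]) for name, enc in self.encoders.items()}
```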

MTV incorporates a Multi‑Stream Temporal Control Network (MST‑ControlNet) that simultaneously handles fine‑grained local interval synchronization and global style modulation.

Interval Feature Injection

An interval stream extracts per‑track features via an interval interaction module, models cross‑track interactions with self‑attention, and injects the fused features into each time interval using cross‑attention.
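Read as code, that description maps onto two attention blocks per interval. The PyTorch sketch below is a hedged interpretation; tensor shapes, module names, and residual placement are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class IntervalStream(nn.Module):
    """Sketch of the interval stream: the three audio tracks interact through
    self-attention, and the fused features are injected into the video tokens
    of a given time interval through cross-attention."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.track_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inject_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_tracks = nn.LayerNorm(dim)
        self.norm_video = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, track_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, n_video_tokens, dim), tokens of one time interval
        # track_tokens: (batch, n_tracks * n_audio_tokens, dim), per-track features
        t = self.norm_tracks(track_tokens)
        fused, _ = self.track_self_attn(t, t, t)        # cross-track interaction
        fused = track_tokens + fused
        v = self.norm_video(video_tokens)
        injected, _ = self.inject_cross_attn(v, fused, fused)  # inject into the interval
        return video_tokens + injected
```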

Global Feature Injection

A global stream extracts a segment‑level visual‑mood embedding with a global context encoder, applies average pooling to obtain a global feature, and modulates the video latent code through AdaLN.
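AdaLN here means the global audio feature predicts a per‑sample scale and shift that modulate the normalized video latent. A minimal sketch, with the encoder output shape and modulation details assumed for illustration:

```python
import torch
import torch.nn as nn

class GlobalStream(nn.Module):
    """Sketch of the global stream: average-pool a segment-level mood embedding
    into one vector and modulate the video latent through AdaLN."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, video_latent: torch.Tensor, mood_tokens: torch.Tensor) -> torch.Tensor:
        # video_latent: (batch, n_tokens, dim)
        # mood_tokens:  (batch, n_audio_tokens, dim) from the global context encoder
        global_feat = mood_tokens.mean(dim=1)                       # average pooling
        scale, shift = self.to_scale_shift(global_feat).chunk(2, dim=-1)
        # AdaLN: normalize, then apply the audio-conditioned scale and shift
        return self.norm(video_latent) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```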

DEMIX Dataset and Multi‑Stage Training

DEMIX is built by filtering raw video‑audio pairs into five overlapping subsets: basic‑face, single‑person, multi‑person, event‑effects, and environmental‑mood.
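Because the subsets overlap, a clip can satisfy several filters at once. The snippet below only illustrates that idea; the filtering criteria and clip attributes are placeholders, not the dataset's real rules.

```python
# Hypothetical predicates; the real DEMIX filtering criteria are not given here.
SUBSETS = {
    "basic_face":         lambda clip: clip.has_visible_face and clip.has_speech,
    "single_person":      lambda clip: clip.num_people == 1,
    "multi_person":       lambda clip: clip.num_people > 1,
    "event_effects":      lambda clip: clip.has_sound_effects,
    "environmental_mood": lambda clip: clip.has_music,
}

def assign_subsets(clip) -> list[str]:
    """A clip may satisfy several predicates, so the subsets overlap."""
    return [name for name, keep in SUBSETS.items() if keep(clip)]
```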

Stage 1: train on the basic‑face subset to learn lip motion.

Stage 2: add the single‑person subset to learn body pose, scene appearance, and camera motion.

Stage 3: incorporate the multi‑person subset for handling multiple speakers.

Stage 4: focus on event sequencing using the event‑effects subset, extending visual understanding from humans to objects.

Stage 5: train on the environmental‑mood subset to improve visual emotion representation.
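Taken together, the stages form a curriculum over the DEMIX subsets. The schedule below is a sketch of how such a curriculum might be wired up; the exact data mix per stage (in particular whether later stages keep the earlier subsets) is an assumption, and `build_loader` / `train_one_stage` are placeholders for the user's own pipeline.

```python
# Hypothetical staged curriculum mirroring the progression described above.
STAGES = [
    {"name": "stage1_lip_motion",   "subsets": ["basic_face"]},
    {"name": "stage2_body_scene",   "subsets": ["basic_face", "single_person"]},
    {"name": "stage3_multi_person", "subsets": ["basic_face", "single_person", "multi_person"]},
    {"name": "stage4_events",       "subsets": ["event_effects"]},
    {"name": "stage5_mood",         "subsets": ["environmental_mood"]},
]

def run_curriculum(model, build_loader, train_one_stage):
    """Run the five stages in order, rebuilding the data loader each time."""
    for stage in STAGES:
        loader = build_loader(stage["subsets"])
        train_one_stage(model, loader, stage_name=stage["name"])
```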

Multi‑Functional Generation Capabilities

MTV can generate character‑centric narratives, multi‑character interactions, sound‑triggered events, music‑driven atmosphere, and camera motion.

Comprehensive Evaluation

Metrics used include video quality (FVD), temporal consistency (Temp‑C), multimodal alignment (Text‑C, Audio‑C), and synchronization confidence and error (Sync‑C, Sync‑D). Compared with three state‑of‑the‑art baselines (MM‑Diffusion, TempoTokens, and Xing et al.), MTV achieves significantly lower FVD, maintains high temporal consistency (Temp‑C), markedly improves audio alignment (Audio‑C), and attains the best Sync‑C and Sync‑D scores, demonstrating the effectiveness of audio demixing and multi‑stream control.
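For the two synchronization metrics, a common SyncNet‑style recipe is to slide audio features against visual features over a window of temporal offsets, take the minimum mean distance as Sync‑D (lower is better), and the gap between the median and minimum distances as Sync‑C (higher is better). The sketch below assumes precomputed per‑frame embeddings from a SyncNet‑like model and follows that generic recipe, not the paper's exact evaluation code.

```python
import torch

def sync_metrics(vid_emb: torch.Tensor, aud_emb: torch.Tensor, max_offset: int = 15):
    """vid_emb, aud_emb: (T, D) per-frame embeddings from a SyncNet-style model.
    Returns (sync_d, sync_c) under the common definition described above."""
    dists = []
    for off in range(-max_offset, max_offset + 1):
        if off >= 0:
            v, a = vid_emb[off:], aud_emb[:vid_emb.shape[0] - off]
        else:
            v, a = vid_emb[:off], aud_emb[-off:]
        dists.append(torch.norm(v - a, dim=1).mean())
    dists = torch.stack(dists)
    sync_d = dists.min()               # distance at the best offset
    sync_c = dists.median() - sync_d   # confidence margin
    return sync_d.item(), sync_c.item()
```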

Qualitative comparisons show that prior methods struggle with stable narrative structure and accurate audio‑driven actions in complex or cinematic scenes, whereas MTV consistently produces high‑quality, tightly synchronized videos.

Paper: https://arxiv.org/abs/2506.08003

Tags: video synthesis, multimodal alignment, NeurIPS 2025, audio demixing, MTV framework, multi‑stream control