Inside MOVA: Open-Source End-to-End Audio-Video Generation

OpenMOSS and MOSI unveiled MOVA, China’s first high‑performance open‑source audio‑video generation model, detailing its dual‑tower architecture, bridge module, Aligned RoPE, multi‑stage data pipeline, training strategies, dual CFG guidance, and benchmark results that surpass leading closed‑source systems.


Model Overview

MOVA is a ~32‑billion‑parameter mixture‑of‑experts (MoE) model that supports both image‑to‑audio‑video and text‑to‑audio‑video generation. It consists of a large video DiT backbone (≈14 B parameters, based on Wan 2.2 I2V) and a smaller 1.3 B‑parameter audio diffusion backbone.

Architecture

The model uses an asymmetric dual‑tower design coupled by a bidirectional bridge module that applies cross‑attention between video and audio hidden states at every layer. To align the two modalities' different temporal resolutions (24 fps video versus much higher‑rate audio), MOVA introduces Aligned RoPE, a positional‑embedding scheme that maps both onto a shared physical time axis.
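As a rough illustration of the idea, the sketch below derives rotary positions from physical time (seconds) rather than token index, so audio and video tokens that co-occur in time share the same phase; the frame rates, embedding dimension, and RoPE base used here are illustrative assumptions, not MOVA's released values.

```python
import torch

def shared_time_positions(num_video_frames: int, num_audio_frames: int,
                          video_fps: float = 24.0, audio_fps: float = 50.0):
    """Map both token sequences onto one shared physical time axis (seconds)."""
    video_t = torch.arange(num_video_frames) / video_fps  # seconds
    audio_t = torch.arange(num_audio_frames) / audio_fps  # seconds
    return video_t, audio_t

def rope_angles(t: torch.Tensor, dim: int = 64, base: float = 10000.0):
    """Standard RoPE angles, except indexed by physical time instead of token index."""
    freqs = base ** (-torch.arange(0, dim, 2).float() / dim)  # (dim/2,)
    return torch.outer(t, freqs)                              # (len, dim/2), fed into cos/sin rotation

clip_seconds = 8.05  # clip length from the data pipeline; latent frame rates here are assumed
video_t, audio_t = shared_time_positions(int(clip_seconds * 24), int(clip_seconds * 50))
video_angles, audio_angles = rope_angles(video_t), rope_angles(audio_t)
# Tokens that occur at the same wall-clock time now receive identical rotary phases,
# so the bridge's cross-attention can match audio and video content temporally.
```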

Data Pipeline

1. Pre‑process raw footage into fixed 720p, 24 fps, 8.05 s clips.
2. Filter clips for audio quality, video quality, and audio‑video sync.
3. Generate modality‑specific annotations with audio and visual understanding models, then fuse the descriptions using a large language model to obtain fine‑grained multimodal captions (sketched below).

This pipeline preserves most of the original content while providing detailed annotations that improve generalisation to complex scenes.
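A minimal sketch of the annotation-and-fusion step, assuming hypothetical model wrappers (audio_model, vision_model, llm) in place of the unnamed understanding models; the prompt wording is illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    video_path: str
    audio_path: str

def caption_clip(clip: Clip, audio_model, vision_model, llm) -> str:
    # Hypothetical wrappers: each .describe()/.complete() call stands in for one of
    # the annotation or language models referenced in the pipeline description.
    audio_desc = audio_model.describe(clip.audio_path)    # e.g. "male speech over light rain"
    visual_desc = vision_model.describe(clip.video_path)  # e.g. "a man talks by a rainy window"
    fusion_prompt = (
        "Fuse these two descriptions into one fine-grained caption that ties "
        "each sound to its visible source.\n"
        f"Audio: {audio_desc}\nVideo: {visual_desc}"
    )
    return llm.complete(fusion_prompt)  # fused multimodal caption used as training text
```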

Training Strategy

360p Stage 1: Low‑resolution training focuses on learning basic audio‑visual alignment; the bridge module uses a learning rate twice that of the backbone towers (see the optimizer sketch after Stage 3).

360p Stage 2: Still at 360p, the goal shifts to stabilising and refining alignment; text dropout is reduced and loudness normalisation (LUFS) prevents volume distortion.

720p Stage 3: High‑resolution fine‑tuning leverages the already‑stable alignment to improve visual detail, using finer checkpoint granularity and aggressive parallel optimisation.
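Stage 1's learning-rate split can be expressed as optimizer parameter groups. The sketch below assumes a PyTorch model whose bridge parameters carry a "bridge." name prefix and a 1e-5 base rate; both are assumptions rather than the released training configuration.

```python
import torch

def build_optimizer(model: torch.nn.Module, base_lr: float = 1e-5):
    bridge, towers = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue  # frozen weights are skipped entirely
        (bridge if name.startswith("bridge.") else towers).append(p)
    return torch.optim.AdamW([
        {"params": towers, "lr": base_lr},        # video / audio towers
        {"params": bridge, "lr": 2.0 * base_lr},  # bridge module trains twice as fast
    ])
```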

During training, a Dual Sigma Shift applies different noise schedules to audio and video, mitigating modality‑specific bias in the diffusion process.
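For intuition, the sketch below applies a rectified-flow-style timestep shift with a different shift value per modality; the shift formula is the one popularised by SD3- and Wan-style samplers, and the concrete values are assumptions rather than MOVA's actual schedules.

```python
import torch

def shift_timesteps(t: torch.Tensor, shift: float) -> torch.Tensor:
    """Warp uniform t in [0, 1] so more training mass lands at high-noise steps."""
    return shift * t / (1.0 + (shift - 1.0) * t)

t = torch.rand(8)                        # one sampled timestep per training example
t_video = shift_timesteps(t, shift=5.0)  # heavier shift for the high-dimensional video latent (assumed value)
t_audio = shift_timesteps(t, shift=3.0)  # milder shift for the smaller audio branch (assumed value)
# Each tower then adds noise according to its own shifted schedule, so neither
# modality dominates the diffusion loss.
```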

Agent Workflow

Visual Understanding: Qwen‑3‑VL parses the initial image into visual constraints.

Prompt Reconstruction: A general LLM (e.g., Gemini) rewrites the user prompt into a format better aligned with the model’s training distribution.

Dual‑Condition Generation: The reconstructed prompt and visual constraints jointly guide generation, preserving style and ensuring precise lip‑sync.
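Put together, the workflow reads roughly as follows; the client wrappers (vlm, llm, mova) and the prompt text are hypothetical stand-ins, not the released agent code.

```python
def generate_audio_video(user_prompt: str, image_path: str, vlm, llm, mova):
    # 1. Visual understanding: extract constraints (subjects, layout, lighting) from the image.
    visual_constraints = vlm.describe(image_path)
    # 2. Prompt reconstruction: rewrite the request in the caption style seen during training.
    rewritten = llm.complete(
        "Rewrite this request as a fine-grained audio-video caption, consistent "
        "with the image description below.\n"
        f"Request: {user_prompt}\nImage: {visual_constraints}"
    )
    # 3. Dual-condition generation: the image and the rewritten prompt jointly drive the model.
    return mova.generate(image=image_path, prompt=rewritten)
```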

The system also supports Dual Classifier‑Free Guidance (Dual CFG) to balance textual fidelity and bridge strength, and includes LUFS‑based loudness normalisation to keep audio clear under strong guidance.
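One plausible way to realise a dual guidance rule is to add a text direction and a bridge direction to the unconditional prediction, each with its own weight. The formulation and weights below are a sketch under that assumption, not MOVA's exact guidance equation.

```python
import torch

def dual_cfg(x_uncond: torch.Tensor, x_text: torch.Tensor, x_text_bridge: torch.Tensor,
             w_text: float = 5.0, w_bridge: float = 2.0) -> torch.Tensor:
    """x_* are denoiser outputs under different conditioning drop-outs."""
    text_dir = x_text - x_uncond          # pushes the sample toward the text prompt
    bridge_dir = x_text_bridge - x_text   # additionally pushes toward cross-modal coupling
    # A larger w_bridge tightens audio-video sync but, pushed too far, can distort
    # loudness, hence the LUFS normalisation mentioned above.
    return x_uncond + w_text * text_dir + w_bridge * bridge_dir
```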

Experimental Results

On Verse‑Bench, MOVA‑720p achieves the best lip‑sync scores (LSE‑D = 7.094, lower is better; LSE‑C = 7.452, higher is better) and the best cpCER results, outperforming the open‑source baselines LTX‑2 and Ovi as well as the cascaded Wan 2.1 + MMAudio pipeline. In a human‑in‑the‑loop Arena evaluation with over 5,000 votes, MOVA attains an Elo of 1113.8 and a win rate above 70 % against Ovi and Wan + MMAudio.

Open‑Source Release

The full stack—including model weights, training code, inference code, and fine‑tuning recipes—is released at https://github.com/OpenMOSS/MOVA and the project homepage https://mosi.cn/models/mova. MOVA integrates with high‑performance inference frameworks such as SGLang, and a 360p variant is provided for modest GPU hardware.

[Figure: MOVA architecture diagram]
[Figure: Sample generation results]