Baton: Semantic Blueprint Enables Precise Audio‑Video Synchronization in Generation
Current open‑source audio‑video generators struggle with complex, multi‑stage prompts, which leads to misaligned actions and sounds. Baton, introduced by Fudan University and Tencent, decouples semantic reasoning from content generation via a shared cross‑modal semantic blueprint and RS‑RoPE, and reports markedly better synchronization and prompt adherence.
Problem Statement
Open‑source video‑audio generation models handle simple prompts but fail on multi‑stage actions, complex character interactions, or precise temporal alignment. Typical failures include mismatched actions and sounds, incorrect dialogue mapping, and unsynchronized audio‑visual rhythms.
Root Cause of Existing Methods
Most approaches encode the entire text prompt into a single global semantic vector and inject it simultaneously into the video and audio diffusion processes. This provides scene‑level guidance, but it can neither decompose long‑range semantic relationships nor explicitly describe how characters, actions, and sounds correspond over time.
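A toy illustration of why a single pooled vector loses timing information, assuming mean pooling as a stand-in for the global prompt encoder (a deliberate simplification; real text encoders are more expressive). The event names, 2‑D embeddings, and timings are hypothetical:

```python
def mean_pool(vectors):
    """Collapse a sequence of vectors into one global vector (toy prompt encoder)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical 2-D embeddings for three events mentioned in a prompt.
events = {
    "door_slam": [1.0, 0.0],
    "scream": [0.0, 1.0],
    "footsteps": [0.5, 0.5],
}

# Two prompts that contain the same events in opposite order.
order_a = ["door_slam", "scream", "footsteps"]
order_b = ["footsteps", "scream", "door_slam"]

global_a = mean_pool([events[name] for name in order_a])
global_b = mean_pool([events[name] for name in order_b])

# An explicit plan keeps the timeline: (start_sec, end_sec, event).
plan_a = [(0.0, 1.0, "door_slam"), (1.0, 2.0, "scream"), (2.0, 4.0, "footsteps")]
plan_b = [(0.0, 2.0, "footsteps"), (2.0, 3.0, "scream"), (3.0, 4.0, "door_slam")]

same_global = all(abs(x - y) < 1e-12 for x, y in zip(global_a, global_b))
same_plan = plan_a == plan_b

print("global vectors identical:", same_global)
print("plans identical:", same_plan)
```

Under this simplification the two orderings are indistinguishable once pooled, while the per-event plan still separates them.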
Related Work
Ovi introduced a native video‑audio joint generation framework with a dual‑branch DiT architecture. LTX‑2.3 scaled model size and data quality, and MOVA enhanced training strategies and cross‑modal collaboration. Recent multimodal large language models (e.g., Qwen3, Qwen3‑VL, Qwen3‑Omni) have been used to expand or rewrite prompts, yet they still compress complex prompts into a unified global representation.
Baton Overview
Baton decouples semantic reasoning from content generation. It first constructs a cross‑modal shared Semantic Blueprint and then generates video and audio synchronously based on this blueprint.
VA‑Planner
VA‑Planner employs a multimodal large language model (MLLM) to perform explicit semantic reasoning on the user prompt, producing two sets of Planned Tokens, one for video and one for audio. These tokens encode what happens, where, and when, forming a fine‑grained semantic blueprint that guides subsequent diffusion.
Dual Semantic Alignment Towers
Two modality‑specific towers (video and audio) map Planned Tokens from language space into visual and audio feature space. Each tower uses learnable queries, with SigLIP2 (video) or WavTokenizer (audio) serving as supervision. The towers also incorporate bidirectional cross‑modal attention, so video planning can attend to audio information and vice versa, yielding a unified blueprint on a shared timeline.
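The data flow through the two towers can be sketched roughly as follows. This is a minimal sketch, not the paper's architecture: it assumes single‑head attention without learned projections, random vectors in place of MLLM hidden states and encoder features, and a mean‑squared‑error loss as the supervision signal. All names, sizes, and the loss choice are hypothetical.

```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Single-head dot-product attention; memory rows act as keys and values."""
    dim = len(queries[0])
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        output.append([sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)])
    return output


def add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def mse(pred, target):
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def random_matrix(rows, dim, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


rng = random.Random(0)
DIM = 8
planner_hidden = random_matrix(12, DIM, rng)  # stand-in for planner output
video_queries = random_matrix(4, DIM, rng)    # learnable queries, video tower
audio_queries = random_matrix(6, DIM, rng)    # learnable queries, audio tower

# Step 1: each tower's queries read from the planner output.
video_plan = attend(video_queries, planner_hidden)
audio_plan = attend(audio_queries, planner_hidden)

# Step 2: bidirectional cross-modal attention with a residual connection.
video_out = add(video_plan, attend(video_plan, audio_plan))
audio_out = add(audio_plan, attend(audio_plan, video_plan))

# Step 3: supervise against target features (random stand-ins for encoder features).
video_target = random_matrix(4, DIM, rng)
video_loss = mse(video_out, video_target)
print(len(video_out), len(audio_out), round(video_loss, 4))
```

Both directions in step 2 read the pre-update plans, so neither modality is privileged in this sketch.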
Relative Semantic RoPE (RS‑RoPE)
Planned Tokens and diffusion latents reside on different spatio‑temporal grids. RS‑RoPE builds a unified relative semantic coordinate system that maps both onto the same reference space, enabling precise alignment between the blueprint and the diffusion latents during denoising. In effect, it acts as a navigation system that keeps the audio and video evolving consistently.
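The exact RS‑RoPE formulation is not given here, so the following is only a one‑dimensional, time‑only sketch of the idea of a shared reference space. It assumes that both grids are rescaled onto a common span before standard rotary embeddings are applied; `SPAN`, `relative_positions`, and the grid sizes are hypothetical.

```python
import math


def rope(vec, pos, base=10000.0):
    """Standard rotary embedding on consecutive pairs of a 1-D vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relative_positions(num_steps, span):
    """Map grid indices 0..num_steps-1 onto a shared [0, span) timeline."""
    return [span * i / num_steps for i in range(num_steps)]


SPAN = 32.0                                # hypothetical shared timeline length
plan_pos = relative_positions(4, SPAN)     # 4 planned tokens over the clip
latent_pos = relative_positions(16, SPAN)  # 16 latent frames over the same clip

q = [0.3, -1.2, 0.7, 0.5]   # toy latent query
k = [1.0, 0.4, -0.6, 0.9]   # toy planned-token key

# Planned token 1 and latent frame 4 both start a quarter of the way into the
# clip, so they receive the same coordinate and the rotation cancels out.
aligned = dot(rope(q, latent_pos[4]), rope(k, plan_pos[1]))
print(abs(aligned - dot(q, k)) < 1e-9)

# The score depends only on the offset in the shared coordinate system.
score_1 = dot(rope(q, latent_pos[6]), rope(k, plan_pos[1]))
score_2 = dot(rope(q, latent_pos[10]), rope(k, plan_pos[2]))
print(abs(score_1 - score_2) < 1e-9)
```

With raw grid indices instead, planned token 1 would sit next to latent frame 1 rather than frame 4, which is the mismatch a shared coordinate system is meant to remove.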
Training Strategy
VA‑Planner pre‑training: learns to translate prompts into cross‑modal semantic plans, using real video‑audio pairs as supervision.
DiT adaptation: trains the diffusion Transformer (DiT) to model the distribution of these semantic features, preventing interference from planner prediction errors.
Joint fine‑tuning: freezes the VA‑Planner parameters and trains the DiT with planned tokens as conditional input, reducing exposure bias and improving stability.
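The three stages can be summarized as a small schedule. This is a reading of the description above rather than the authors' training configuration: the module names are hypothetical, and where the text does not say whether a module is frozen, the sketch leaves it unlisted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: tuple       # modules updated in this stage
    frozen: tuple          # modules explicitly frozen in this stage
    condition_source: str  # where the DiT's semantic condition comes from


SCHEDULE = [
    Stage(
        name="planner_pretraining",
        trainable=("va_planner",),
        frozen=(),
        condition_source="not applicable (the planner itself is being trained)",
    ),
    Stage(
        name="dit_adaptation",
        trainable=("dit",),
        frozen=(),  # planner status in this stage is not stated in the text
        condition_source="semantic features from real data (assumed), not planner output",
    ),
    Stage(
        name="joint_finetuning",
        trainable=("dit",),
        frozen=("va_planner",),
        condition_source="planned tokens predicted by the frozen VA-Planner",
    ),
]


def trainable_modules(stage_name):
    """Return the modules that receive gradient updates in a given stage."""
    for stage in SCHEDULE:
        if stage.name == stage_name:
            return stage.trainable
    raise KeyError(stage_name)


for stage in SCHEDULE:
    print(f"{stage.name}: train={stage.trainable}, frozen={stage.frozen}")
```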
Experiments
Quantitative comparisons were performed on Verse‑Bench (simple‑scene benchmark) and Sem100 (100 complex prompts featuring multi‑person interactions and sequential actions). Evaluation metrics covered video quality (AQ, IQ, DD, ID), audio quality (PQ, CU), synchronization (Sync‑C, Sync‑D, DeSync), and prompt adherence (P‑Acc).
On Verse‑Bench, Baton matches the leading open‑source model LTX‑2. On the more challenging Sem100, Baton outperforms LTX‑2, improving prompt‑adherence accuracy (P‑Acc) by 32%, multi‑speaker word error rate (M‑WER) by 76%, and desynchronization (DeSync) by 30%.
The M‑WER gain is notable because multi‑speaker scenarios require the model to determine “who speaks when,” a capability enabled by Baton’s fine‑grained temporal semantic planning.
Comparisons with several closed‑source commercial models show that while Baton lags behind top commercial systems in visual fidelity and audio aesthetics, it surpasses them in handling complex, multi‑stage prompts.
Conclusion
Explicit cross‑modal semantic planning via a shared blueprint and precise alignment mechanisms (RS‑RoPE) substantially improves audio‑video synchronization and complex prompt compliance, addressing a key limitation of prior global‑embedding approaches.
Paper: https://arxiv.org/pdf/2605.25195
Project page: https://francis-rings.github.io/Baton/
