How VeM Achieves Precise Semantic, Temporal, and Rhythmic Alignment in Video-to-Music Generation

The VeM model introduces a latent diffusion framework that combines hierarchical video parsing, scene‑guided cross‑attention, and a transition‑beat alignment adapter to generate high‑fidelity background music aligned with video semantics, timing, and rhythm, outperforming existing baselines in extensive quantitative and qualitative evaluations.

1. Introduction

Video‑to‑Music (V2M) generation aims to produce background music that aligns with a given video in semantics, timing, and rhythm, enhancing the audiovisual experience. Existing methods suffer from incomplete video representation and weak audio‑visual temporal synchronization, especially at the beat level.

The Peking‑Alibaba AI Innovation Joint Lab proposes VeM (Video‑Echoed in Music), a latent diffusion model that generates high‑quality music tracks with strong semantic, temporal, and rhythmic alignment. The method is validated by extensive experiments and has been accepted as an oral paper at AAAI 2026.

Paper title: Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation

Paper link: https://arxiv.org/pdf/2511.09585

Project page: https://vem-paper.github.io/VeM-page

2. Method

VeM addresses V2M challenges with a latent diffusion model that incorporates three core modules.

2.1 Hierarchical Video Parsing

The parser extracts three levels of information from the input video:

Global: overall theme, atmosphere, and emotion.

Storyboard: segmented shots with narrative description, visual content, and absolute start time.

Frame‑level: precise timestamps of scene transitions.

Global cues come from a multimodal language model (MLLM) and music‑style classifiers; storyboard cues provide local visual features, descriptions, and durations; frame‑level cues are detected by a scene‑change detector. These annotations are pre‑processed and manually refined before training.
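For concreteness, the three‑level parse could be organized roughly as follows. This is a hypothetical schema of our own, not the paper's data format, and the PySceneDetect call is shown only as one plausible way to obtain frame‑level cut timestamps (the reference list points to that library as a scene‑change detector).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:
    """One storyboard entry: a shot with its description and absolute start time."""
    start_sec: float
    narrative: str        # narrative description of the shot
    visual_summary: str   # visual-content summary produced by the MLLM

@dataclass
class VideoAnnotation:
    """Hypothetical container for the three-level hierarchical parse."""
    theme: str                     # global: overall theme / atmosphere
    emotion: str                   # global: emotion label
    music_style: str               # global: music-style classifier output
    shots: List[Shot] = field(default_factory=list)               # storyboard level
    transition_times: List[float] = field(default_factory=list)   # frame level (seconds)

def detect_transitions(video_path: str) -> List[float]:
    """Frame-level cut timestamps via a scene-change detector (PySceneDetect 0.6.x API)."""
    from scenedetect import detect, ContentDetector
    scenes = detect(video_path, ContentDetector())
    # each scene is a (start, end) pair; every scene start after the first marks a cut
    return [start.get_seconds() for start, _ in scenes[1:]]
```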

2.2 Scene‑Guided Cross‑Attention (SG‑CAtt)

Standard cross‑attention lacks fine‑grained temporal modeling. SG‑CAtt concatenates global features with storyboard features, using them as Key and Value in the attention mechanism, while the diffusion latent serves as Query. A storyboard mask sMask restricts attention to tokens within the same shot, preserving semantic consistency across shots and enabling precise local time synchronization.

Mathematically, the attention computation follows the standard scaled dot‑product formulation, with the mask applied to the similarity matrix before softmax.
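With notation of our own (not taken from the paper), and assuming global tokens remain visible to every query, the masked computation can be written as

$$
\mathrm{SGCAtt}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}+M\right)V,\qquad
M_{ij}=\begin{cases}0, & \text{key } j \text{ is a global token or lies in the same shot as query } i,\\[2pt]
-\infty, & \text{otherwise,}\end{cases}
$$

where Q comes from the diffusion latent, K and V come from the concatenated global and storyboard features, and M is the storyboard mask sMask. The −∞ entries vanish under the softmax, so each segment of the music latent attends only to its own shot plus the global context.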

2.3 Transition‑Beat Alignment Adapter (TB‑As)

A binary sequence from the video parser marks frame‑level transitions (1 for a transition, 0 otherwise), and an RNN‑based beat detector produces a matching binary beat sequence; their intersection defines the timestamps at which visual cuts should coincide with musical beats. A ResNet‑(2+1)D‑based Aligner is trained with a BCE loss to predict these timestamps. The Adapter, following AdaLN, passes the alignment cues through an MLP to obtain scale and shift factors that modulate the normalized music features, injecting the alignment information into the diffusion latent.
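As a rough sketch (module names, tensor shapes, and the exact conditioning interface are our assumptions, not the paper's implementation), the alignment targets and the AdaLN‑style injection might look like this:

```python
import torch
import torch.nn as nn

def alignment_targets(transitions: torch.Tensor, beats: torch.Tensor) -> torch.Tensor:
    """Binary targets: 1 where a visual transition coincides with a musical beat.

    transitions, beats: (T,) binary sequences on a shared time grid (assumed).
    """
    return transitions * beats  # elementwise AND for {0, 1} sequences

class TransitionBeatAdapter(nn.Module):
    """AdaLN-style injection of transition-beat cues into the music latent (our sketch)."""

    def __init__(self, cond_dim: int, latent_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(latent_dim, elementwise_affine=False)
        # MLP maps the alignment cue to a per-channel scale and shift
        self.to_scale_shift = nn.Sequential(
            nn.Linear(cond_dim, latent_dim * 2),
            nn.SiLU(),
            nn.Linear(latent_dim * 2, latent_dim * 2),
        )

    def forward(self, latent: torch.Tensor, cue: torch.Tensor) -> torch.Tensor:
        # latent: (B, T, latent_dim) music latent; cue: (B, T, cond_dim) alignment cues
        scale, shift = self.to_scale_shift(cue).chunk(2, dim=-1)
        return self.norm(latent) * (1 + scale) + shift
```

Modulating with (1 + scale) keeps the adapter close to an identity mapping when the MLP output is near zero, which is the usual reason AdaLN‑style modules initialize their final layer near zero.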

2.4 Training and Inference

Training proceeds in stages (a minimal parameter‑freezing sketch follows the list):

Pre‑train a music reconstruction VAE and the transition‑beat Aligner independently.

Freeze the VAE, Aligner, text encoder, and video encoder.

Train the latent diffusion model, updating only the time‑embedding parameters to focus on semantic and temporal cues from the hierarchical parser.

Finally, integrate the frozen Aligner and jointly fine‑tune the Adapter to improve rhythmic consistency.
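Below is a minimal sketch of that freezing scheme, assuming each component is a standard PyTorch module; the stage names and the idea that the diffusion backbone stays trainable during the final adapter fine‑tuning are our reading of the text, not confirmed details.

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Stop gradient updates for a pretrained component."""
    for p in module.parameters():
        p.requires_grad = False

def configure_stage(vae, aligner, text_enc, video_enc, diffusion, adapter, stage: str):
    """Illustrative freezing scheme for the staged training described above."""
    if stage == "diffusion":
        # After pre-training: VAE, Aligner, and both encoders stay frozen;
        # only the latent diffusion model is trained (per the text, with a focus
        # on its time-embedding parameters).
        for m in (vae, aligner, text_enc, video_enc):
            freeze(m)
        return [diffusion]
    if stage == "adapter":
        # Final stage: the Aligner remains frozen while the Adapter is fine-tuned
        # (jointly with the diffusion backbone) for rhythmic consistency.
        for m in (vae, aligner, text_enc, video_enc):
            freeze(m)
        return [diffusion, adapter]
    raise ValueError(f"unknown stage: {stage}")
```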

During inference, random noise initializes the latent diffusion model. The hierarchical parser supplies conditioning embeddings, the Aligner predicts transition‑beat cues, and the Adapter injects these cues into the music latent, producing a synchronized audio track.
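Putting the pieces together, inference might proceed along these lines. Every interface shown here (the parser, aligner, adapter, diffusion, and vae objects and their methods) is hypothetical and only meant to illustrate the order of operations:

```python
import torch

@torch.no_grad()
def generate_music(video, parser, aligner, diffusion, adapter, vae, num_steps: int = 50):
    """Illustrative inference loop with hypothetical component interfaces."""
    annotation = parser(video)            # hierarchical parse: global / storyboard / frame level
    cond = annotation.conditioning()      # semantic + temporal embeddings for SG-CAtt
    cues = aligner(video)                 # predicted transition-beat alignment cues

    latent = torch.randn(diffusion.latent_shape)              # start from random noise
    for t in reversed(range(num_steps)):
        modulated = adapter(latent, cues)                     # inject rhythm cues via scale/shift
        latent = diffusion.denoise_step(modulated, t, cond)   # one reverse step with SG-CAtt
    return vae.decode(latent)                                 # decode the music latent to audio
```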

3. Experiments

3.1 Dataset

A new high‑quality paired video‑music dataset, TB‑Match, contains ~18,000 samples sourced from e‑commerce ads and mainstream video platforms, emphasizing frequent and precise scene‑to‑beat synchronization. Additional data from M2UGen (13,000 pairs) and supplementary datasets (SymMV, Sora silent videos) raise the total training duration to ~280 hours.

3.2 Quantitative Metrics

VeM is compared against five baselines across nine objective metrics, covering music quality, semantic alignment, temporal sync, and rhythmic precision. VeM consistently outperforms audio‑only methods (GVMGen, VidMuse, M2UGen) and MIDI‑based methods (CMT, Diff‑BGM).

3.3 Qualitative Evaluation

Human studies (expert and non‑expert participants) show VeM achieving the highest Top‑1 preference rate. Mean Opinion Scores for quality (MOS‑Q) and alignment (MOS‑A) confirm superior perceived performance across diverse evaluator backgrounds.

3.4 Results Showcase

Applied in Alibaba’s intelligent video production pipeline, VeM demonstrates smooth music generation and strong beat‑to‑visual alignment in e‑commerce ad scenarios as well as general video content.

4. Conclusion

VeM introduces a comprehensive solution for video‑to‑music generation by combining hierarchical video parsing, scene‑guided cross‑attention, and a transition‑beat alignment adapter within a latent diffusion framework. The new TB‑Match dataset and extensive evaluations substantiate its advantages. Future work will explore joint audio‑visual generation and broader application scenarios.

References

[1] S. Bai et al., “Qwen2.5‑VL Technical Report,” arXiv:2502.13923, 2025.

[2] B. Castellano, “PySceneDetect,” https://github.com/Breakthrough/PySceneDetect, 2024.

[3] H. Liu et al., “AudioLDM 2: Learning holistic audio generation with self‑supervised pretraining,” IEEE/ACM Trans. Audio Speech Lang. Process., 2024.

[4] H. Zuo et al., “GVMGen: A General Video‑to‑Music Generation Model With Hierarchical Attentions,” AAAI, 2025.

[5] S. Liu et al., “Multi‑modal Music Understanding and Generation with the Power of Large Language Models,” arXiv:2311.11255, 2023.

[6] L. Zhuo et al., “Video background music generation: Dataset, method and evaluation,” IEEE/CVF ICCV, 2023.

[7] Z. Tian et al., “VidMuse: A simple video‑to‑music generation framework with long‑short‑term modeling,” arXiv:2406.04321, 2024.

[8] S. Di et al., “Video background music generation with controllable music transformer,” ACM MM, 2021.

[9] S. Li et al., “Diff‑BGM: A Diffusion Model for Video Background Music Generation,” IEEE/CVF CVPR, 2024.
