How ByteDance’s AI Transforms Music Creation and Discovery on TikTok

ByteDance leverages advanced AI models such as SpecTNT, semi‑supervised music tagging transformers, language identification, chord recognition, contrastive representation learning, and source separation to power TikTok’s massive music library, enabling seamless music‑video interaction, smarter recommendations, and new creative tools for creators worldwide.

Volcano Engine Developer Services

Music & Visual Interaction Technology Simplifies Creation

TikTok has become a major channel for music promotion, turning short‑video background‑music (BGM) hits into chart‑topping tracks across platforms. ByteDance’s vast music library, containing billions of audio fragments, is powered by SAMI (Speech, Audio and Music Intelligence), which enables deep audio analysis and intelligent content creation.

SpecTNT: Time‑Frequency Transformer for Music Audio

SpecTNT is a deep‑learning model designed for modeling music spectrograms. It converts audio signals into spectrograms via the short‑time Fourier transform, then applies stacked time‑frequency Transformer layers with residual connections to capture high‑level features, supporting tasks such as vocal‑melody extraction, music structure analysis, and improved audio‑visual matching.
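
The core idea can be sketched in a few lines. The snippet below is a minimal illustration in PyTorch, assuming alternating self‑attention along the frequency and time axes; the layer sizes, window settings, and block layout are placeholders, not the published SpecTNT architecture.

```python
import torch
import torch.nn as nn

class TimeFreqBlock(nn.Module):
    """Illustrative block: self-attention along frequency, then along time,
    each with a residual connection (a stand-in for SpecTNT's design)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.freq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_f = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, time, freq, dim)
        b, t, f, d = x.shape
        h = x.reshape(b * t, f, d)             # attend across frequency bins
        q = self.norm_f(h)
        h = h + self.freq_attn(q, q, q)[0]
        x = h.reshape(b, t, f, d).permute(0, 2, 1, 3)
        h = x.reshape(b * f, t, d)             # attend across time frames
        q = self.norm_t(h)
        h = h + self.time_attn(q, q, q)[0]
        return h.reshape(b, f, t, d).permute(0, 2, 1, 3)

# Spectrogram via short-time Fourier transform, then embed each T-F bin.
wave = torch.randn(1, 16000)                   # one second of 16 kHz audio
spec = torch.stft(wave, n_fft=512, hop_length=256,
                  window=torch.hann_window(512),
                  return_complex=True).abs()   # (batch, freq, time)
tokens = spec.permute(0, 2, 1).unsqueeze(-1)   # (batch, time, freq, 1)
tokens = nn.Linear(1, 64)(tokens)              # per-bin embedding
print(TimeFreqBlock()(tokens).shape)           # torch.Size([1, 63, 257, 64])
```

Attending along both axes lets the model relate harmonics within a frame and track how they evolve across frames.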

ISMIR 2021 Paper: SpecTNT: A Time‑Frequency Transformer for Music Audio
SpecTNT architecture diagram

Semi‑Supervised Music Tagging Transformer

To organize the massive music catalog, ByteDance introduced a semi‑supervised Transformer model that tags music by genre and similarity, reducing reliance on manual labeling and outperforming large‑scale residual networks. This model powers music recommendation in products like Resso, TikTok, and Jianying.
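
One common way to realize this kind of semi‑supervision is noisy‑student training, where a trained teacher pseudo‑labels unlabeled tracks and an augmented student learns from both. The sketch below assumes PyTorch; the linear stand‑in models, noise augmentation, and 50‑tag output are illustrative only, not the paper’s exact recipe.

```python
import torch
import torch.nn as nn

def noisy_student_step(teacher, student, labeled, unlabeled, optimizer,
                       augment, threshold=0.5):
    """One illustrative semi-supervised step: the teacher pseudo-labels
    unlabeled audio, and the student trains on both with augmentation."""
    x_l, y_l = labeled                        # labeled batch (audio, tags)
    x_u = unlabeled                           # unlabeled audio batch
    with torch.no_grad():
        pseudo = torch.sigmoid(teacher(x_u))  # soft multi-label pseudo-tags
    loss_fn = nn.BCEWithLogitsLoss()
    loss = loss_fn(student(augment(x_l)), y_l) \
         + loss_fn(student(augment(x_u)), (pseudo > threshold).float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with stand-in models and random "audio" features.
teacher = nn.Linear(128, 50)   # placeholder for a trained tagging Transformer
student = nn.Linear(128, 50)   # student predicts 50 tags (multi-label)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
augment = lambda x: x + 0.01 * torch.randn_like(x)   # stand-in augmentation
labeled = (torch.randn(8, 128), torch.randint(0, 2, (8, 50)).float())
print(noisy_student_step(teacher, student, labeled,
                         torch.randn(8, 128), opt, augment))
```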

ISMIR 2021 Paper: Semi‑supervised Music Tagging Transformer
Music tagging workflow

Music Language Identification for Multilingual Users

ByteDance’s music language identification system recognizes singing languages from among dozens of candidates and reports each language’s proportion within a track, helping improve user retention in multilingual markets. The system combines log‑Mel spectrograms, a 50‑layer deep residual network, and multimodal text metadata to output language predictions.
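
The fusion can be pictured as two branches joined before a classifier. The sketch below assumes PyTorch and uses a small CNN as a stand‑in for the 50‑layer residual network, plus a bag‑of‑tokens embedding for the textual metadata; all dimensions and the 40‑language output are illustrative.

```python
import torch
import torch.nn as nn

class MultimodalLangID(nn.Module):
    """Illustrative fusion of an audio branch (stand-in for the 50-layer
    ResNet over log-Mel spectrograms) and a metadata text branch."""
    def __init__(self, n_langs=40, vocab=10000, dim=64):
        super().__init__()
        self.audio_branch = nn.Sequential(     # placeholder, not ResNet-50
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.text_branch = nn.EmbeddingBag(vocab, dim)  # bag of metadata tokens
        self.classifier = nn.Linear(2 * dim, n_langs)

    def forward(self, logmel, tokens):
        a = self.audio_branch(logmel)          # (batch, dim)
        t = self.text_branch(tokens)           # (batch, dim)
        return self.classifier(torch.cat([a, t], dim=-1))

model = MultimodalLangID()
logmel = torch.randn(2, 1, 128, 400)           # (batch, 1, mel bins, frames)
tokens = torch.randint(0, 10000, (2, 12))      # tokenized title/artist text
probs = model(logmel, tokens).softmax(dim=-1)  # per-language probabilities
print(probs.shape)                             # torch.Size([2, 40])
```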

ISMIR 2021 Paper: Listen, Read, and Identify: Multimodal Singing Language Identification of Music
Language identification model diagram

Automatic Chord Recognition Enhances AI Composition

A deep neural autoregressive distribution estimator (NADE) enables rich chord recognition across massive MIDI datasets, providing high‑quality chord segments for AI‑driven music composition and improving the coherence of generated music.
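
The coherence comes from conditioning each frame’s chord on the chord decoded before it. Below is a minimal sketch of such an autoregressive decoding loop, assuming PyTorch; the GRU cell, greedy decoding, and 25‑chord vocabulary are illustrative choices rather than the published model.

```python
import torch
import torch.nn as nn

class AutoregressiveChordDecoder(nn.Module):
    """Illustrative NADE-style decoder: the chord for each frame is predicted
    from acoustic features plus the previously decoded chord, so the model
    learns transition coherence rather than frame-independent labels."""
    def __init__(self, n_chords=25, feat_dim=64, hidden=128):
        super().__init__()
        self.start = n_chords                    # extra "start" token id
        self.emb = nn.Embedding(n_chords + 1, 16)
        self.cell = nn.GRUCell(feat_dim + 16, hidden)
        self.out = nn.Linear(hidden, n_chords)

    @torch.no_grad()
    def decode(self, feats):                     # feats: (frames, feat_dim)
        h = torch.zeros(1, self.cell.hidden_size)
        prev = torch.tensor([self.start])
        chords = []
        for t in range(feats.size(0)):
            x = torch.cat([feats[t:t + 1], self.emb(prev)], dim=-1)
            h = self.cell(x, h)
            prev = self.out(h).argmax(dim=-1)    # greedy; beam search also common
            chords.append(prev.item())
        return chords

decoder = AutoregressiveChordDecoder()
print(decoder.decode(torch.randn(8, 64)))        # 8 frames -> 8 chord ids
```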

ISMIR 2021 Paper: A deep learning method for enforcing coherence in Automatic Chord Recognition
Chord recognition results

Contrastive Learning of Musical Representations (CLMR)

CLMR learns universal music embeddings with minimal labeled data using contrastive learning and extensive audio augmentations. The model achieves strong performance on downstream audio classification tasks, reducing data annotation costs.
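
CLMR follows a SimCLR‑style recipe: two augmented views of the same clip should embed close together, while the other clips in the batch act as negatives. Below is a minimal sketch of that NT‑Xent objective in PyTorch; the batch size, embedding dimension, and temperature are placeholders.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Illustrative SimCLR-style contrastive loss: two augmented views of the
    same clip are positives; all other clips in the batch are negatives."""
    z = F.normalize(torch.cat([z1, z2]), dim=-1)   # (2N, dim) unit vectors
    sim = z @ z.t() / temperature                  # scaled cosine similarities
    n = z1.size(0)
    sim.fill_diagonal_(float('-inf'))              # exclude self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Toy usage: embeddings of two augmented views (pitch shift, noise, etc.).
z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(nt_xent(z1, z2).item())
```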

ISMIR 2021 Paper: Contrastive Learning of Musical Representations
CLMR training pipeline

Music Structure Analysis for Creative Potential

ByteDance’s music structure analysis detects highlights and loops, enabling intelligent music length extension in editing tools like Xigua. This technology improves natural transitions and supports various creative video effects.
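
Loop extension can be framed as finding two distant frames whose structure features are nearly identical, so playback can jump from one to the other seamlessly. The sketch below uses NumPy with random stand‑ins for the learned structure features; the brute‑force search and minimum frame gap are illustrative.

```python
import numpy as np

def find_loop_point(embeddings, min_gap=4):
    """Illustrative loop detection: normalize per-frame structure features,
    build a self-similarity matrix, and pick the most similar pair of
    frames far enough apart to serve as loop start/end points."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    ssm = e @ e.T                                  # self-similarity matrix
    best, best_pair = -np.inf, (0, 0)
    n = len(e)
    for i in range(n):
        for j in range(i + min_gap, n):
            if ssm[i, j] > best:
                best, best_pair = ssm[i, j], (i, j)
    return best_pair, best

frames = np.random.randn(50, 32)                   # 50 frames x 32-dim features
(start, end), score = find_loop_point(frames)
print(f"loop frames {start}->{end}, similarity {score:.2f}")
```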

ISMIR 2021 Paper: Supervised Metric Learning for Music Structure Features
Music structure detection

Music Source Separation Advances

A 143‑layer deep residual U‑Net decouples the estimation of magnitude and phase spectra, surpassing traditional magnitude‑only methods. The model reaches a vocal‑separation signal‑to‑distortion ratio of 8.98 dB, facilitating tasks such as background‑music replacement and high‑quality source extraction.
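
Decoupled estimation means the network outputs an unbounded magnitude mask and a phase correction separately, and the source is rebuilt from both. A minimal reconstruction sketch in PyTorch follows; the random mixture and network outputs are stand‑ins for the ResUNet’s actual predictions.

```python
import torch

def apply_decoupled_mask(mix_stft, mag_mask, phase_res):
    """Illustrative reconstruction step: scale the mixture magnitude with an
    estimated (unbounded) magnitude mask and rotate the mixture phase by an
    estimated phase residual, as in decoupled magnitude/phase estimation."""
    mag = mix_stft.abs() * mag_mask                   # estimated source magnitude
    phase = mix_stft.angle() + phase_res              # corrected source phase
    return torch.polar(mag, phase)                    # complex source spectrogram

# Toy usage: random mixture STFT and network outputs (stand-ins for ResUNet).
mix = torch.randn(257, 63, dtype=torch.complex64)
mag_mask = torch.rand(257, 63) * 2                    # mask may exceed 1
phase_res = torch.rand(257, 63) * 0.1
source_stft = apply_decoupled_mask(mix, mag_mask, phase_res)
wave = torch.istft(source_stft.unsqueeze(0), n_fft=512, hop_length=256,
                   window=torch.hann_window(512))     # back to the waveform
print(wave.shape)
```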

ISMIR 2021 Paper: Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation
Source separation network