
Application and Exploration of Large Audio Representation Models for Cold-Start Songs in QQ Music

This article presents a technical overview of how large‑scale audio representation models are fine‑tuned with I2I co‑occurrence and U2I interaction data to improve cold‑start song recommendation on QQ Music, describing the challenges, methodology, deployment scenarios, and experimental results.

DataFunTalk

This document summarizes a talk from the DataFun Summit 2024 Recommendation Architecture track, in which the author presents work on applying large‑scale audio representation models to the cold‑start problem in QQ Music's recommendation system.

1. Background and challenges

Problem and background: QQ Music faces a long‑tail distribution similar to short‑video and note recommendation: many newly released or niche back‑catalog songs receive insufficient exposure. Existing ID‑based recommendation struggles because cold items lack historical interactions, leaving their ID embeddings under‑trained.

Application challenges: The semantic space of audio embeddings differs from the recommendation behavior space, and audio embeddings are typically high‑dimensional, so directly concatenating them with ID embeddings is ineffective without dimensionality reduction and space alignment.

Common usage: Simple concatenation, or concatenation after PCA reduction, of raw audio embeddings often degrades performance due to noise and space mismatch. Converting embeddings to categorical tags is more stable but still limited.
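To make the baseline concrete, here is a minimal numpy sketch of the PCA‑reduce‑then‑concatenate approach described above; the dimensions (1024‑dim audio, 40‑dim ID) and the random data are illustrative, not the production setup. This is the variant the article notes often underperforms because the two spaces are not aligned.

```python
import numpy as np

def pca_reduce(emb, k):
    """Project embeddings onto their top-k principal components via SVD."""
    centered = emb - emb.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(1000, 1024))  # hypothetical high-dim audio embeddings
id_emb = rng.normal(size=(1000, 40))       # hypothetical trained ID embeddings

# The "common usage" baseline: PCA-reduce the audio side, then concatenate.
combined = np.concatenate([pca_reduce(audio_emb, 40), id_emb], axis=1)
```

Even with matched dimensions, the concatenated halves still live in different spaces, which is the mismatch the fine‑tuning approaches below address.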

2. Fine‑tuning with I2I co‑occurrence data

2.1 MPE representation: A triplet‑loss training scheme (two co‑occurring songs as anchor and positive, plus a sampled negative) produces a 40‑dim embedding (MPE) that incorporates I2I co‑occurrence signals.
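The triplet objective can be sketched as follows; the projection head, batch size, and margin are assumptions for illustration, and production training would run on the real co‑occurrence pairs rather than random data.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: pull co-occurring song pairs together, push sampled negatives apart."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

rng = np.random.default_rng(42)
proj = rng.normal(scale=0.02, size=(1024, 40))  # head: audio embedding -> 40-dim MPE

# anchor/positive stand in for songs that co-occur in I2I data; negative is sampled.
a, p, n = (rng.normal(size=(8, 1024)) @ proj for _ in range(3))
loss = triplet_loss(a, p, n)
```

Minimizing this loss pulls co‑occurring songs together in the 40‑dim MPE space, so nearest‑neighbor search there reflects behavioral similarity rather than pure audio similarity.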

2.2 Applications: MPE embeddings drive the I2I2U song placement pipeline: retrieve similar songs, fetch the users who have collected them, and push the cold songs to those users via Redis. They also improve I2I recall in the autoplay and AI‑voice synthesis scenarios.
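The I2I2U flow above can be sketched in a few lines; the dicts below stand in for the MPE similarity index and the Redis‑backed collection index, and all names are illustrative rather than the production schema.

```python
# Toy stand-ins for the MPE nearest-neighbor index and the user-collection store.
mpe_neighbors = {"cold_song": ["hit_a", "hit_b"]}            # I2I: MPE-similar songs
collectors = {"hit_a": {"u1", "u2"}, "hit_b": {"u2", "u3"}}  # item -> collecting users

def i2i2u_candidates(cold_song):
    """Retrieve MPE-similar songs, then the users who collected them."""
    users = set()
    for sim in mpe_neighbors.get(cold_song, []):  # step 1: I2I lookup
        users |= collectors.get(sim, set())       # step 2: I2U expansion
    return users  # step 3 (not shown): queue a push per user, e.g. in Redis

target_users = i2i2u_candidates("cold_song")
```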

2.3 Aligning ID‑Embed and MPE: Direct concatenation remains ineffective; instead, similar‑song IDs derived from MPE are aggregated (mean pooling or a Transformer) into auxiliary item features.
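The mean‑pooling variant of this aggregation is simple to sketch; the toy 40‑dim ID table below is an assumption, and in the real model the lookup would hit the trained ID‑embedding table.

```python
import numpy as np

rng = np.random.default_rng(3)
id_table = {s: rng.normal(size=40) for s in ["s1", "s2", "s3"]}  # toy ID embeddings

def similar_id_feature(sim_ids, table):
    """Mean-pool the ID embeddings of MPE-similar songs into one auxiliary feature."""
    return np.stack([table[s] for s in sim_ids]).mean(axis=0)

aux = similar_id_feature(["s1", "s2", "s3"], id_table)
```

Because the pooled vector is built from well‑trained ID embeddings of behaviorally similar songs, it already lives in the recommendation space, sidestepping the alignment problem that raw audio embeddings have.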

2.4 MPE‑similar ID usage: The similar‑ID feature is incorporated into a dual‑tower model for U2I placement and into re‑ranking to favor long‑tail songs, yielding noticeable online gains.

3. Fine‑tuning with U2I interaction data

3.1 Comparison: U2I data is two to three orders of magnitude larger than I2I data, requiring efficient training strategies.

3.2 Incorporating U2I signals: (1) Load the pre‑trained high‑dimensional audio embeddings into a sparse lookup table (via HDFS or a lightweight regression model). (2) Add an encoder (a simple MLP) that maps these embeddings to the same dimension as the ID embeddings, enabling end‑to‑end training.
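Step (2) can be sketched as a tiny MLP; the layer sizes and initialization are assumptions, and in production the weights would be trained end‑to‑end against the U2I interaction loss rather than left random.

```python
import numpy as np

class AudioEncoder:
    """Tiny MLP mapping a frozen, pre-trained audio embedding to the ID-embedding dim."""
    def __init__(self, in_dim=1024, hidden=256, out_dim=40, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.02, size=(in_dim, hidden))
        self.w2 = rng.normal(scale=0.02, size=(hidden, out_dim))

    def __call__(self, x):
        # One ReLU hidden layer, then a linear projection to the ID-embedding size.
        return np.maximum(0.0, x @ self.w1) @ self.w2

enc = AudioEncoder()
mapped = enc(np.random.default_rng(1).normal(size=(4, 1024)))  # batch of 4 songs
```

Keeping the encoder small means the expensive audio model runs offline once per song, while only this light mapping participates in recommendation training.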

3.3 Using the encoder: The encoder can be extracted to generate mapped audio features which, when used to initialize ID embeddings, improve cold‑start performance.

Key findings: (1) Large‑model audio features need task‑specific fine‑tuning; (2) U2I‑fine‑tuned embeddings outperform I2I‑fine‑tuned ones; (3) Initializing ID embeddings with the mapped audio features is a practical way to bridge the gap.

3.4 Model update workflow: The encoder is updated periodically (not daily); ID embeddings for new items are initialized with the encoder output, while existing embeddings remain unchanged, keeping update costs low.
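The new‑items‑only initialization rule above can be sketched as follows; the `encoder` here is a stand‑in for the trained mapping encoder from section 3.2, and the table layout is illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def encoder(audio_emb):
    """Stand-in for the trained U2I mapping encoder (section 3.2)."""
    return np.tanh(audio_emb[:40])

id_table = {"old_song": rng.normal(size=40)}  # existing rows stay as trained
before = id_table["old_song"].copy()

def warm_start(item_id, audio_emb, table):
    """Initialize only unseen items from mapped audio features; never overwrite."""
    if item_id not in table:
        table[item_id] = encoder(audio_emb)
    return table[item_id]

warm_start("new_song", rng.normal(size=1024), id_table)
warm_start("old_song", rng.normal(size=1024), id_table)  # no-op: already trained
```

Because only unseen IDs are touched, the periodic encoder refresh never perturbs embeddings that have already converged on real interactions.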

4. Summary and Outlook

The presented solutions demonstrate three practical pathways—content‑only, I2I‑based, and U2I‑based—to leverage large audio representation models for alleviating cold‑start issues in music recommendation, and the approaches are expected to generalize to other modalities.

Future work includes joint I2I and U2I fine‑tuning and exploring lighter, end‑to‑end architectures to further improve recommendation performance.

Tags: music recommendation, large model, audio representation, cold-start recommendation, I2I fine-tuning, U2I fine-tuning
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
