How Ximalaya Used Generative AI to Revolutionize Audio Recommendations

This article details Ximalaya's journey from traditional multi‑stage recommendation pipelines to generative AI‑driven models, covering business challenges, architectural and model differences, phased deployments, knowledge distillation, semantic ID encoding, decoder‑only strategies, extensive offline and online evaluations, and future research directions.

Ximalaya Technology Team

Business Background

Ximalaya serves tens of millions of daily active users with hundreds of millions of diverse audio items; recommendation quality directly impacts user experience, retention, and commercial value. Compared with short‑video or e‑commerce, long‑form audio suffers from low feedback density, difficult content understanding, cold‑start hurdles, and high user decision cost, yet exhibits strong sequential consumption patterns that suit generative recommendation.

Technical Challenges

Low feedback density: Users often browse titles, descriptions, hosts, and categories before deciding to play anything, resulting in low click‑through rates and sparse interaction data.

Noisy signals: Passive listening (e.g., background playback) inflates play, completion, and duration metrics, so these signals do not reliably reflect satisfaction.

Difficult content understanding: Long‑form audio has high information density and uneven importance across segments; shallow features cannot capture the core semantics, while fine‑grained encoding incurs heavy compute cost.

Amplified cold start: New long‑form items lack interaction data, making pure metadata‑based recommendation ineffective, while the high decision cost discourages users from trying unknown content.

Core Differences and Value

1. Architecture Differences

The multi‑stage cascade (recall → coarse ranking → fine ranking → re‑ranking) suffers from inconsistent objectives across stages, module fragmentation, and low compute utilization (e.g., the CTR model's CPU usage is below 20%).

Generative recommendation proposes a unified end‑to‑end architecture that optimizes a global objective, reduces feature engineering, and aligns model training with business goals.

2. Model Structure Differences

Generative recommendation model : Transformer‑based, predicts the next most likely item from a user’s behavior sequence and captures long‑range dependencies, but has a larger parameter count and higher computational complexity.

Traditional discriminative model : Embedding + MLP pipeline, smaller parameter count, lower latency, but suffers from scalability limits, cold‑start, and sparsity issues.
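The generative formulation treats recommendation as next-item prediction over a behavior sequence. The toy sketch below illustrates the idea with simple transition counts standing in for the Transformer's learned conditional distribution; a real model conditions on the full history, not just the last item, and all names here are illustrative.

```python
from collections import Counter, defaultdict

def train_next_item(sequences):
    """Count item-to-item transitions; a stand-in for the Transformer's
    learned distribution P(next item | history). A real generative
    recommender conditions on the entire sequence, not just the last item."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev][nxt] += 1
    return transitions

def predict_next(transitions, history, k=2):
    """Return the k most likely next items given the last item listened to."""
    last = history[-1]
    return [item for item, _ in transitions[last].most_common(k)]

# Toy listening histories (album IDs).
histories = [
    ["a1", "a2", "a3"],
    ["a1", "a2", "a4"],
    ["a5", "a2", "a3"],
]
model = train_next_item(histories)
print(predict_next(model, ["a9", "a2"], k=1))  # → ['a3']
```

The discriminative Embedding + MLP baseline would instead score each (user, candidate) pair independently, which is why it struggles to exploit the sequential structure this toy captures.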

Implementation Journey

April 2025 – research kickoff focusing on large‑model fine‑tuning to improve recommendations for low‑frequency users.

August 2025 – full rollout of the generative solution on the homepage album recommendation scene.

October 2025 – full coverage of the playback‑page album recommendation scene.

November 2025 – encoder‑decoder end‑to‑end training deployed to all daily active users; model size reached ~1 billion parameters.

January 2026 – decoder‑only audio generation model launched on the homepage.

Phase 1: Direct LLM Recall

The team attempted direct recall by feeding user text data to DeepSeek, but hit prohibitive computation cost (a parameter count in the hundreds of billions) and domain mismatch: generic LLMs are optimized for language generation, not recommendation relevance.

Phase 2: Knowledge Distillation + Model Fine‑tuning

2.1 Architecture

Knowledge from a large LLM is distilled into a smaller model, preserving inference ability while drastically reducing compute and storage. The distillation process uses chain‑of‑thought data generated by the large model for album recall tasks.

2.2 Knowledge Distillation

The distillation pipeline extracts the LLM’s reasoning steps as supervision signals for the student model, ensuring the smaller model learns the same multi‑step decision logic.
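One common way to package chain-of-thought traces as distillation supervision is to pair a task prompt with the teacher's reasoning plus final answer, so the student learns the decision logic and not just the label. The sketch below is a minimal illustration under that assumption; the field names and prompt wording are hypothetical, not Ximalaya's actual schema.

```python
def build_distillation_example(user_profile, candidates,
                               teacher_reasoning, teacher_answer):
    """Package a teacher LLM's chain-of-thought recall trace as a
    <prompt, target> pair for supervised fine-tuning of the student.
    All field names here are illustrative, not the production schema."""
    prompt = (
        f"User profile: {user_profile}\n"
        f"Candidate albums: {', '.join(candidates)}\n"
        "Recommend the best album and explain your reasoning."
    )
    # The student is trained to reproduce both the reasoning steps and the
    # final answer, so it imitates the teacher's multi-step decision logic.
    target = f"Reasoning: {teacher_reasoning}\nAnswer: {teacher_answer}"
    return {"prompt": prompt, "target": target}

example = build_distillation_example(
    user_profile="likes history podcasts, listens at night",
    candidates=["Ming Dynasty Stories", "Sleep Sounds", "Stock Tips"],
    teacher_reasoning="The user favors history content at night; "
                      "'Ming Dynasty Stories' matches best.",
    teacher_answer="Ming Dynasty Stories",
)
print(example["target"].splitlines()[-1])  # → Answer: Ming Dynasty Stories
```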

2.3 Model Fine‑tuning

Domain data fine‑tuning : Adapt the LLM to Ximalaya’s audio domain by training on album titles, hosts, categories, tags, play counts, and sentiment metrics.

Intent‑aware prompt engineering : Combine long‑term interests, short‑term interests, search queries, subscriptions, user profile, and external entry intent into a structured <Prompt, Response> pair.

Training pipeline : Tokenize user profile features and behavior sequences, use LoRA for efficient parameter adaptation, and merge LoRA weights with the base Qwen model.
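The LoRA step above relies on the fact that a low-rank adapter can be folded back into the frozen base weight before inference. A minimal numpy sketch of that math (toy dimensions, random "trained" factors; the real adapters sit inside Qwen's attention and MLP matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 4       # toy sizes; LoRA rank r << d

W = rng.standard_normal((d_out, d_in))   # frozen base weight (e.g., one Qwen matrix)
A = rng.standard_normal((r, d_in))       # trainable low-rank factor
B = rng.standard_normal((d_out, r))      # pretend training has already updated B
                                         # (LoRA initializes B to zero)

def forward_with_adapter(x):
    # During training/serving with a live adapter: base path plus the
    # scaled low-rank update (alpha / r) * B @ A @ x.
    return W @ x + (alpha / r) * (B @ (A @ x))

# "Merging" folds the adapter into the base weight, as done before handing
# the model to the inference framework; afterwards the adapter is free.
W_merged = W + (alpha / r) * (B @ A)

x = rng.standard_normal(d_in)
assert np.allclose(forward_with_adapter(x), W_merged @ x)
```

Because the merged matrix has the same shape as the original, the serving stack sees an ordinary dense model with zero adapter overhead.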

2.4 Model Application

Offline batch inference first merges the LoRA‑adapted parameters into the base Qwen model, then runs on the vLLM inference framework. With vLLM and additional optimizations, throughput rose from 0.28 users/s to 3.14 users/s (over 10×), handling ~2.1 million users per A800 GPU per day.

Phase 3: Encoder‑Decoder with Semantic ID

To overcome raw Item‑ID limitations, a semantic ID (SID) is generated via RQ‑VAE, encoding side‑information (title, host, category, tags, etc.) into a compact multi‑dimensional code. This reduces token count, improves cold‑start performance, and enables fine‑grained semantic similarity.
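The quantizer at the heart of RQ-VAE can be sketched independently of the learned encoder: at each level, pick the nearest codeword, subtract it, and quantize the remainder, so each level refines the previous one. The sketch below uses fixed random codebooks purely for illustration; in RQ-VAE the codebooks are learned jointly with an encoder/decoder over the item's side information.

```python
import numpy as np

def residual_quantize(v, codebooks):
    """Greedy residual quantization: one codeword index per level.
    The resulting index list is the item's semantic ID (SID)."""
    codes, residual = [], v.copy()
    for cb in codebooks:  # one codebook per SID level
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]  # quantize what the level missed
    return codes

rng = np.random.default_rng(1)
# 4 levels of 256 codewords -> a 4-token SID, matching <a_*><b_*><c_*><d_*>.
codebooks = [rng.standard_normal((256, 4)) for _ in range(4)]
item_embedding = rng.standard_normal(4)  # from side info: title, host, category, tags
sid = residual_quantize(item_embedding, codebooks)
print(sid)  # four integers in [0, 256), one per SID level
```

Because similar items land on shared code prefixes, SIDs give new items meaningful neighbors even before any interaction data exists, which is what helps cold start.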

Phase 4: Decoder‑only Audio Generation

This phase adopts a decoder‑only LLM (Qwen1.5B‑instruct) in which the user’s playback sequence is serialized as a textual prompt. The model then predicts the next item’s token sequence autoregressively, eliminating the need for extensive feature engineering.

【system】You are a content recommendation expert. Each item is represented by four consecutive tokens, e.g., <a_1><b_1><c_1><d_1>.
【user】'<a_189><b_127><c_185><d_152>', '<a_110><b_107><c_189><d_169>', '<a_131><b_18><c_15><d_174>', '<a_131><b_170><c_177><d_246>', '<a_110><b_153><c_17><d_202>', '<a_110><b_207><c_193><d_156>'
【assistant】'<a_189><b_127><c_74><d_13>'
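Assembling the chat turns above from raw SID codes is mostly string plumbing. A small sketch, assuming the four-level SID layout shown in the example (helper names are illustrative):

```python
def sid_to_tokens(codes):
    """Render a 4-level semantic ID as the four consecutive tokens the
    decoder-only model consumes, matching the <a_*><b_*><c_*><d_*> format."""
    prefixes = ["a", "b", "c", "d"]
    return "".join(f"<{p}_{c}>" for p, c in zip(prefixes, codes))

def build_chat(history_sids):
    """Assemble the system/user messages; the assistant turn is what the
    model generates autoregressively (the next item's SID tokens)."""
    system = ("You are a content recommendation expert. Each item is "
              "represented by four consecutive tokens, e.g., "
              "<a_1><b_1><c_1><d_1>.")
    user = ", ".join(f"'{sid_to_tokens(c)}'" for c in history_sids)
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

msgs = build_chat([[189, 127, 185, 152], [110, 107, 189, 169]])
print(msgs[1]["content"])
# → '<a_189><b_127><c_185><d_152>', '<a_110><b_107><c_189><d_169>'
```

Constrained decoding over the `<a_*>`…`<d_*>` vocabulary then guarantees the assistant turn is always a syntactically valid SID.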

Effect Evaluation

Offline metrics: top‑N recall and accuracy improved by 40–50% compared with the original Qwen‑14B baseline.

Online A/B tests: significant uplift in user activity, CTR, and completion rates across homepage and playback‑page scenarios.

Reinforcement Learning Exploration

Future work integrates SFT + RLHF to align the generative recommender with human preferences. A composite reward function includes format reward (ensuring correct SID output), existence reward (preventing generation of non‑existent items), correctness reward (matching ground truth), and ranking reward (promoting top‑probability correct answers).
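The four reward terms compose naturally as a weighted sum. The sketch below is a toy version of that composite; the weights, regex format check, and top-1 proxy for the ranking term are all assumptions for illustration, not the production reward.

```python
import re

def composite_reward(output, catalog, ground_truth,
                     weights=(0.1, 0.2, 0.5, 0.2)):
    """Toy composite reward over SID outputs. The four terms mirror the
    format / existence / correctness / ranking rewards; the weights and
    exact formulas here are illustrative assumptions."""
    w_fmt, w_exist, w_correct, w_rank = weights
    # Format reward: output must be a well-formed 4-token SID.
    fmt_ok = re.fullmatch(r"<a_\d+><b_\d+><c_\d+><d_\d+>", output) is not None
    # Existence reward: the SID must map to a real catalog item.
    exists = fmt_ok and output in catalog
    # Correctness reward: exact match against the ground-truth next item.
    correct = output == ground_truth
    # Ranking reward: here simply 1 if the correct item is the top sample;
    # a real system would score the ground truth's rank under the policy.
    top1 = correct
    return (w_fmt * fmt_ok + w_exist * exists
            + w_correct * correct + w_rank * top1)

catalog = {"<a_189><b_127><c_74><d_13>", "<a_1><b_2><c_3><d_4>"}
r = composite_reward("<a_189><b_127><c_74><d_13>", catalog,
                     ground_truth="<a_189><b_127><c_74><d_13>")
print(r)  # → 1.0 (all four reward terms satisfied)
```

Keeping the format and existence terms separate from correctness lets the policy earn partial credit for valid-but-wrong outputs, which stabilizes early RLHF training.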

Future Outlook

Multimodal capability upgrades: incorporate image and text modalities to enrich item representations.

Iterative training paradigms: combine next‑token pre‑training (NTP) with RLHF for better long‑term preference modeling.

Inference performance optimization: adopt state‑of‑the‑art inference frameworks to further improve latency and resource utilization.

Tags: recommendation systems, reinforcement learning, generative AI, knowledge distillation, encoder‑decoder, semantic ID, audio recommendation