How Ximalaya Used Generative AI to Revolutionize Audio Recommendations
This article details Ximalaya's journey from traditional multi‑stage recommendation pipelines to generative AI‑driven models, covering business challenges, architectural and model differences, phased deployments, knowledge distillation, semantic ID encoding, decoder‑only strategies, extensive offline and online evaluations, and future research directions.
Business Background
Ximalaya serves tens of millions of daily active users with hundreds of millions of diverse audio items; recommendation quality directly impacts user experience, retention, and commercial value. Compared with short‑video or e‑commerce, long‑form audio suffers from low feedback density, difficult content understanding, cold‑start hurdles, and high user decision cost, yet exhibits strong sequential consumption patterns that suit generative recommendation.
Technical Challenges
Low user feedback density: users often browse titles, descriptions, hosts, and categories before deciding to play, resulting in low click‑through rates and sparse interaction data.
Strong signal noise: passive listening (e.g., audio playing in the background) inflates play, completion, and duration metrics, so they do not reliably reflect satisfaction.
Difficult content understanding: long audio has high information density and uneven importance across segments; shallow features cannot capture core semantics, while fine‑grained encoding incurs heavy compute cost.
Amplified cold start: new long‑form items lack interaction data, making pure metadata‑based recommendation ineffective, while the high decision cost discourages users from trying unknown content.
Core Differences and Value
1. Architecture Differences
The multi‑stage cascade (recall → coarse ranking → fine ranking → re‑ranking) leads to inconsistent objectives across modules, fragmented maintenance, and low hardware utilization (e.g., CTR model CPU usage below 20%).
Generative recommendation replaces it with a unified end‑to‑end architecture that optimizes a single global objective, reduces feature engineering, and aligns model training with business goals.
2. Model Structure Differences
Generative recommendation model: Transformer‑based; predicts the next most likely item from a user's behavior sequence and captures long‑range dependencies, at the cost of a larger parameter count and higher computational complexity.
Traditional discriminative model: embedding + MLP pipeline; smaller parameter count and lower latency, but limited scalability and weak on cold start and sparse data. The sketch below contrasts the two.
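The layer sizes, class names, and hyperparameters in this PyTorch sketch are illustrative assumptions, not Ximalaya's production models:

```python
# Illustrative contrast between the two model families (not Ximalaya's code).
import torch
import torch.nn as nn

class DiscriminativeCTR(nn.Module):
    """Traditional embedding + MLP: scores one (user, item) pair at a time."""
    def __init__(self, n_items: int, dim: int = 64):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, user_vec, item_ids):
        x = torch.cat([user_vec, self.item_emb(item_ids)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)   # pointwise click probability

class GenerativeNextItem(nn.Module):
    """Transformer that autoregressively predicts the next item in a sequence."""
    def __init__(self, n_items: int, dim: int = 64, max_len: int = 256):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_items)             # scores the whole catalog

    def forward(self, seq):                             # seq: (batch, time) item ids
        t = seq.size(1)
        h = self.item_emb(seq) + self.pos_emb(torch.arange(t, device=seq.device))
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(seq.device)
        h = self.encoder(h, mask=mask)                  # causal: step i sees only <= i
        return self.head(h)                             # next-item logits at each step
```

The generative model pays for its long‑range sequence modeling with a softmax over the entire catalog at every step, which is exactly the parameter and compute trade‑off described above.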
Implementation Journey
April 2025 – research kickoff focusing on large‑model fine‑tuning to improve recommendations for low‑frequency users.
August 2025 – full rollout of the generative solution on the homepage album recommendation scene.
October 2025 – the playback‑page album recommendation scene fully covered.
November 2025 – encoder‑decoder end‑to‑end training deployed to all daily active users; model size reached ~1 billion parameters.
January 2026 – decoder‑only audio generation model launched on the homepage.
Phase 1: Direct LLM Recall
The team first attempted direct recall by feeding user text data to DeepSeek, but this faced prohibitive compute costs (parameter counts in the hundreds of billions) and domain mismatch: generic LLMs are optimized for language generation, not recommendation relevance.
Phase 2: Knowledge Distillation + Model Fine‑tuning
2.1 Architecture
Knowledge from a large LLM is distilled into a smaller model, preserving its reasoning ability while drastically reducing compute and storage. The distillation data consists of chain‑of‑thought traces generated by the large model for album recall tasks.
2.2 Knowledge Distillation
The distillation pipeline extracts the LLM's reasoning steps as supervision signals for the student model, so the smaller model learns the same multi‑step decision logic rather than only the final answers.
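How such pairs might be assembled is sketched below; the prompt wording, the teacher_generate callback, and the "Answer:" convention are assumptions for illustration, not Ximalaya's actual pipeline:

```python
# Hypothetical construction of chain-of-thought distillation pairs.
def build_distillation_sample(user_profile: str, history: list[str],
                              teacher_generate) -> dict:
    """Ask the large teacher model to recall albums *with* its reasoning,
    then keep (prompt, response) as supervision for the student."""
    prompt = (
        f"User profile: {user_profile}\n"
        f"Recent listening history: {', '.join(history)}\n"
        "Think step by step about the user's interests, then recommend "
        "5 albums. End with a line starting with 'Answer:'."
    )
    teacher_output = teacher_generate(prompt)            # call to the large LLM
    reasoning, _, answer = teacher_output.rpartition("Answer:")
    return {
        "prompt": prompt,
        "response": f"{reasoning.strip()}\nAnswer: {answer.strip()}",
    }
```

The student is then fine‑tuned on these pairs with an ordinary next‑token loss, so it imitates the teacher's multi‑step reasoning instead of memorizing outputs.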
2.3 Model Fine‑tuning
Domain data fine‑tuning: adapt the LLM to Ximalaya's audio domain by training on album titles, hosts, categories, tags, play counts, and sentiment metrics.
Intent‑aware prompt engineering: combine long‑term interests, short‑term interests, search queries, subscriptions, user profile, and external entry intent into structured <Prompt, Response> pairs.
Training pipeline: tokenize user profile features and behavior sequences, use LoRA for parameter‑efficient adaptation, and merge the LoRA weights back into the base Qwen model, as sketched below.
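A minimal sketch of that LoRA step with Hugging Face peft; the checkpoint name and hyperparameters are assumptions, since the source only states that a Qwen base model was used:

```python
# Sketch of parameter-efficient fine-tuning with LoRA (illustrative settings).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-1.8B")   # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],       # attention only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)    # only the small adapter matrices train
model.print_trainable_parameters()

# ... train on the <Prompt, Response> pairs with a causal LM loss ...

merged = model.merge_and_unload()         # fold LoRA weights back into the base
merged.save_pretrained("qwen-ximalaya-merged")
```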
2.4 Model Application
Offline batch inference merges the LoRA‑adapted Qwen parameters into the base model, then runs on the vLLM inference framework. With vLLM and additional optimizations, throughput rose from 0.28 users/s to 3.14 users/s (a roughly 11× gain), handling ~2.1 million users per A800 GPU per day.
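The batch job could look roughly like this; the model path, sampling settings, and the prompt-loading helper are hypothetical stand‑ins:

```python
# Sketch of offline batch recall with vLLM (paths and helpers are hypothetical).
from vllm import LLM, SamplingParams

def load_user_prompts() -> list[str]:
    # Hypothetical: fetch one intent-aware prompt per user from storage.
    return ["User profile: ...\nRecent listening history: ...\nRecommend 5 albums."]

llm = LLM(model="qwen-ximalaya-merged", tensor_parallel_size=1)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

outputs = llm.generate(load_user_prompts(), params)   # continuous batching
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)      # hypothetical sink: write recalls to a KV store
```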
Phase 3: Encoder‑Decoder with Semantic ID
To overcome the limitations of raw item IDs, a semantic ID (SID) is generated via RQ‑VAE, which encodes side information (title, host, category, tags, etc.) into a compact multi‑level code. This reduces token count, improves cold‑start performance, and enables fine‑grained semantic similarity.
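The heart of RQ‑VAE is residual quantization; a minimal sketch follows, assuming four codebooks of 256 codewords each (the actual sizes are not stated in the source):

```python
# Minimal residual-quantization sketch: each level quantizes the residual left
# by the previous one, so an item embedding collapses to a short code (a,b,c,d).
import torch

def rq_encode(x: torch.Tensor, codebooks: list[torch.Tensor]) -> list[int]:
    """x: (dim,) item embedding built from side information (title, host, tags...).
    codebooks: one (codebook_size, dim) tensor per quantization level."""
    residual, code = x.clone(), []
    for cb in codebooks:
        dists = torch.cdist(residual.unsqueeze(0), cb).squeeze(0)
        idx = int(dists.argmin())          # nearest codeword at this level
        code.append(idx)
        residual = residual - cb[idx]      # hand the remainder to the next level
    return code                            # e.g. [189, 127, 185, 152] -> 4 tokens

codebooks = [torch.randn(256, 32) for _ in range(4)]   # assumed: 4 levels x 256 codes
sid = rq_encode(torch.randn(32), codebooks)
```

Four levels of 256 codewords address 256^4 distinct items with only four tokens, and semantically similar items tend to share code prefixes, which is what helps cold start.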
Phase 4: Decoder‑only Audio Generation
This stage adopts a decoder‑only LLM (Qwen1.5B‑instruct) in which the user's playback sequence, expressed as SID tokens, is treated as a textual prompt. The model predicts the next item's token sequence autoregressively, eliminating the need for extensive feature engineering. An example dialogue:
```
【system】You are a content recommendation expert. Each item is represented by four consecutive tokens, e.g., <a_1><b_1><c_1><d_1>.
【user】'<a_189><b_127><c_185><d_152>', '<a_110><b_107><c_189><d_169>', '<a_131><b_18><c_15><d_174>', '<a_131><b_170><c_177><d_246>', '<a_110><b_153><c_17><d_202>', '<a_110><b_207><c_193><d_156>'
【assistant】'<a_189><b_127><c_74><d_13>'
```
Effect Evaluation
Offline metrics: top‑N recall and accuracy improved by 40–50% compared with the original Qwen‑14B baseline.
Online A/B tests: significant uplift in user activity, CTR, and completion rates across homepage and playback‑page scenarios.
Reinforcement Learning Exploration
Future work integrates SFT + RLHF to align the generative recommender with human preferences. A composite reward function includes format reward (ensuring correct SID output), existence reward (preventing generation of non‑existent items), correctness reward (matching ground truth), and ranking reward (promoting top‑probability correct answers).
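A hypothetical sketch of such a composite reward; the regex, weights, and rank handling are illustrative assumptions:

```python
# Illustrative composite reward for RLHF over SID outputs (weights assumed).
import re

SID_PATTERN = re.compile(r"^<a_\d+><b_\d+><c_\d+><d_\d+>$")

def composite_reward(generated: str, ground_truth: str,
                     catalog: set[str], rank: int) -> float:
    reward = 0.0
    if SID_PATTERN.match(generated):
        reward += 0.1                      # format: well-formed four-token SID
        if generated in catalog:
            reward += 0.2                  # existence: no hallucinated items
        if generated == ground_truth:
            reward += 1.0                  # correctness: matches ground truth
            reward += 0.5 / (rank + 1)     # ranking: correct answer ranked high
    return reward
```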
Future Outlook
Multimodal capability upgrades: incorporate image and text modalities to enrich item representations.
Iterative training paradigms: combine next‑token prediction (NTP) pre‑training with RLHF for better long‑term preference modeling.
Inference performance optimization: adopt state‑of‑the‑art inference frameworks to further improve latency and resource utilization.
Ximalaya Technology Team
The official account of Ximalaya's technology team, sharing distilled technical experience and insights.