How DeepRec Cut Ximalaya AI Cloud Training Time by 50% and Boosted CTR
Ximalaya’s AI Cloud platform uses Alibaba’s DeepRec to handle high‑dimensional sparse features, cut model training time by more than 50%, enable minute‑level model updates, and improve recommendation metrics. This article outlines the implementation details, multi‑tier storage, real‑time training, and planned inference enhancements.
Business Overview
The Ximalaya app recommends content in scenarios such as Daily Must-Listen, Hot Today and Private FM. Ximalaya AI Cloud is a one‑stop algorithm platform covering data, features, models and services. Its visual modeling and component‑based pipelines let users build data→feature→sample→model→service workflows without writing code.
Collaboration Background
Rapid growth in algorithm capability and in the search and advertising businesses pushed the platform from traditional machine learning to deep learning, which demands more samples, higher feature dimensionality and more complex models. The stack uses Spark for data processing, Kubernetes for GPU scheduling, and TensorFlow for training, but it faced two main pain points:
High‑dimensional Sparse Feature Support
Hash collision: Simple hashing leads to >20% collision at tens of millions of buckets; a multi‑hash scheme reduces it to 0.2‰ with 95% fewer parameters, though it expands long‑sequence IDs.
Feature admission/eviction/variable length: Proper configuration shrinks model size and stabilizes metrics, enabling training of models with billions of feature dimensions.
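The multi‑hash idea mentioned above can be illustrated with a small sketch. This is not Ximalaya's or DeepRec's implementation: the helper names, the use of BLAKE2b as the hash, and the bucket sizes are all assumptions chosen only to show why addressing each ID by a tuple of slots in several small tables collides far less often than one slot in a single large table, while storing fewer rows.

```python
import hashlib
from collections import Counter


def _hash(key: int, seed: int, buckets: int) -> int:
    """Deterministic seeded hash of an integer ID into [0, buckets)."""
    digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % buckets


def single_hash_slot(key: int, buckets: int) -> int:
    """One table: the ID maps to exactly one row."""
    return _hash(key, 0, buckets)


def multi_hash_slots(key: int, buckets: int, num_hashes: int = 2) -> tuple:
    """Several small tables: the ID maps to one row per table.
    The final embedding would combine those rows (e.g. by summing them),
    so two IDs fully collide only if every slot matches."""
    return tuple(_hash(key, seed, buckets) for seed in range(1, num_hashes + 1))


def collision_rate(slots) -> float:
    """Fraction of IDs whose slot (or slot tuple) is shared with another ID."""
    counts = Counter(slots)
    return sum(1 for s in slots if counts[s] > 1) / len(slots)


# Hypothetical sizes, scaled down so the example runs quickly.
NUM_IDS = 20_000
single_rows = 20_000          # one table with 20,000 rows
multi_rows = 2 * 2_000        # two tables with 2,000 rows each

ids = range(NUM_IDS)
single_rate = collision_rate([single_hash_slot(i, single_rows) for i in ids])
multi_rate = collision_rate([multi_hash_slots(i, 2_000) for i in ids])

print(f"single hash: {single_rows} rows, collision rate {single_rate:.3f}")
print(f"multi hash:  {multi_rows} rows, collision rate {multi_rate:.3f}")
```

The trade‑off noted in the text also shows up here: every ID now triggers one lookup per table, so long ID sequences expand accordingly.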
Rapid Model Iteration
Minute‑level updates: Current daily model updates are insufficient for high‑frequency scenarios; data back‑flow, processing, training and upload times limit speed, and full‑model deployments increase latency.
To address these, the team evaluated Alibaba DeepRec, Tencent TFRA/DynamicEmbedding and Nvidia HugeCTR, and chose Alibaba’s open‑source DeepRec.
DeepRec Deployment
High‑dimensional Sparse Features
DeepRec’s EmbeddingVariable uses a dynamic hash‑map‑like structure, allowing elastic feature counts, reducing conflicts and memory usage. It supports feature admission, eviction, and multi‑tier storage.
Feature Admission/Eviction
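Admission is counter‑based: a feature is admitted once it has been seen often enough. Eviction is based on the global step: features that have not been updated for a configured number of steps are removed. By default, EV is enabled only for high‑dimensional sparse features and disabled for all others.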
Unadmitted Features
Unadmitted features share the same initialization as admitted ones but are not updated until they become admitted; serving returns a default value of 0.
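The admission, eviction and unadmitted‑feature behavior described above can be modeled with a toy, dictionary‑backed table. This is a minimal sketch in plain Python, not DeepRec's EmbeddingVariable API: the class name `ToyEmbeddingTable` and the parameters `admit_after` and `steps_to_live` are hypothetical names used only for illustration.

```python
import random


class ToyEmbeddingTable:
    """Toy model of a dynamic embedding table (hypothetical, not DeepRec code)."""

    def __init__(self, dim, admit_after, steps_to_live, seed=0):
        self.dim = dim
        self.admit_after = admit_after      # sightings needed before admission
        self.steps_to_live = steps_to_live  # idle steps allowed before eviction
        self.counts = {}                    # key -> number of times seen
        self.vectors = {}                   # key -> embedding vector
        self.last_step = {}                 # key -> global step of last sighting
        self._rng = random.Random(seed)

    def is_admitted(self, key):
        return self.counts.get(key, 0) >= self.admit_after

    def train_step(self, key, grad, step, lr=0.1):
        """Record a sighting; apply the gradient only if the key is admitted."""
        self.counts[key] = self.counts.get(key, 0) + 1
        self.last_step[key] = step
        if key not in self.vectors:
            # Same initializer for every key, admitted or not.
            self.vectors[key] = [self._rng.gauss(0.0, 0.01) for _ in range(self.dim)]
        if self.is_admitted(key):
            self.vectors[key] = [v - lr * g for v, g in zip(self.vectors[key], grad)]

    def evict(self, step):
        """Drop keys not seen within the last `steps_to_live` global steps."""
        stale = [k for k, s in self.last_step.items() if step - s > self.steps_to_live]
        for key in stale:
            for table in (self.counts, self.vectors, self.last_step):
                table.pop(key, None)
        return stale

    def serve(self, key):
        """Serving returns zeros for unknown or not-yet-admitted keys."""
        if self.is_admitted(key):
            return list(self.vectors[key])
        return [0.0] * self.dim


table = ToyEmbeddingTable(dim=2, admit_after=3, steps_to_live=10)
for step in (1, 2):
    table.train_step("item_42", [1.0, 1.0], step=step)
print(table.serve("item_42"))        # still unadmitted: zeros
table.train_step("item_42", [1.0, 1.0], step=3)
print(table.is_admitted("item_42"))  # admitted on the third sighting
print(table.evict(step=20))          # idle for more than 10 steps: evicted
```

Because the table is a map rather than a fixed‑size array, the number of stored features grows and shrinks with the data, which is the property the text attributes to the hash‑map‑like structure.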
EV Analysis
DeepRec provides an analysis component that reports feature names, IDs, vectors, update frequencies and recent steps after each training run, aiding parameter tuning.
Multi‑Level Feature Storage
Embedding parameters can be stored across HBM, DRAM and SSD with a cache strategy, keeping hot features in fast memory and reducing overall training memory consumption.
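As a rough illustration of keeping hot features in fast memory, the sketch below models two tiers with a least‑recently‑used policy. The source only says "a cache strategy", so the LRU choice, the class name `TieredStore` and the two‑tier layout (a fast tier standing in for HBM/DRAM, a slow tier standing in for SSD) are assumptions for illustration, not DeepRec's implementation.

```python
from collections import OrderedDict


class TieredStore:
    """Two-tier key/value store with LRU demotion (hypothetical sketch)."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stand-in for HBM/DRAM, ordered cold -> hot
        self.slow = {}             # stand-in for SSD

    def put(self, key, value):
        """Write a value; new or updated keys always land in the fast tier."""
        self.slow.pop(key, None)
        self._put_fast(key, value)

    def get(self, key):
        """Read a value, promoting it to the fast tier if it was in the slow one."""
        if key in self.fast:
            self.fast.move_to_end(key)   # mark as most recently used
            return self.fast[key]
        value = self.slow.pop(key)       # raises KeyError if the key is unknown
        self._put_fast(key, value)
        return value

    def _put_fast(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        # Demote the least recently used entries once the fast tier is full.
        while len(self.fast) > self.fast_capacity:
            cold_key, cold_value = self.fast.popitem(last=False)
            self.slow[cold_key] = cold_value


store = TieredStore(fast_capacity=2)
for feature_id in ("a", "b", "c"):
    store.put(feature_id, [0.0, 0.0])
print(sorted(store.fast), sorted(store.slow))  # "a" was demoted to the slow tier
store.get("a")                                 # reading "a" promotes it again
print(sorted(store.fast), sorted(store.slow))
```

Only `fast_capacity` entries ever occupy the fast tier, which is how a tiered layout bounds the memory needed during training.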
Real‑time Training
DeepRec supports incremental model export, delivering only the changed parameters (a few KB) instead of full checkpoints, which enables minute‑level online model updates.
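The idea behind incremental export can be sketched as tracking which parameters changed since the last export and shipping only those. The names `DeltaTracker`, `export_delta` and `apply_delta` below are hypothetical and do not correspond to DeepRec's API; the sketch also ignores deleted (evicted) keys, which a real system would need to propagate.

```python
class DeltaTracker:
    """Tracks parameters touched since the last export (hypothetical sketch)."""

    def __init__(self):
        self.params = {}    # key -> current value on the training side
        self.dirty = set()  # keys changed since the last delta export

    def update(self, key, value):
        self.params[key] = value
        self.dirty.add(key)

    def export_full(self):
        """Full checkpoint: every parameter, regardless of recent changes."""
        return dict(self.params)

    def export_delta(self):
        """Incremental export: only parameters changed since the last call."""
        delta = {key: self.params[key] for key in self.dirty}
        self.dirty.clear()
        return delta


def apply_delta(serving_params, delta):
    """Merge an incremental export into the serving-side copy in place."""
    serving_params.update(delta)


trainer = DeltaTracker()
for feature_id in range(1000):
    trainer.update(feature_id, 0.0)

serving = trainer.export_full()   # initial full deployment
trainer.export_delta()            # reset change tracking after the full export

trainer.update(7, 0.25)           # a short training window touches few keys
trainer.update(42, -0.5)
delta = trainer.export_delta()
apply_delta(serving, delta)

print(len(delta), "of", len(serving), "parameters shipped")
```

The delta size depends on how many features a training window touches rather than on the total model size, which is what makes frequent updates practical.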
Online Inference
Processor: libserving_processor.so provides model auto‑detection, incremental updates and SessionGroup support.
PAI‑EAS: Alibaba Cloud’s online inference service offers load testing, auto‑scaling, debugging and monitoring.
Overall Benefits
Training: GPU utilization increased by >40% and total training time reduced by >50%.
Model Deployment: In a major recommendation scenario, CTR and PTR improved by 2‑3% while latency remained stable; adding high‑dimensional IDs yielded similar gains.
Future Plans
SessionGroup for shared‑memory multi‑model inference.
Model compression and quantization to drop EV data in production.
Multi‑model and GPU inference using CUDA Multi‑Stream and CUDA Graph.
Thanks to the DeepRec community for technical support that accelerated large‑model training and inference.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.