How DeepRec Cut Ximalaya AI Cloud Training Time by 50% and Boosted CTR

Ximalaya’s AI Cloud platform uses Alibaba’s DeepRec to handle high‑dimensional sparse features, cut model training time by more than 50%, enable minute‑level model updates, and improve recommendation metrics. This article covers the implementation details, multi‑tier storage, real‑time training, and planned inference enhancements.

Alibaba Cloud Big Data AI Platform

Business Overview

The Ximalaya app recommends content through scenarios such as Daily Must‑Listen, Hot Today and Private FM. Ximalaya AI Cloud is a one‑stop algorithm platform covering data, features, models and services. Its visual modeling and component‑based pipelines let users build data→feature→sample→model→service workflows without coding.

Collaboration Background

Rapid growth in algorithm capability and search advertising drove the platform from traditional machine learning to deep learning, which requires larger sample sizes, more feature dimensions and greater model complexity. The stack uses Spark for data processing, Kubernetes for GPU scheduling, and TensorFlow for training, but it faces two main pain points:

High‑dimensional Sparse Feature Support

Hash collision: With simple hashing, the collision rate exceeds 20% at tens of millions of buckets. A multi‑hash scheme reduces it to 0.2‰ while using 95% fewer parameters, though it expands long‑sequence ID features.

Feature admission/eviction/variable length: Proper configuration shrinks model size and stabilizes metrics, enabling training of models with billions of feature dimensions.
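The hash‑collision trade‑off described above can be illustrated with a small sketch. This is a toy, not Ximalaya's implementation: the ID range, bucket counts, salts and the `bucket` and `collision_rate` helpers are all assumptions chosen to keep the example small, and the numbers it prints will not match the production figures quoted above.

```python
import hashlib

def bucket(feature_id: int, salt: str, num_buckets: int) -> int:
    """Deterministically hash an ID into one of num_buckets slots."""
    digest = hashlib.md5(f"{salt}:{feature_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def collision_rate(keys) -> float:
    """Fraction of IDs whose key is shared with at least one other ID."""
    counts = {}
    for k in keys:
        counts[k] = counts.get(k, 0) + 1
    collided = sum(c for c in counts.values() if c > 1)
    return collided / len(keys)

ids = range(100_000)

# Single hash: one table with 100,000 rows.
single = [bucket(i, "a", 100_000) for i in ids]

# Multi-hash: two tables of 5,000 rows each (10,000 rows in total).
# An ID is represented by the pair of rows, so two IDs collide only
# when both of their hashes agree.
multi = [(bucket(i, "a", 5_000), bucket(i, "b", 5_000)) for i in ids]

single_rate = collision_rate(single)
multi_rate = collision_rate(multi)
print(f"single-hash collision rate: {single_rate:.3f}")
print(f"multi-hash collision rate:  {multi_rate:.3f}")
```

In this sketch the multi‑hash variant stores a tenth of the rows yet collides far less often, because the number of distinct row pairs is the product of the two table sizes. The cost is that every ID now triggers two lookups, which is one way to read the remark that the scheme expands long‑sequence IDs.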

Rapid Model Iteration

Minute‑level updates: Daily model updates are too slow for high‑frequency scenarios. The time spent on data back‑flow, processing, training and upload limits how fast a model can be refreshed, and deploying the full model each time adds further latency.

To address these, the team evaluated Alibaba DeepRec, Tencent TFRA/DynamicEmbedding and NVIDIA HugeCTR, and chose Alibaba’s open‑source DeepRec.

DeepRec Deployment

High‑dimensional Sparse Features

DeepRec’s EmbeddingVariable (EV) uses a dynamic, hash‑map‑like structure, so the number of features can grow elastically, which reduces both conflicts and memory usage. It supports feature admission, eviction, and multi‑tier storage.
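To make the "dynamic hash‑map‑like structure" concrete, here is a minimal pure‑Python sketch. It is not DeepRec's API or implementation: `ToyEmbeddingTable`, its `lookup` method and the Gaussian initializer are hypothetical names and choices used only for illustration.

```python
import random

class ToyEmbeddingTable:
    """Hash-map-backed embedding table: rows are created on first lookup."""

    def __init__(self, dim: int, seed: int = 0):
        self.dim = dim
        self.rows = {}                    # feature ID -> embedding vector
        self._rng = random.Random(seed)   # assumed initializer, for illustration

    def lookup(self, feature_id: int) -> list:
        # No hashing into a fixed number of buckets: each distinct ID gets
        # its own row, so two IDs never share parameters.
        if feature_id not in self.rows:
            self.rows[feature_id] = [
                self._rng.gauss(0.0, 0.01) for _ in range(self.dim)
            ]
        return self.rows[feature_id]

table = ToyEmbeddingTable(dim=4)
for fid in [7, 10**12 + 3, 7, 42]:
    table.lookup(fid)

print(len(table.rows))  # 3 distinct IDs were seen, so the table holds 3 rows
```

The table size follows the number of IDs actually observed rather than a bucket count fixed in advance, which is the property the paragraph above attributes to EmbeddingVariable.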

Feature Admission/Eviction

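Admission is counter‑based: a feature is admitted once it has been seen often enough. Eviction is based on global steps: a feature that has not been updated for a configured number of steps is removed. By default, EV is enabled only for high‑dimensional sparse features and disabled for all others.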

Unadmitted Features

Unadmitted features share the same initialization as admitted ones but are not updated until they become admitted; at serving time, lookups for them return a default value of 0.
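The admission, eviction and unadmitted‑feature rules above can be sketched together in a few lines. This is an illustrative model under stated assumptions, not DeepRec code: the class `ToyAdmissionTable`, the parameters `min_count` and `steps_to_live`, the constant initial value and the plain SGD update are all hypothetical.

```python
class ToyAdmissionTable:
    """Counter-based admission plus global-step eviction (illustrative only)."""

    def __init__(self, dim, min_count, steps_to_live):
        self.dim = dim
        self.min_count = min_count          # occurrences needed for admission
        self.steps_to_live = steps_to_live  # idle steps allowed before eviction
        self.counts = {}      # feature ID -> occurrences seen in training
        self.rows = {}        # feature ID -> embedding vector
        self.last_step = {}   # feature ID -> global step of last occurrence

    def is_admitted(self, fid):
        return self.counts.get(fid, 0) >= self.min_count

    def train_update(self, fid, grad, step, lr=0.5):
        self.counts[fid] = self.counts.get(fid, 0) + 1
        self.last_step[fid] = step
        # Every ID starts from the same initial value (assumed constant here).
        row = self.rows.setdefault(fid, [1.0] * self.dim)
        if self.is_admitted(fid):   # unadmitted rows keep their initial value
            for i in range(self.dim):
                row[i] -= lr * grad[i]

    def serve_lookup(self, fid):
        # Serving returns zeros for IDs that are unknown or not yet admitted.
        if fid in self.rows and self.is_admitted(fid):
            return list(self.rows[fid])
        return [0.0] * self.dim

    def evict(self, step):
        """Drop IDs that have been idle for more than steps_to_live steps."""
        stale = [f for f, s in self.last_step.items()
                 if step - s > self.steps_to_live]
        for f in stale:
            self.rows.pop(f, None)
            self.counts.pop(f, None)
            self.last_step.pop(f, None)
        return stale

t = ToyAdmissionTable(dim=2, min_count=3, steps_to_live=100)
t.train_update(1, [1.0, 1.0], step=1)
t.train_update(1, [1.0, 1.0], step=2)
before = t.serve_lookup(1)     # seen twice, still unadmitted
t.train_update(1, [1.0, 1.0], step=3)
after = t.serve_lookup(1)      # third occurrence admits and updates the row
evicted = t.evict(step=200)    # last seen at step 3, so it is now stale
```

Raising `min_count` keeps rare IDs out of the served model, and lowering `steps_to_live` removes stale IDs sooner; both shrink the model, which matches the size and stability effects described earlier.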

EV Analysis

DeepRec provides an analysis component that reports feature names, IDs, vectors, update frequencies and recent steps after each training run, aiding parameter tuning.

Multi‑Level Feature Storage

Embedding parameters can be stored across HBM, DRAM and SSD with a cache strategy, keeping hot features in fast memory and reducing overall training memory consumption.
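A minimal sketch of the hot/cold idea follows, reduced to two tiers with a least‑recently‑used policy. The class `ToyTieredStore`, the LRU policy and the capacity value are assumptions made for illustration; the source does not say which cache strategy DeepRec applies.

```python
from collections import OrderedDict

class ToyTieredStore:
    """Two-tier embedding store: a small LRU 'fast' tier over a large 'slow' tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()   # stands in for the faster tier (e.g. HBM or DRAM)
        self.slow = {}              # stands in for the slower tier (e.g. DRAM or SSD)

    def put(self, fid, vec):
        self.fast[fid] = vec
        self.fast.move_to_end(fid)  # mark as most recently used
        self.slow.pop(fid, None)    # a row lives in exactly one tier
        self._spill()

    def get(self, fid):
        if fid in self.fast:
            self.fast.move_to_end(fid)
            return self.fast[fid]
        vec = self.slow.pop(fid)    # raises KeyError if the ID is unknown
        self.put(fid, vec)          # promote the row on access
        return vec

    def _spill(self):
        # Demote least recently used rows once the fast tier is over capacity.
        while len(self.fast) > self.fast_capacity:
            cold_id, cold_vec = self.fast.popitem(last=False)
            self.slow[cold_id] = cold_vec

store = ToyTieredStore(fast_capacity=2)
for fid in ("a", "b", "c"):
    store.put(fid, [0.0])
store.get("a")   # "a" was demoted when "c" arrived; reading it promotes it again
print(sorted(store.fast), sorted(store.slow))  # ['a', 'c'] ['b']
```

Only `fast_capacity` rows occupy the fast tier at any time, so the fast‑memory footprint stays bounded while the full table can keep growing in the slower tier.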

Real‑time Training

DeepRec supports incremental model export, which delivers only the changed parameters (a few KB) instead of a full checkpoint, enabling minute‑level online model updates.
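The difference between a full checkpoint and an incremental export can be sketched with a dirty‑set tracker. This is a hypothetical illustration, not DeepRec's export format or API: `ToyIncrementalExporter`, `export_delta` and `apply_delta` are invented names, and real exports would also carry dense weights and metadata.

```python
class ToyIncrementalExporter:
    """Tracks which embedding rows changed since the last export."""

    def __init__(self):
        self.rows = {}       # feature ID -> embedding vector
        self.dirty = set()   # IDs modified since the last export

    def update(self, fid, vec):
        self.rows[fid] = list(vec)
        self.dirty.add(fid)

    def export_full(self):
        self.dirty.clear()
        return {fid: list(vec) for fid, vec in self.rows.items()}

    def export_delta(self):
        delta = {fid: list(self.rows[fid]) for fid in self.dirty}
        self.dirty.clear()
        return delta

def apply_delta(serving_rows, delta):
    """Merge a delta into the serving copy in place."""
    serving_rows.update(delta)

trainer = ToyIncrementalExporter()
for fid in range(1000):
    trainer.update(fid, [0.0, 0.0])
serving = trainer.export_full()      # initial full checkpoint

trainer.update(3, [0.1, 0.2])        # a short training window touches 2 rows
trainer.update(999, [0.3, 0.4])
delta = trainer.export_delta()
apply_delta(serving, delta)
print(len(delta), len(serving))      # 2 1000
```

Because only the touched rows travel from trainer to server, the payload scales with the amount of recent change rather than with model size, which is what makes frequent updates practical.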

Online Inference

Processor: libserving_processor.so provides model auto‑detection, incremental updates and SessionGroup support.

PAI‑EAS: Alibaba Cloud’s online inference service offers load testing, auto‑scaling, debugging and monitoring.

Overall Benefits

Training: GPU utilization increased by more than 40%, and total training time fell by more than 50%.

Model Deployment: In a major recommendation scenario, CTR and PTR improved by 2‑3% while latency remained stable; adding high‑dimensional IDs yielded similar gains.

Future Plans

SessionGroup for shared‑memory multi‑model inference.

Model compression and quantization to drop EV data in production.

Multi‑model and GPU inference using CUDA Multi‑Stream and CUDA Graph.

Thanks to the DeepRec community for technical support that accelerated large‑model training and inference.

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.