How DeepRec Boosted Sparse Model Training and Inference for Large‑Scale Recommendations
This article details how the metaapp advertising team adopted Alibaba Cloud's open‑source DeepRec to overcome parameter‑server bottlenecks, compress terabyte‑scale embeddings, leverage GPU‑accelerated distributed training, and build a low‑maintenance, high‑performance inference service using DeepRec's Processor and oneDNN optimizations.
Background
Large‑scale recommendation models in China have used custom parameter‑server systems for years. Early solutions combined a self‑developed PS with TensorFlow/PyTorch workers, but native TensorFlow performed poorly on massive id‑embedding workloads.
Business Context
The metaapp advertising R&D team previously trained TB‑scale models with TensorFlow + custom PS, incurring high iteration and maintenance costs.
After evaluating Alibaba Cloud PAI’s open‑source DeepRec (derived from PAI‑TF), they adopted it for both training and online inference, achieving significant performance gains and cost reductions.
3.1 EmbeddingVariable Multi‑Level Storage
Embedding sizes can reach terabytes, making pure‑memory storage impractical. DeepRec provides a multi‑level storage mechanism that places hot embeddings in GPU/CPU memory and colder ones on PMEM or SSD.
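The mechanics of a hot/cold split can be sketched in a few lines: a bounded "hot" tier (standing in for GPU/CPU memory) with LRU eviction into a "cold" tier (standing in for PMEM/SSD). This is a minimal illustration of the idea, not DeepRec's actual EmbeddingVariable API; the class and tier names are invented for the example.

```python
from collections import OrderedDict

import numpy as np


class TieredEmbeddingStore:
    """Toy multi-level embedding store: hot entries stay in a bounded
    in-memory tier; the least-recently-used entry spills to a cold tier."""

    def __init__(self, hot_capacity, dim):
        self.hot = OrderedDict()   # id -> vector, most recently used last
        self.cold = {}             # evicted ids live here (stand-in for SSD)
        self.hot_capacity = hot_capacity
        self.dim = dim

    def lookup(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)          # refresh recency
            return self.hot[key]
        # promote from the cold tier, or initialize a new embedding row
        vec = self.cold.pop(key, None)
        if vec is None:
            vec = np.zeros(self.dim, dtype=np.float32)
        self.hot[key] = vec
        if len(self.hot) > self.hot_capacity:  # evict the coldest entry
            old_key, old_vec = self.hot.popitem(last=False)
            self.cold[old_key] = old_vec
        return vec
```

In the real system the cold tier is a persistent store and promotion/eviction happen in bulk, but the access pattern — hot ids served from memory, cold ids paged in on demand — is the same.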
3.1.1 Compaction Performance Issues
Using a LevelDB‑like SSD store caused frequent compactions, leading to write amplification and severe read latency on the PS side.
3.1.2 DeepRec Solution
DeepRec replaced the LevelDB backend with SSDHASH, offering both synchronous and asynchronous compaction, which dramatically improved read performance during training.
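The contrast with an LSM/LevelDB layout can be made concrete with a toy append-only store: lookups go through an in-memory index straight to the latest record (no multi-level merge on the read path), and compaction simply rewrites live records — either synchronously or on a background thread. This is an illustrative sketch of the design idea, not DeepRec's SSDHASH implementation.

```python
import threading


class AppendOnlyKVStore:
    """Toy hash-indexed append-only store with sync and async compaction."""

    def __init__(self):
        self.log = []        # append-only (key, value) records
        self.index = {}      # key -> offset of the latest record
        self.lock = threading.Lock()

    def put(self, key, value):
        with self.lock:
            self.index[key] = len(self.log)
            self.log.append((key, value))

    def get(self, key):
        # One indexed read; no merging across sorted levels.
        return self.log[self.index[key]][1]

    def compact(self):
        """Synchronous compaction: keep only the latest record per key."""
        with self.lock:
            live = [(k, self.log[off][1]) for k, off in self.index.items()]
            self.log, self.index = [], {}
            for k, v in live:
                self.index[k] = len(self.log)
                self.log.append((k, v))

    def compact_async(self):
        """Asynchronous variant: run compaction off the serving thread."""
        t = threading.Thread(target=self.compact)
        t.start()
        return t
```

Because reads never traverse compaction state, a background compaction does not add read latency — which is the property the article credits for the training-time speedup.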
3.1.3 Model Size Compression
By applying binary‑code multihash compression to uid embeddings, model size was reduced from ~800 GB to <40 GB with only ~0.3 % AUC loss, enabling full‑model placement in GPU memory.
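The storage saving comes from composing each uid's embedding out of rows in two much smaller tables instead of one huge id-indexed table, so stored rows grow roughly with the square root of the id space. The sketch below shows the composition pattern; the table sizes, the quotient/remainder "hash" codes, and the sum combiner are illustrative assumptions, not the exact scheme from the article.

```python
import numpy as np


class MultiHashEmbedding:
    """Toy multihash embedding: each uid maps to one row in each of two
    small tables, and its embedding is the sum of the two rows."""

    def __init__(self, num_buckets, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.num_buckets = num_buckets
        # Two small tables replace one table with num_buckets**2 rows.
        self.table_q = rng.normal(size=(num_buckets, dim)).astype(np.float32)
        self.table_r = rng.normal(size=(num_buckets, dim)).astype(np.float32)

    def lookup(self, uid):
        q, r = divmod(uid, self.num_buckets)   # two codes per uid
        return self.table_q[q % self.num_buckets] + self.table_r[r]
```

With 10^6 buckets per table this covers 10^12 distinct ids while storing only 2×10^6 rows — the same order of compression (hundreds of GB down to tens of GB) the article reports, traded against a small accuracy loss because distinct uids can share component rows.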
3.2 GPU‑Based Distributed Training
With the PS bottleneck removed, training speed scales with compute. DeepRec’s HybridBackend (HB) supports GPU‑accelerated distributed training and integrates tightly with DeepRec.
HB also prevents sample loss in multi‑GPU training: every worker loads all data groups, and batches are balanced across workers.
4.1 Inference Pain Points
Online inference suffers from high maintenance cost due to numerous model versions and manual resource scaling for A/B testing.
4.2 Processor‑Based Inference Solution
Deploy a single large‑spec machine (e.g., 128 CPU cores, 512 GB RAM) to host all model instances, managed by a serving‑proxy that automates model lifecycle.
DeepRec’s Serving Processor library (a shared‑object) is integrated into a custom Go RPC framework for low‑latency serving.
4.2.3 SessionGroup
SessionGroup configures multiple TensorFlow sessions with round‑robin request routing, isolating thread pools per session while sharing Variables, yielding ~50 % inference speedup.
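The routing half of this idea is simple to sketch: several session objects share the same underlying weights, each owns its own worker pool, and requests are dispatched round-robin across them. The `Session` type below is a stand-in for illustration, not TensorFlow's; the thread-pool isolation that SessionGroup provides is only hinted at here.

```python
import itertools
import threading


class SessionGroup:
    """Minimal round-robin dispatcher over a fixed set of sessions."""

    def __init__(self, sessions):
        self.sessions = sessions
        self._next = itertools.cycle(range(len(sessions)))
        self._lock = threading.Lock()

    def run(self, request):
        with self._lock:                 # pick the next session round-robin
            idx = next(self._next)
        return self.sessions[idx].run(request)
```

The performance win in the real system comes from each session's private intra/inter-op thread pools no longer contending with each other, while the shared Variables keep memory usage flat as sessions are added.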
4.2.4 oneDNN Optimization
DeepRec incorporates Intel oneDNN with a unified Eigen thread pool, enabling BF16 support on Xeon Scalable CPUs and delivering up to 10 % end‑to‑end acceleration.
4.2.5 Sub‑graph Fusion
Manual fusion of Reshape‑containing sub‑graphs combined with oneDNN kernels reduces operator overhead and cuts CPU usage by 10 %.
4.2.6 Cost Model Design
A simple cost model allocates dedicated cores to baseline models and shares remaining cores among A/B test models, improving overall performance by ~30 %.
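That allocation policy can be written down in a few lines: baseline models each get a fixed dedicated core budget, and whatever remains is shared among the A/B-test models. The function name, the per-baseline budget parameter, and the even split are assumptions made for illustration; the article does not spell out the exact formula.

```python
def allocate_cores(total_cores, baseline_models, ab_models, cores_per_baseline):
    """Toy cost model: dedicated cores for baselines, even sharing for
    A/B-test models. Returns (dedicated, shared) core maps."""
    dedicated = {m: cores_per_baseline for m in baseline_models}
    remaining = total_cores - cores_per_baseline * len(baseline_models)
    if remaining < 0:
        raise ValueError("not enough cores for baseline models")
    shared = remaining / len(ab_models) if ab_models else 0
    return dedicated, {m: shared for m in ab_models}
```

On the 128-core machine from the example above, one baseline pinned to 32 cores leaves 96 cores to split among the A/B variants — the baseline's latency stays stable while experiments absorb the leftover capacity.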
Future Plans
Develop a dynamic cost‑model optimizer and open‑source the inference architecture built on DeepRec Processor.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.