How DeepRec Boosted Sparse Model Training and Inference for Large‑Scale Recommendations

This article details how the MetaApp advertising team adopted Alibaba Cloud's open‑source DeepRec to overcome parameter‑server bottlenecks, compress terabyte‑scale embeddings, leverage GPU‑accelerated distributed training, and build a low‑maintenance, high‑performance inference service using DeepRec's Processor and oneDNN optimizations.


Background

Large‑scale recommendation models in China have used custom parameter‑server systems for years. Early solutions combined a self‑developed PS with TensorFlow/PyTorch workers, but native TensorFlow performed poorly on massive id‑embedding workloads.

Figure: Typical distributed worker‑PS architecture

Business Context

The MetaApp advertising R&D team previously trained TB‑scale models with TensorFlow plus a custom PS, incurring high iteration and maintenance costs.

After evaluating Alibaba Cloud PAI’s open‑source DeepRec (derived from PAI‑TF), they adopted it for both training and online inference, achieving significant performance gains and cost reductions.

3.1 EmbeddingVariable Multi‑Level Storage

Embedding sizes can reach terabytes, making pure‑memory storage impractical. DeepRec provides a multi‑level storage mechanism that places hot embeddings in GPU/CPU memory and colder ones on PMEM or SSD.
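As a sketch of how this is configured (the API names follow DeepRec's EmbeddingVariable documentation; the storage path, size budget, and hyperparameters here are illustrative assumptions):

```python
import tensorflow as tf  # DeepRec's TensorFlow fork

# Multi-level storage for an EmbeddingVariable: hot ids stay in DRAM,
# cold ids spill to an SSD-backed hash store. Path and size are examples.
storage_opt = tf.StorageOption(
    storage_type=tf.StorageType.DRAM_SSDHASH,
    storage_path="/data/ev_ssd",          # hypothetical SSD directory
    storage_size=[512 * 1024 * 1024])     # DRAM budget before spilling

ev_opt = tf.EmbeddingVariableOption(storage_option=storage_opt)

user_emb = tf.get_embedding_variable(
    "user_id_embedding",
    embedding_dim=16,
    initializer=tf.truncated_normal_initializer(stddev=0.02),
    ev_option=ev_opt)

ids = tf.constant([10001, 10002, 99999], dtype=tf.int64)
emb = tf.nn.embedding_lookup(user_emb, ids)
```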

3.1.1 Compaction Performance Issues

Using a LevelDB‑like SSD store caused frequent compactions, leading to write amplification and severe read latency on the PS side.

3.1.2 DeepRec Solution

DeepRec replaced the LevelDB backend with SSDHASH, offering both synchronous and asynchronous compaction, which dramatically improved read performance during training.
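In our reading of the DeepRec docs, the compaction mode is selected through an environment variable; treat the exact name as an assumption to verify against your version:

```python
import os

# Assumed toggle from DeepRec's multi-level storage docs: "1" selects
# asynchronous compaction (a background thread), "0" synchronous.
os.environ["TF_SSDHASH_ASYNC_COMPACTION"] = "1"
```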

3.1.3 Model Size Compression

By applying binary‑code multihash compression to uid embeddings, model size was reduced from ~800 GB to <40 GB with only ~0.3 % AUC loss, enabling full‑model placement in GPU memory.
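The article does not give the exact coding scheme, but the general multi‑hash idea can be sketched as follows: each uid maps to codes in several small tables whose combined lookup replaces one enormous per‑uid row. Table sizes, the salting scheme, and the sum combiner below are illustrative assumptions:

```python
import tensorflow as tf

NUM_BUCKETS = 1_000_000  # each sub-table is far smaller than the raw uid space
EMB_DIM = 16

# Two small tables jointly stand in for one huge uid embedding table.
table_a = tf.get_variable("uid_emb_a", [NUM_BUCKETS, EMB_DIM])
table_b = tf.get_variable("uid_emb_b", [NUM_BUCKETS, EMB_DIM])

def _code(uids, salt):
    # Salted hash gives each table an independent code per uid.
    return tf.strings.to_hash_bucket_fast(
        tf.strings.join([tf.as_string(uids), salt]), NUM_BUCKETS)

def multihash_lookup(uids):
    # Summing the partial embeddings keeps the output at EMB_DIM; other
    # combiners (concat, weighted sum) trade size against accuracy.
    return (tf.nn.embedding_lookup(table_a, _code(uids, "a")) +
            tf.nn.embedding_lookup(table_b, _code(uids, "b")))
```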

3.2 GPU‑Based Distributed Training

With the PS bottleneck removed, training speed scales with compute. DeepRec’s HybridBackend (HB) supports GPU‑accelerated distributed training and integrates tightly with DeepRec.

Figure: HybridBackend training speed comparison

HB also avoids data loss in multi‑GPU training: each worker loads all data groups, and batches are rebalanced across GPUs.
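A minimal training sketch, assuming HybridBackend's hb.scope() entry point as shown in its public examples (the input pipeline and model function are hypothetical placeholders):

```python
import tensorflow as tf
import hybridbackend.tensorflow as hb

# Inside hb.scope(), HB shards embedding lookups across GPUs and runs the
# dense layers data-parallel; treat the exact entry point as an assumption.
with hb.scope():
    ds = make_dataset()                       # hypothetical input pipeline
    features, labels = tf.data.make_one_shot_iterator(ds).get_next()
    logits = build_model(features)            # hypothetical model fn
    loss = tf.losses.sigmoid_cross_entropy(labels, logits)
    train_op = tf.train.AdagradOptimizer(0.01).minimize(loss)

    with tf.train.MonitoredTrainingSession() as sess:
        while not sess.should_stop():
            sess.run(train_op)
```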

4.1 Inference Pain Points

Online inference suffers from high maintenance cost due to numerous model versions and manual resource scaling for A/B testing.

4.2 Processor‑Based Inference Solution

Deploy a single large‑spec machine (e.g., 128 CPU cores, 512 GB RAM) to host all model instances, managed by a serving proxy that automates the model lifecycle.

DeepRec’s Serving Processor library (a shared object) is integrated into a custom Go RPC framework for low‑latency serving.
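The team wired the Processor into Go, but the same shared object can be exercised from any language with a C FFI. A minimal ctypes sketch, assuming the initialize/process C entry points described in DeepRec's serving docs (paths and the request encoding are hypothetical):

```python
import ctypes
import json

# Load DeepRec's serving processor; library name per DeepRec's serving docs.
proc = ctypes.CDLL("./libserving_processor.so")
proc.initialize.restype = ctypes.c_void_p

config = json.dumps({
    "model_config": {
        "checkpoint_dir": "/models/ctr/checkpoint/",   # hypothetical paths
        "savedmodel_dir": "/models/ctr/savedmodel/",
    }
}).encode()

state = ctypes.c_int(0)
model = proc.initialize(b"", config, ctypes.byref(state))  # model handle

request = build_request_bytes()   # hypothetical: serialized prediction request
out_data = ctypes.c_void_p()
out_size = ctypes.c_int(0)
proc.process(ctypes.c_void_p(model), request, len(request),
             ctypes.byref(out_data), ctypes.byref(out_size))
```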

4.2.3 SessionGroup

SessionGroup configures multiple TensorFlow sessions with round‑robin request routing, isolating thread pools per session while sharing Variables, yielding ~50 % inference speedup.
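In the Processor, SessionGroup is enabled through the model‑config JSON; the field names below follow our reading of DeepRec's SessionGroup docs and should be verified against your version:

```python
# Assumed SessionGroup fields inside the processor's model_config:
session_group_config = {
    "session_num": 4,                 # four sessions, requests round-robined
    "use_per_session_threads": True,  # isolate intra/inter-op thread pools
}
```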

4.2.4 oneDNN Optimization

DeepRec incorporates Intel oneDNN with a unified Eigen thread pool, enabling BF16 support on Xeon Scalable CPUs and delivering up to 10 % end‑to‑end acceleration.
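A generic TF‑style sketch of the BF16 idea (not DeepRec's exact API): run the dense tower's matmuls in bfloat16 and cast back to FP32 before the loss, letting the oneDNN build pick BF16‑capable kernels on supported Xeon CPUs:

```python
import tensorflow as tf

def dense_tower_bf16(x, units_list):
    # Cast once on entry; the matmul/ReLU layers then execute in BF16.
    h = tf.cast(x, tf.bfloat16)
    for i, units in enumerate(units_list):
        h = tf.layers.dense(h, units, activation=tf.nn.relu, name="fc%d" % i)
    # Cast back so the loss and optimizer accumulate in FP32.
    return tf.cast(h, tf.float32)
```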

4.2.5 Sub‑graph Fusion

Manual fusion of Reshape‑containing sub‑graphs combined with oneDNN kernels reduces operator overhead and cuts CPU usage by 10 %.

4.2.6 Cost Model Design

A simple cost model allocates dedicated cores to baseline models and shares remaining cores among A/B test models, improving overall performance by ~30 %.
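A toy sketch of that allocation policy (all names and numbers are illustrative): baseline models get dedicated cores sized to their load, and every A/B variant shares the remainder:

```python
def allocate_cores(total_cores, baselines, ab_models):
    """Dedicate cores to baseline models; A/B variants share what's left."""
    plan, used = {}, 0
    for name, cores in baselines.items():
        plan[name] = {"cores": cores, "dedicated": True}
        used += cores
    shared = total_cores - used
    for name in ab_models:
        plan[name] = {"cores": shared, "dedicated": False}  # shared pool
    return plan

# e.g., on the 128-core machine from section 4.2:
plan = allocate_cores(128, {"ctr_base": 48, "cvr_base": 32},
                      ["ctr_exp_a", "ctr_exp_b"])
```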

Future Plans

Develop a dynamic cost‑model optimizer and open‑source the inference architecture built on DeepRec Processor.
