
Large‑Scale Recommendation System Training with TorchRec and Dynamic Embedding

This article explains how Tencent’s AI team uses the PyTorch‑based TorchRec library, together with a custom dynamic‑embedding extension, to train billion‑scale recommendation models efficiently. It covers the benefits of TorchRec, GPU‑resident embedding, optimized kernels, embedding partitioning strategies, experimental results, and practical deployment guidance.

DataFunTalk

In February 2022, the PyTorch team released TorchRec, an official recommendation library, which the Tencent AI team began testing in May and later contributed enhancements such as dynamic embedding that were merged into the main branch by September.

TorchRec offers several advantages over traditional TensorFlow recommendation frameworks: better developer experience with dynamic graphs, seamless version upgrades, faster CUDA features, and proven large‑scale production use at Meta (e.g., a 125‑billion‑parameter model on Instagram Reels).

The library introduces GPU‑resident embedding, where the embedding table is partitioned across multiple GPUs, eliminating costly CPU‑GPU data transfers and improving GPU utilization; it also supports Unified Virtual Memory (UVM) to extend effective GPU memory.
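To make the partitioning concrete, here is a minimal Python sketch (our illustration, not TorchRec’s API) of row‑wise modulo sharding: each GPU owns an interleaved slice of the one logical table, so every lookup resolves to exactly one device and never round‑trips through host memory. The names owner_of, local_index, and route_batch are invented for this sketch.

```python
# Illustrative sketch of row-wise (modulo) sharding of a single logical
# embedding table across several GPUs. Not TorchRec code.

def owner_of(row_id: int, num_gpus: int) -> int:
    """GPU that stores a given global embedding row."""
    return row_id % num_gpus

def local_index(row_id: int, num_gpus: int) -> int:
    """Index of the row inside its owner's local shard."""
    return row_id // num_gpus

def route_batch(ids, num_gpus):
    """Group a batch of lookup IDs by owning GPU, as an all-to-all would."""
    buckets = {g: [] for g in range(num_gpus)}
    for i in ids:
        buckets[owner_of(i, num_gpus)].append(local_index(i, num_gpus))
    return buckets

# Example: 4 GPUs, one logical table; each ID maps to exactly one shard.
print(route_batch([0, 1, 5, 8, 13], 4))
```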

Optimized GPU kernels in TorchRec (e.g., fused embedding‑lookup kernels) exploit warp‑level shuffle primitives such as CUDA's __shfl_sync to broadcast IDs within a warp, turning random memory accesses into largely sequential reads and achieving multiple‑fold speedups over native PyTorch kernels.
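The kernel itself lives in CUDA, but the access‑pattern idea behind it can be illustrated with a plain Python sketch (ours, not TorchRec code): once lookup IDs are grouped and ordered, scattered reads collapse into sequential runs that the hardware can serve far more cheaply.

```python
# CPU illustration of why ordered lookups are cheaper than scattered ones:
# count reads that are not adjacent to the previous read.

def non_sequential_jumps(ids):
    """Number of accesses that break a sequential run."""
    jumps = 0
    for prev, cur in zip(ids, ids[1:]):
        if cur != prev + 1:
            jumps += 1
    return jumps

random_order = [902, 3, 511, 4, 900, 901, 2, 510]
sorted_order = sorted(random_order)

# Ordered access touches memory in runs (2 jumps vs. 6 here).
print(non_sequential_jumps(random_order), non_sequential_jumps(sorted_order))
```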

For embedding partitioning, TorchRec provides flexible strategies (row‑wise, column‑wise, table‑wise, data‑parallel) and automated planners that select the optimal scheme based on hardware bandwidth, memory, and model size.
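As a rough illustration of the trade‑off such a planner weighs (a toy model with invented table sizes, not the actual TorchRec planner), compare the peak per‑GPU memory of table‑wise versus row‑wise sharding:

```python
# Toy cost model for two sharding schemes over hypothetical fp32 tables,
# each given as (num_rows, embedding_dim).

def per_gpu_bytes_table_wise(tables, num_gpus):
    """Table-wise: each table lives wholly on one GPU; greedy largest-first."""
    loads = [0] * num_gpus
    for rows, dim in sorted(tables, reverse=True):
        g = loads.index(min(loads))  # place on least-loaded GPU
        loads[g] += rows * dim * 4   # fp32 = 4 bytes
    return max(loads)

def per_gpu_bytes_row_wise(tables, num_gpus):
    """Row-wise: every table's rows are split evenly across all GPUs."""
    total = sum(rows * dim * 4 for rows, dim in tables)
    return total // num_gpus

tables = [(1_000_000, 64), (200_000, 128), (50_000, 32)]
# With one dominant table, row-wise balances memory much better.
print(per_gpu_bytes_table_wise(tables, 4), per_gpu_bytes_row_wise(tables, 4))
```

A real planner also weighs communication cost (row‑wise needs an all‑to‑all per lookup, table‑wise does not), which is why TorchRec automates the choice from hardware bandwidth and memory figures.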

When scaling to trillion‑parameter models, the team identified limitations of pure GPU embedding (insufficient GPU memory, lack of dynamic ID addition, migration challenges) and introduced a dynamic embedding design that combines GPU embedding with a parameter‑server (PS) for rarely accessed IDs.

The dynamic embedding workflow uses an ID transformer (implemented with a high‑performance hash map) to map global IDs to a limited GPU‑resident space; once the GPU space is full, infrequently used IDs are evicted to the PS, which acts only as a key‑value store, reducing communication overhead.
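A minimal sketch of the ID‑transformer idea, assuming a plain LRU policy and an in‑memory list standing in for the PS (the actual implementation uses a high‑performance hash map and a hybrid eviction policy):

```python
# Sketch: map unbounded global IDs into a fixed GPU-resident slot space,
# evicting the coldest ID to the parameter server when the space fills up.
# Class and method names are ours, for illustration only.

from collections import OrderedDict

class IdTransformer:
    def __init__(self, capacity: int):
        self.slots = OrderedDict()        # global id -> GPU slot, LRU order
        self.free = list(range(capacity)) # unused GPU slots
        self.evicted = []                 # ids handed off to the PS kv-store

    def transform(self, global_id: int) -> int:
        if global_id in self.slots:
            self.slots.move_to_end(global_id)  # mark as recently used
            return self.slots[global_id]
        if not self.free:  # GPU space full: evict the coldest id to the PS
            victim, slot = self.slots.popitem(last=False)
            self.evicted.append(victim)
            self.free.append(slot)
        slot = self.free.pop()
        self.slots[global_id] = slot
        return slot

t = IdTransformer(capacity=2)
print(t.transform(10), t.transform(20), t.transform(10), t.transform(30))
print(t.evicted)  # 20 was coldest, so it went to the PS
```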

Additional innovations include a cache‑friendly LRU/LFU hybrid eviction algorithm inspired by Redis and a multi‑GPU ID transformer that gathers and broadcasts transformer state efficiently.
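The Redis‑inspired scoring behind such a hybrid can be sketched as follows (our simplification, with arbitrary constants): a per‑entry frequency counter is decayed by idle time, so both how often and how recently an ID was accessed influence which one gets evicted.

```python
# Sketch of an approximate LFU score with idle-time decay, in the spirit of
# Redis's LFU mode. Function names and constants are illustrative.

def decayed_count(count: int, idle_minutes: int, decay_every: int = 10) -> int:
    """Subtract one from the counter per decay_every idle minutes."""
    return max(0, count - idle_minutes // decay_every)

def eviction_victim(entries, now):
    """entries: {id: (access_count, last_access_minute)}; evict lowest score."""
    return min(entries,
               key=lambda k: decayed_count(entries[k][0], now - entries[k][1]))

entries = {
    "hot_recent": (10, 99),  # accessed often and just now
    "hot_stale":  (10, 10),  # once hot, idle for a long time
    "cold":       (2, 95),   # recent but rarely accessed
}
# Pure LFU would evict "cold"; pure LRU would evict "hot_stale".
# The decayed score evicts "hot_stale": its old counter has worn away.
print(eviction_victim(entries, now=100))
```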

Experimental results show that TorchRec delivers 10‑15× speedups for models like DeepFM and DCN on hundred‑billion‑parameter workloads, and the dynamic embedding extension provides up to 3× performance gains on trillion‑scale production models compared to the original TensorFlow framework.

The authors recommend using vanilla TorchRec for sub‑hundred‑billion models and adopting the dynamic embedding extension for larger workloads to achieve both performance and migration benefits.

Tags: Recommendation systems, PyTorch, Dynamic Embedding, Large-Scale Training, TorchRec, GPU Embedding
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
