
Large‑Scale Recommendation System Training with TorchRec and Dynamic Embedding

This article explains how Tencent’s AI team uses the PyTorch‑based TorchRec library, together with a custom dynamic‑embedding extension, to train billion‑scale recommendation models efficiently. It covers the benefits of TorchRec, GPU‑resident embedding, optimized kernels, embedding partitioning strategies, experimental results, and practical deployment guidance.

DataFunTalk

In February 2022, the PyTorch team released TorchRec, an official recommendation library, which the Tencent AI team began testing in May and later contributed enhancements such as dynamic embedding that were merged into the main branch by September.

TorchRec offers several advantages over traditional TensorFlow recommendation frameworks: better developer experience with dynamic graphs, seamless version upgrades, faster CUDA features, and proven large‑scale production use at Meta (e.g., a 125‑billion‑parameter model on Instagram Reels).

The library introduces GPU‑resident embedding, where the embedding table is partitioned across multiple GPUs, eliminating costly CPU‑GPU data transfers and improving GPU utilization; it also supports Unified Virtual Memory (UVM) to extend effective GPU memory.
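To make the partitioning concrete, here is a minimal Python sketch (our illustration, not TorchRec’s API) of row‑wise modulo sharding: each GPU owns an interleaved slice of the one logical table, so every lookup resolves to exactly one device and never round‑trips through host memory. The names owner_of, local_index, and route_batch are invented for this sketch.

```python
# Illustrative sketch of row-wise (modulo) sharding of a single logical
# embedding table across several GPUs. Not TorchRec code.

def owner_of(row_id: int, num_gpus: int) -> int:
    """GPU that stores a given global embedding row."""
    return row_id % num_gpus

def local_index(row_id: int, num_gpus: int) -> int:
    """Index of the row inside its owner's local shard."""
    return row_id // num_gpus

def route_batch(ids, num_gpus):
    """Group a batch of lookup IDs by owning GPU, as an all-to-all would."""
    buckets = {g: [] for g in range(num_gpus)}
    for i in ids:
        buckets[owner_of(i, num_gpus)].append(local_index(i, num_gpus))
    return buckets

# Example: 4 GPUs, one logical table; each ID maps to exactly one shard.
print(route_batch([0, 1, 5, 8, 13], 4))
```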

Optimized GPU kernels in TorchRec (e.g., fused embedding‑lookup kernels) exploit warp‑level shuffle primitives such as CUDA's __shfl_sync to broadcast IDs within a warp, turning random memory accesses into largely sequential reads and achieving multiple‑fold speedups over native PyTorch kernels.
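The kernel itself lives in CUDA, but the access‑pattern idea behind it can be illustrated with a plain Python sketch (ours, not TorchRec code): once lookup IDs are grouped and ordered, scattered reads collapse into sequential runs that the hardware can serve far more cheaply.

```python
# CPU illustration of why ordered lookups are cheaper than scattered ones:
# count reads that are not adjacent to the previous read.

def non_sequential_jumps(ids):
    """Number of accesses that break a sequential run."""
    jumps = 0
    for prev, cur in zip(ids, ids[1:]):
        if cur != prev + 1:
            jumps += 1
    return jumps

random_order = [902, 3, 511, 4, 900, 901, 2, 510]
sorted_order = sorted(random_order)

# Ordered access touches memory in runs (2 jumps vs. 6 here).
print(non_sequential_jumps(random_order), non_sequential_jumps(sorted_order))
```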

For embedding partitioning, TorchRec provides flexible strategies (row‑wise, column‑wise, table‑wise, data‑parallel) and automated planners that select the optimal scheme based on hardware bandwidth, memory, and model size.
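As a rough illustration of the trade‑off such a planner weighs (a toy model with invented table sizes, not the actual TorchRec planner), compare the peak per‑GPU memory of table‑wise versus row‑wise sharding:

```python
# Toy cost model for two sharding schemes over hypothetical fp32 tables,
# each given as (num_rows, embedding_dim).

def per_gpu_bytes_table_wise(tables, num_gpus):
    """Table-wise: each table lives wholly on one GPU; greedy largest-first."""
    loads = [0] * num_gpus
    for rows, dim in sorted(tables, reverse=True):
        g = loads.index(min(loads))  # place on least-loaded GPU
        loads[g] += rows * dim * 4   # fp32 = 4 bytes
    return max(loads)

def per_gpu_bytes_row_wise(tables, num_gpus):
    """Row-wise: every table's rows are split evenly across all GPUs."""
    total = sum(rows * dim * 4 for rows, dim in tables)
    return total // num_gpus

tables = [(1_000_000, 64), (200_000, 128), (50_000, 32)]
# With one dominant table, row-wise balances memory much better.
print(per_gpu_bytes_table_wise(tables, 4), per_gpu_bytes_row_wise(tables, 4))
```

A real planner also weighs communication cost (row‑wise needs an all‑to‑all per lookup, table‑wise does not), which is why TorchRec automates the choice from hardware bandwidth and memory figures.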

When scaling to trillion‑parameter models, the team identified limitations of pure GPU embedding (insufficient GPU memory, lack of dynamic ID addition, migration challenges) and introduced a dynamic embedding design that combines GPU embedding with a parameter‑server (PS) for rarely accessed IDs.

The dynamic embedding workflow uses an ID transformer (implemented with a high‑performance hash map) to map global IDs to a limited GPU‑resident space; once the GPU space is full, infrequently used IDs are evicted to the PS, which acts only as a key‑value store, reducing communication overhead.
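A minimal sketch of the ID‑transformer idea, assuming a plain LRU policy and an in‑memory list standing in for the PS (the actual implementation uses a high‑performance hash map and a hybrid eviction policy):

```python
# Sketch: map unbounded global IDs into a fixed GPU-resident slot space,
# evicting the coldest ID to the parameter server when the space fills up.
# Class and method names are ours, for illustration only.

from collections import OrderedDict

class IdTransformer:
    def __init__(self, capacity: int):
        self.slots = OrderedDict()        # global id -> GPU slot, LRU order
        self.free = list(range(capacity)) # unused GPU slots
        self.evicted = []                 # ids handed off to the PS kv-store

    def transform(self, global_id: int) -> int:
        if global_id in self.slots:
            self.slots.move_to_end(global_id)  # mark as recently used
            return self.slots[global_id]
        if not self.free:  # GPU space full: evict the coldest id to the PS
            victim, slot = self.slots.popitem(last=False)
            self.evicted.append(victim)
            self.free.append(slot)
        slot = self.free.pop()
        self.slots[global_id] = slot
        return slot

t = IdTransformer(capacity=2)
print(t.transform(10), t.transform(20), t.transform(10), t.transform(30))
print(t.evicted)  # 20 was coldest, so it went to the PS
```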

Additional innovations include a cache‑friendly LRU/LFU hybrid eviction algorithm inspired by Redis and a multi‑GPU ID transformer that gathers and broadcasts transformer state efficiently.
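The Redis‑inspired scoring behind such a hybrid can be sketched as follows (our simplification, with arbitrary constants): a per‑entry frequency counter is decayed by idle time, so both how often and how recently an ID was accessed influence which one gets evicted.

```python
# Sketch of an approximate LFU score with idle-time decay, in the spirit of
# Redis's LFU mode. Function names and constants are illustrative.

def decayed_count(count: int, idle_minutes: int, decay_every: int = 10) -> int:
    """Subtract one from the counter per decay_every idle minutes."""
    return max(0, count - idle_minutes // decay_every)

def eviction_victim(entries, now):
    """entries: {id: (access_count, last_access_minute)}; evict lowest score."""
    return min(entries,
               key=lambda k: decayed_count(entries[k][0], now - entries[k][1]))

entries = {
    "hot_recent": (10, 99),  # accessed often and just now
    "hot_stale":  (10, 10),  # once hot, idle for a long time
    "cold":       (2, 95),   # recent but rarely accessed
}
# Pure LFU would evict "cold"; pure LRU would evict "hot_stale".
# The decayed score evicts "hot_stale": its old counter has worn away.
print(eviction_victim(entries, now=100))
```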

Experimental results show that TorchRec delivers 10‑15× speedups for models like DeepFM and DCN on hundred‑billion‑parameter workloads, and the dynamic embedding extension provides up to 3× performance gains on trillion‑scale production models compared to the original TensorFlow framework.

The authors recommend using vanilla TorchRec for sub‑hundred‑billion models and adopting the dynamic embedding extension for larger workloads to achieve both performance and migration benefits.

Tags: Recommendation systems, PyTorch, Dynamic Embedding, Large-Scale Training, TorchRec, GPU Embedding
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
