Optimizing TorchRec for Large‑Scale Recommendation Systems on PyTorch
This article details the performance‑focused optimizations applied to TorchRec, PyTorch's large‑scale recommendation system library, including CUDA graph capture, multithreaded kernel launches, pinned memory copies, and input‑distribution refinements that together achieve a 2.25× speedup on MLPerf DLRM‑DCNv2 across 16 DGX H100 nodes.
TorchRec is an official PyTorch library that enables large‑scale embedding training for recommendation systems; the optimization goals target the high‑scalability MLPerf DLRM benchmark on 16 DGX H100 nodes while preserving TorchRec's API and architecture.
The library consists of three layers: a user‑friendly API layer, an internal Python module layer handling sharding (BaseSparseFeatureDist, BaseEmbeddingDist, BaseEmbeddingLookup), and a low‑level C++ FBGEMM layer providing efficient sparse operators.
Typical usage follows three steps: (1) construct embeddings with EmbeddingBagConfig and EmbeddingBagCollection, (2) apply DistributedModelParallel for model‑parallel sharding of the embeddings, and (3) build a training pipeline with TrainPipelineSparseDist to enable pipelining and prefetching.
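The three steps above can be sketched roughly as follows. This is a hedged, minimal sketch using TorchRec's public API names (EmbeddingBagConfig, EmbeddingBagCollection, DistributedModelParallel, TrainPipelineSparseDist); the table sizes and feature names are illustrative, and steps 2 and 3 require an initialized distributed process group and a GPU, so they are shown only in outline.

```python
import torch
from torchrec import EmbeddingBagConfig, EmbeddingBagCollection

# Step 1: declare embedding tables (sizes here are illustrative).
tables = [
    EmbeddingBagConfig(
        name="t_user",
        embedding_dim=128,
        num_embeddings=1_000_000,
        feature_names=["user_id"],
    ),
    EmbeddingBagConfig(
        name="t_item",
        embedding_dim=128,
        num_embeddings=10_000_000,
        feature_names=["item_id"],
    ),
]
ebc = EmbeddingBagCollection(tables=tables, device=torch.device("meta"))

# Step 2 (requires torch.distributed to be initialized):
#   from torchrec.distributed.model_parallel import DistributedModelParallel
#   model = DistributedModelParallel(module=MyDLRM(ebc))
#
# Step 3 (pipelines H2D copy, input dist, and compute across iterations):
#   from torchrec.distributed.train_pipeline import TrainPipelineSparseDist
#   pipeline = TrainPipelineSparseDist(model, optimizer, device)
#   for batch in dataloader:
#       pipeline.progress(iter(dataloader))
```

Constructing the collection on the "meta" device defers parameter allocation until DistributedModelParallel decides where each shard lives.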
Performance measurements show that the original TorchRec iteration time of 7.6 ms was reduced to roughly 3.4 ms after optimization, a 2.25× acceleration, approaching the current MLPerf world record of 2.3 ms.
The training timeline is broken into five stages: input feature distribution, embedding forward (including all‑to‑all communication), MLP forward/backward with gradient all‑reduce, embedding backward, and MLP update. Stages 2‑5 can be overlapped via pipelining, while stage 1 (input distribution) is independent and can be prefetched for the next iteration.
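Because stage 1 has no dependency on the current iteration's compute, it can be launched for the *next* batch while stages 2‑5 run on the current one. A minimal pure‑Python sketch of that prefetch pattern (the stage bodies are placeholders, not TorchRec internals):

```python
from concurrent.futures import ThreadPoolExecutor

def input_dist(batch):
    # Stage 1: input feature distribution (placeholder work).
    return batch * 2

def train_step(dist_batch):
    # Stages 2-5: embedding fwd, MLP fwd/bwd + all-reduce,
    # embedding bwd, MLP update (placeholder work).
    return dist_batch + 1

def run(batches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Prefetch stage 1 for the first batch.
        pending = pool.submit(input_dist, batches[0])
        for i in range(len(batches)):
            dist = pending.result()
            # Kick off stage 1 of the NEXT batch before stages 2-5 run,
            # hiding its latency behind the current iteration's compute.
            if i + 1 < len(batches):
                pending = pool.submit(input_dist, batches[i + 1])
            results.append(train_step(dist))
    return results

print(run([1, 2, 3]))  # → [3, 5, 7]
```

In the real pipeline the "worker" is a CUDA stream rather than a thread, but the dependency structure is the same.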
Optimizations are grouped into two categories. The first reduces CPU launch latency by (a) capturing the MLP and all‑reduce kernels with CUDA graphs, (b) launching the input‑distribution kernels on a separate thread to overlap with later kernels, (c) using pinned‑memory copies for D2H transfers, and (d) disabling NCCL’s record_stream via the TORCH_NCCL_AVOID_RECORD_STREAMS environment variable to avoid unnecessary polling.
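Technique (a) can be illustrated with PyTorch's CUDA graph API: after warm‑up, the MLP's kernels are captured once, and each subsequent iteration replays the whole captured region with a single CPU‑side call instead of one launch per kernel. This is a hedged sketch (not TorchRec's actual capture code); the module shape and buffer names are illustrative, and it requires a CUDA device, so it is not runnable on CPU‑only machines.

```python
import torch

# Illustrative MLP; real DLRM MLPs are larger.
mlp = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).cuda()
static_input = torch.randn(32, 128, device="cuda")

# Warm up on a side stream (required before CUDA graph capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        mlp(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the forward pass into a graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = mlp(static_input)

# Steady state: refill the static buffer, then replay the graph.
# The replay costs one CPU launch regardless of how many kernels it contains.
static_input.copy_(torch.randn(32, 128, device="cuda"))
graph.replay()
```

Techniques (c) and (d) are configuration‑level: D2H copies go through `torch.Tensor.pin_memory()` buffers so they can be asynchronous, and setting the environment variable `TORCH_NCCL_AVOID_RECORD_STREAMS=1` switches PyTorch's NCCL path away from `recordStream`‑based lifetime tracking.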
The second category streamlines the input‑distribution path by eliminating an unnecessary all‑to‑all operation and its associated D2H + sync, and by reusing the previous iteration’s data when the batch size remains unchanged.
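The reuse idea can be sketched as a small cache keyed on batch size: the expensive exchange (all‑to‑all plus D2H copy and synchronization) only recomputes the split metadata when the batch size actually changes. The class and method names below are illustrative, not TorchRec's internal names.

```python
class InputDistCache:
    """Sketch: reuse the previous iteration's split metadata when the
    batch size is unchanged, skipping the all-to-all + D2H + sync."""

    def __init__(self):
        self._batch_size = None
        self._splits = None
        self.recomputes = 0  # for illustration: counts expensive paths taken

    def _compute_splits(self, batch_size, world_size):
        # Placeholder for the expensive exchange of per-rank splits.
        self.recomputes += 1
        return [batch_size // world_size] * world_size

    def get_splits(self, batch_size, world_size):
        if batch_size != self._batch_size:
            self._splits = self._compute_splits(batch_size, world_size)
            self._batch_size = batch_size
        return self._splits

cache = InputDistCache()
cache.get_splits(1024, 8)  # first iteration: computes
cache.get_splits(1024, 8)  # same batch size: reuses cached splits
print(cache.recomputes)    # → 1
```

Since training batch sizes are constant in steady state, the expensive path runs essentially once.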
Combined, these techniques substantially lower CPU overhead and hide input‑distribution latency, delivering consistent speedups across the pipeline without altering TorchRec’s high‑level API.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.