Optimizing TorchRec for Large‑Scale Recommendation Systems on PyTorch
This article details the performance‑focused optimizations applied to TorchRec, PyTorch's large‑scale recommendation system library, including CUDA graph capture, multithreaded kernel launches, pinned memory copies, and input‑distribution refinements that together achieve a 2.25× speedup on MLPerf DLRM‑DCNv2 across 16 DGX H100 nodes.
TorchRec is an official PyTorch library that enables large‑scale embedding training for recommendation systems; the optimization goals target the high‑scalability MLPerf DLRM benchmark on 16 DGX H100 nodes while preserving TorchRec's API and architecture.
The library consists of three layers: a user‑friendly API layer, an internal Python module layer handling sharding (BaseSparseFeatureDist, BaseEmbeddingDist, BaseEmbeddingLookup), and a low‑level C++ FBGEMM layer providing efficient sparse operators.
Typical usage follows three steps: (1) construct embeddings with EmbeddingBagConfig and EmbeddingBagCollection, (2) apply DistributedModelParallel for model‑parallel sharding of the embeddings, and (3) build a training pipeline with TrainPipelineSparseDist to enable pipelining and prefetching.
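The three steps above can be sketched roughly as follows. This is a hedged, minimal sketch using TorchRec's public API names (EmbeddingBagConfig, EmbeddingBagCollection, DistributedModelParallel, TrainPipelineSparseDist); the table sizes and feature names are illustrative, and steps 2 and 3 require an initialized distributed process group and a GPU, so they are shown only in outline.

```python
import torch
from torchrec import EmbeddingBagConfig, EmbeddingBagCollection

# Step 1: declare embedding tables (sizes here are illustrative).
tables = [
    EmbeddingBagConfig(
        name="t_user",
        embedding_dim=128,
        num_embeddings=1_000_000,
        feature_names=["user_id"],
    ),
    EmbeddingBagConfig(
        name="t_item",
        embedding_dim=128,
        num_embeddings=10_000_000,
        feature_names=["item_id"],
    ),
]
ebc = EmbeddingBagCollection(tables=tables, device=torch.device("meta"))

# Step 2 (requires torch.distributed to be initialized):
#   from torchrec.distributed.model_parallel import DistributedModelParallel
#   model = DistributedModelParallel(module=MyDLRM(ebc))
#
# Step 3 (pipelines H2D copy, input dist, and compute across iterations):
#   from torchrec.distributed.train_pipeline import TrainPipelineSparseDist
#   pipeline = TrainPipelineSparseDist(model, optimizer, device)
#   for batch in dataloader:
#       pipeline.progress(iter(dataloader))
```

Constructing the collection on the "meta" device defers parameter allocation until DistributedModelParallel decides where each shard lives.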
Performance measurements show that the original TorchRec iteration time of 7.6 ms was reduced to roughly 3.4 ms after optimization, a 2.25× acceleration, approaching the current MLPerf world record of 2.3 ms.
The training timeline is broken into five stages: input feature distribution, embedding forward (including all‑to‑all communication), MLP forward/backward with gradient all‑reduce, embedding backward, and MLP update. Stages 2‑5 can be overlapped via pipelining, while stage 1 (input distribution) is independent and can be prefetched for the next iteration.
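Because stage 1 has no dependency on the current iteration's compute, it can be launched for the *next* batch while stages 2‑5 run on the current one. A minimal pure‑Python sketch of that prefetch pattern (the stage bodies are placeholders, not TorchRec internals):

```python
from concurrent.futures import ThreadPoolExecutor

def input_dist(batch):
    # Stage 1: input feature distribution (placeholder work).
    return batch * 2

def train_step(dist_batch):
    # Stages 2-5: embedding fwd, MLP fwd/bwd + all-reduce,
    # embedding bwd, MLP update (placeholder work).
    return dist_batch + 1

def run(batches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Prefetch stage 1 for the first batch.
        pending = pool.submit(input_dist, batches[0])
        for i in range(len(batches)):
            dist = pending.result()
            # Kick off stage 1 of the NEXT batch before stages 2-5 run,
            # hiding its latency behind the current iteration's compute.
            if i + 1 < len(batches):
                pending = pool.submit(input_dist, batches[i + 1])
            results.append(train_step(dist))
    return results

print(run([1, 2, 3]))  # → [3, 5, 7]
```

In the real pipeline the "worker" is a CUDA stream rather than a thread, but the dependency structure is the same.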
Optimizations are grouped into two categories. The first reduces CPU launch latency by (a) capturing the MLP and all‑reduce kernels with CUDA graphs, (b) launching the input‑distribution kernels on a separate thread to overlap with later kernels, (c) using pinned‑memory copies for D2H transfers, and (d) disabling NCCL’s record_stream via the TORCH_NCCL_AVOID_RECORD_STREAMS environment variable to avoid unnecessary polling.
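Technique (a) can be illustrated with PyTorch's CUDA graph API: after warm‑up, the MLP's kernels are captured once, and each subsequent iteration replays the whole captured region with a single CPU‑side call instead of one launch per kernel. This is a hedged sketch (not TorchRec's actual capture code); the module shape and buffer names are illustrative, and it requires a CUDA device, so it is not runnable on CPU‑only machines.

```python
import torch

# Illustrative MLP; real DLRM MLPs are larger.
mlp = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).cuda()
static_input = torch.randn(32, 128, device="cuda")

# Warm up on a side stream (required before CUDA graph capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        mlp(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the forward pass into a graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = mlp(static_input)

# Steady state: refill the static buffer, then replay the graph.
# The replay costs one CPU launch regardless of how many kernels it contains.
static_input.copy_(torch.randn(32, 128, device="cuda"))
graph.replay()
```

Techniques (c) and (d) are configuration‑level: D2H copies go through `torch.Tensor.pin_memory()` buffers so they can be asynchronous, and setting the environment variable `TORCH_NCCL_AVOID_RECORD_STREAMS=1` switches PyTorch's NCCL path away from `recordStream`‑based lifetime tracking.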
The second category streamlines the input‑distribution path by eliminating an unnecessary all‑to‑all operation and its associated D2H + sync, and by reusing the previous iteration’s data when the batch size remains unchanged.
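The reuse idea can be sketched as a small cache keyed on batch size: the expensive exchange (all‑to‑all plus D2H copy and synchronization) only recomputes the split metadata when the batch size actually changes. The class and method names below are illustrative, not TorchRec's internal names.

```python
class InputDistCache:
    """Sketch: reuse the previous iteration's split metadata when the
    batch size is unchanged, skipping the all-to-all + D2H + sync."""

    def __init__(self):
        self._batch_size = None
        self._splits = None
        self.recomputes = 0  # for illustration: counts expensive paths taken

    def _compute_splits(self, batch_size, world_size):
        # Placeholder for the expensive exchange of per-rank splits.
        self.recomputes += 1
        return [batch_size // world_size] * world_size

    def get_splits(self, batch_size, world_size):
        if batch_size != self._batch_size:
            self._splits = self._compute_splits(batch_size, world_size)
            self._batch_size = batch_size
        return self._splits

cache = InputDistCache()
cache.get_splits(1024, 8)  # first iteration: computes
cache.get_splits(1024, 8)  # same batch size: reuses cached splits
print(cache.recomputes)    # → 1
```

Since training batch sizes are constant in steady state, the expensive path runs essentially once.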
Combined, these techniques substantially lower CPU overhead and hide input‑distribution latency, delivering consistent speedups across the pipeline without altering TorchRec’s high‑level API.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.