
Next-Generation Multi‑GPU Synchronous Training Architecture for Large‑Scale Sparse Recommendation Models

This article details JD Retail's evolution from TensorFlow‑based sparse training to a custom high‑performance parameter server and, ultimately, a fully GPU‑accelerated, multi‑node, multi‑card synchronous training framework. The new framework leverages GPU‑RDMA, two‑level CPU‑DRAM/GPU‑HBM caching, and pipeline parallelism to overcome the storage, I/O, and compute challenges of trillion‑parameter recommendation systems.

JD Retail Technology

In recent years, the recommendation field has seen rapid growth in model size and computational complexity, prompting the need for advanced hardware and training architectures. JD Retail's advertising technology team introduced a new multi‑machine, multi‑card, fully GPU‑synchronous training architecture that utilizes GPU‑RDMA for high‑bandwidth parameter communication and a five‑stage pipeline parallelism to dramatically improve data exchange efficiency.

1. Introduction

The advertising training framework has undergone two major architectural evolutions. The first phase leveraged TensorFlow and a self‑developed high‑performance parameter server to support TB‑scale sparse models, addressing the limitations of static embeddings. The second phase, driven by NVIDIA A100 GPUs and advanced interconnects, shifted to a training solution built on deep software‑hardware co‑design.

2. Evolving Large‑Scale Sparse Training Solutions

2.1 TB‑Scale Sparse Training with Distributed Parameter Servers

TensorFlow’s static embedding mechanism restricts parameter scale and online learning capabilities. JD’s custom dynamic‑embedding high‑performance parameter server maps embeddings to distinct memory spaces, implements two‑level retrieval, and supports high‑concurrency reads/writes, enabling efficient storage of massive sparse parameters.
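The two‑level retrieval idea can be illustrated with a minimal Python sketch: a stable hash first routes each feature ID to a shard (level 1), and a per‑shard hash map then resolves the embedding (level 2), allocating a slot on first access. Class and method names here are hypothetical, and the real server shards across machines with concurrent access; this only shows the lookup structure.

```python
import hashlib

class DynamicEmbeddingTable:
    """Toy sketch of a dynamic-embedding store. Features hash to a shard
    (level-1 retrieval), then resolve in that shard's map (level-2).
    Unseen feature IDs get a slot on first access, so parameter count
    grows with the data rather than being fixed at graph-build time."""

    def __init__(self, num_shards=4, dim=8):
        self.num_shards = num_shards
        self.dim = dim
        self.shards = [{} for _ in range(num_shards)]  # level-2 hash maps

    def _shard_of(self, feature_id):
        # Level 1: a stable hash routes the ID to one shard; in a real
        # system this also selects the parameter-server node.
        digest = hashlib.md5(str(feature_id).encode()).hexdigest()
        return int(digest, 16) % self.num_shards

    def lookup(self, feature_id):
        # Level 2: per-shard map, with lazy initialization on first access.
        shard = self.shards[self._shard_of(feature_id)]
        if feature_id not in shard:
            shard[feature_id] = [0.0] * self.dim
        return shard[feature_id]

table = DynamicEmbeddingTable()
vec = table.lookup("user:42")   # allocates a fresh 8-dim embedding
assert len(vec) == 8
```

Because the level‑1 hash is stable, repeated lookups of the same ID always land on the same shard, which is what makes sharded dynamic embeddings consistent across workers.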

Figure 1: Dynamic Embedding Parameter Server & Training Architecture

The server achieved a 25% performance boost over native TensorFlow PS, saved 15‑20% memory compared to Alibaba DeepRec and Tencent TFRA, and enabled online learning with minute‑level model freshness.

2.2 Full‑GPU Training Powered by High‑Performance Compute

Traditional parameter‑server approaches struggle with dense Transformer models, complex communication topologies, and I/O bottlenecks. Leveraging NVIDIA A100 GPUs, NVLink, and InfiniBand RDMA, JD designed a full‑GPU solution that addresses three core challenges:

Storage: Multi‑hundred‑GB sparse models exceed single‑GPU memory; a two‑level GPU‑HBM + CPU‑DRAM cache stores all parameters.

I/O: High‑throughput GPU‑to‑GPU RDMA replaces CPU‑to‑CPU TCP, increasing bandwidth from 1 GB/s to 600 GB/s.

Compute: Heterogeneous CPU‑GPU pipeline parallelism balances workloads and maximizes utilization.

Figure 4: Full‑GPU Training Architecture

The design includes a two‑level cross‑cache parameter server (GPU‑HBM as level‑1, CPU‑DRAM as level‑2) that can be extended with SSD for a three‑tier HBM‑DRAM‑SSD solution, supporting trillion‑parameter training.
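A hedged sketch of the two‑tier idea, using a small "HBM" tier with LRU eviction into a larger "DRAM" tier (capacities, names, and eviction policy are illustrative assumptions, not the production design):

```python
from collections import OrderedDict

class TwoTierCache:
    """Toy HBM/DRAM cache: a small hot tier holds recently used embeddings;
    on overflow, the least-recently-used entry spills to a larger cold tier.
    A hot-tier miss promotes the entry back from the cold tier."""

    def __init__(self, hbm_capacity):
        self.hbm_capacity = hbm_capacity
        self.hbm = OrderedDict()   # level 1: fast, small
        self.dram = {}             # level 2: slower, large

    def get(self, key):
        if key in self.hbm:
            self.hbm.move_to_end(key)        # mark as recently used
            return self.hbm[key]
        value = self.dram.pop(key, None)     # promote on HBM miss
        if value is not None:
            self.put(key, value)
        return value

    def put(self, key, value):
        self.hbm[key] = value
        self.hbm.move_to_end(key)
        if len(self.hbm) > self.hbm_capacity:
            cold_key, cold_val = self.hbm.popitem(last=False)
            self.dram[cold_key] = cold_val   # evict coldest entry to DRAM

cache = TwoTierCache(hbm_capacity=2)
cache.put("a", 1); cache.put("b", 2); cache.put("c", 3)  # "a" spills to DRAM
assert cache.get("a") == 1                                # promoted back to HBM
```

Extending the same pattern with a third, SSD‑backed tier gives the HBM‑DRAM‑SSD hierarchy described above: each tier absorbs evictions from the one above it.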

Parameter synchronization uses GPU‑RDMA collective communications (AllReduce, AllToAll), achieving orders‑of‑magnitude bandwidth improvements and ensuring model accuracy.
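The communication pattern behind AllReduce can be shown with a toy single‑process simulation of the ring algorithm: a reduce‑scatter phase in which each rank accumulates one chunk of the gradient, followed by an allgather phase that circulates the reduced chunks until every rank holds the full sum. Production training runs this over NVLink/RDMA via collective libraries; this sketch only illustrates the data movement.

```python
def ring_allreduce(grads):
    """Simulate ring AllReduce for n ranks, each holding n gradient chunks.
    After reduce-scatter, rank r owns the fully summed chunk (r+1) mod n;
    allgather then propagates the reduced chunks to every rank."""
    n = len(grads)
    out = [list(g) for g in grads]
    # Reduce-scatter: in step s, rank r adds the chunk its ring neighbor
    # accumulated in the previous step.
    for s in range(n - 1):
        for r in range(n):
            c = (r - 1 - s) % n
            out[r][c] += out[(r - 1) % n][c]
    # Allgather: circulate the reduced chunks around the ring.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            out[r][c] = out[(r - 1) % n][c]
    return out

# Three ranks, each with a 3-chunk gradient; every rank ends with the sum.
result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
assert all(rank == [12, 15, 18] for rank in result)
```

The ring pattern is bandwidth‑optimal: each rank sends and receives only 2(n−1)/n of the gradient regardless of cluster size, which is why it pairs well with high‑bandwidth GPU‑RDMA links.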

CPU‑GPU heterogeneous pipeline parallelism partitions the model into CPU‑intensive and GPU‑intensive sub‑graphs, deploying them on a heterogeneous cluster to resolve compute imbalance and maximize throughput.
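The pipelining idea can be sketched with threads and bounded queues: each stage consumes batches from the previous stage and feeds the next, so a CPU‑heavy stage (e.g. sample parsing) overlaps with a GPU‑heavy stage (e.g. dense forward/backward). The stage functions below are illustrative stand‑ins, not the actual sub‑graph split.

```python
import queue
import threading

def run_pipeline(batches, stages):
    """Run batches through a chain of stage functions, one thread per stage,
    connected by bounded queues so stages execute concurrently. A None
    sentinel shuts each stage down in order."""
    qs = [queue.Queue(maxsize=2) for _ in range(len(stages) + 1)]

    def worker(fn, q_in, q_out):
        while True:
            item = q_in.get()
            if item is None:          # sentinel: propagate shutdown downstream
                q_out.put(None)
                break
            q_out.put(fn(item))

    threads = [threading.Thread(target=worker, args=(fn, qs[i], qs[i + 1]))
               for i, fn in enumerate(stages)]
    for t in threads:
        t.start()
    for batch in batches:
        qs[0].put(batch)
    qs[0].put(None)
    results = []
    while (item := qs[-1].get()) is not None:
        results.append(item)
    for t in threads:
        t.join()
    return results

out = run_pipeline([1, 2, 3],
                   [lambda x: x * 10,   # stage 1: parse samples (CPU)
                    lambda x: x + 1,    # stage 2: embedding lookup
                    lambda x: x * 2])   # stage 3: dense compute (GPU)
assert out == [22, 42, 62]
```

The bounded queues provide backpressure: a fast stage blocks instead of racing ahead, which keeps the heterogeneous stages load‑balanced, mirroring the goal of the distributed framework in Figure 6.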

Figure 6: Distributed Pipeline Parallel Framework across CPU & GPU

Scaling across multiple machines over InfiniBand further increased training speed by more than 1.85×, establishing industry‑leading performance.

3. Conclusion and Outlook

The new GPU‑centric architecture has been deployed across JD’s advertising business lines, expanding CTR model size from 30 GB to 130 GB with a 55% training speed increase without additional resources, and enabling rapid model scaling to hundreds of gigabytes, boosting iteration efficiency by 400%.

Future work will focus on deeper integration of algorithms, compute, and architecture, as well as unified online‑offline designs, inviting collaborators to explore this frontier.

Tags: GPU Acceleration, Recommendation Systems, Distributed Training, AI Infrastructure, Parameter Server, Sparse Embeddings
Written by JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
