GPU Throughput and Low‑Latency Optimization Practices in JD Advertising
This article presents JD Advertising's technical practices for improving GPU throughput and reducing latency in large‑scale recommendation scenarios, covering system challenges, storage and compute optimizations for training, low‑latency inference techniques, and compiler extensions to handle massive sparse models.
The talk walks through JD Advertising's business scenarios and shares hands‑on engineering work on GPU throughput and low‑latency optimization for large‑scale recommendation systems.
JD Advertising serves hundreds of thousands of queries per second (QPS) under millisecond‑level response requirements. Its models have evolved from shallow DNNs to Transformer‑based networks, with parameter sizes growing from hundreds of GB to the TB scale, using TensorFlow for both training and inference.
Key challenges include high sparsity causing I/O bottlenecks, model size exceeding GPU memory limits, and CPU‑GPU resource contention during feature computation.
Training Scenario – Storage Challenge: Multi‑node, multi‑GPU training is combined with a custom CPU‑DRAM sparse‑parameter server to extend model capacity beyond GPU memory.
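The CPU‑DRAM sparse‑parameter‑server idea can be illustrated with a minimal sketch: embedding rows live in host memory keyed by feature ID, so total model capacity is bounded by DRAM rather than GPU HBM. The class name, pull/push interface, and lazy row initialization below are illustrative assumptions, not JD's actual implementation.

```python
import numpy as np

class SparseParamServer:
    """Minimal sketch of a CPU-DRAM sparse parameter server: embedding
    rows are stored in host memory keyed by feature ID."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.table = {}  # feature_id -> np.ndarray embedding row
        self.rng = np.random.default_rng(seed)

    def pull(self, ids):
        """Fetch rows for a batch of feature IDs, lazily initializing
        unseen IDs (common for hash-bucketed sparse features)."""
        rows = []
        for fid in ids:
            if fid not in self.table:
                self.table[fid] = self.rng.normal(
                    0, 0.01, self.dim).astype(np.float32)
            rows.append(self.table[fid])
        return np.stack(rows)

    def push(self, ids, grads, lr=0.01):
        """Apply SGD updates only to the rows touched by this batch."""
        for fid, g in zip(ids, grads):
            self.table[fid] -= lr * g
```

Only the rows referenced by a batch are pulled to the device and pushed back, which is what makes TB‑scale sparse models trainable with comparatively small GPU memory.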
Training Scenario – Compute Challenge: A heterogeneous pipeline splits CPU‑intensive feature computation from GPU‑intensive model training, deploying them on separate clusters and using distributed pipeline parallelism to balance load and improve GPU utilization.
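The pipeline split can be sketched with a two‑stage producer/consumer: a CPU stage extracts features for the next batch while the model stage consumes the previous one, so the two workloads overlap instead of contending. The function names and queue depth below are assumptions for illustration; the real system runs the stages on separate clusters.

```python
import queue
import threading

def run_pipeline(raw_batches, extract_fn, train_fn, depth=4):
    """Two-stage pipeline: extract_fn (CPU-intensive feature work)
    runs ahead of train_fn (model step) through a bounded queue."""
    q = queue.Queue(maxsize=depth)  # bounded buffer balances the stages
    SENTINEL = object()

    def producer():
        for batch in raw_batches:
            q.put(extract_fn(batch))   # stage 1: feature extraction
        q.put(SENTINEL)                # signal end of input

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(train_fn(item))  # stage 2: training step
    return results
```

The bounded queue is the load‑balancing knob: if the GPU stage stalls, the CPU stage blocks rather than buffering unboundedly, and vice versa.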
Training Scenario – I/O Challenge: Embedding I/O accounts for over 30% of training time. A GPU‑HBM parameter server is introduced as a first‑level cache, with the CPU parameter server acting as the second level. All‑to‑All communication is used for sparse parameters and All‑Reduce for dense parameters, and a fused Adam optimizer reduces memory accesses.
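The two‑level cache behaves like an LRU in front of a larger store: hot embedding rows stay in a bounded "HBM" cache, and misses fall through to the "CPU" tier. The class below is a sketch under that assumption, with hit/miss counters to make the caching effect observable; it is not JD's actual cache code.

```python
from collections import OrderedDict

class TwoLevelEmbeddingCache:
    """Sketch of a GPU-HBM L1 cache over a CPU-DRAM L2 store for
    embedding rows, with LRU eviction when the L1 is full."""

    def __init__(self, capacity, cpu_store):
        self.capacity = capacity
        self.hbm = OrderedDict()   # L1: bounded, LRU-ordered
        self.cpu = cpu_store       # L2: large host-memory store
        self.hits = 0
        self.misses = 0

    def lookup(self, fid):
        if fid in self.hbm:
            self.hits += 1
            self.hbm.move_to_end(fid)  # mark row as recently used
            return self.hbm[fid]
        self.misses += 1
        row = self.cpu[fid]            # fall through to the CPU tier
        self.hbm[fid] = row
        if len(self.hbm) > self.capacity:
            self.hbm.popitem(last=False)  # evict the coldest row
        return row
```

Because real sparse‑feature access is highly skewed, even a small HBM cache absorbs most lookups, which is what cuts the 30%+ embedding I/O cost.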
Inference Scenario – Low‑Latency Challenge: A custom TensorBatch method aggregates online requests to balance throughput and latency, while multi‑stream TensorFlow extensions (multiple CUDA streams, multiple CUDA contexts, and NVIDIA MPS) enable concurrent request processing.
Compiler Extensions: The deep‑learning compiler is enhanced with graph partitioning and bucket pre‑compilation to limit the number of compiled artifacts, and asynchronous compilation handles long‑tail traffic without blocking online inference.
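Bucket pre‑compilation bounds the number of compiled artifacts by rounding dynamic shapes up to a fixed set of bucket sizes and padding inputs to fit. The bucket list, the `compiled_graph` stand‑in, and the padding scheme below are illustrative assumptions, not the actual compiler extension.

```python
import bisect
import functools

# Assumed bucket sizes: any dynamic batch size is rounded up to one
# of these, so at most len(BUCKETS) graph variants ever get compiled.
BUCKETS = [1, 2, 4, 8, 16, 32, 64, 128]

def bucket_for(batch_size):
    """Round a dynamic batch size up to the nearest bucket."""
    i = bisect.bisect_left(BUCKETS, batch_size)
    if i == len(BUCKETS):
        raise ValueError("batch size exceeds the largest bucket")
    return BUCKETS[i]

@functools.lru_cache(maxsize=None)
def compiled_graph(bucket):
    # Stand-in for an expensive XLA-style compile of one bucket size;
    # lru_cache ensures each bucket is compiled at most once.
    return f"graph[{bucket}]"

def run(batch):
    """Pad the batch to its bucket size and dispatch to the cached
    pre-compiled graph for that shape."""
    b = bucket_for(len(batch))
    padded = batch + [0] * (b - len(batch))
    return compiled_graph(b), padded
```

Asynchronous compilation complements this: an unseen long‑tail shape can fall back to an interpreted or nearest‑bucket path while its graph compiles in the background, so online inference is never blocked.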
In summary, JD Advertising emphasizes I/O optimizations (kernel launch, host‑device transfers, and network communication) for massive sparse models and plans to further integrate dense‑sparse workloads, tensor and pipeline parallelism, and advanced GPU utilization in future systems.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.