GPU Throughput and Low‑Latency Optimization Practices in JD Advertising
This article presents JD Advertising's technical practices for improving GPU throughput and reducing latency in large‑scale recommendation scenarios, covering system challenges, storage and compute optimizations for training, low‑latency inference techniques, and compiler extensions to handle massive sparse models.
The talk walks through JD Advertising's business scenarios and shares hands‑on engineering work on GPU throughput and low‑latency optimization for large‑scale recommendation systems.
JD Advertising serves hundreds of thousands of queries per second (QPS) under millisecond‑level response requirements. Its models have evolved from shallow DNNs to Transformer‑based networks, with parameter sizes growing from hundreds of GB to the TB scale, using TensorFlow for both training and inference.
Key challenges include high sparsity causing I/O bottlenecks, model size exceeding GPU memory limits, and CPU‑GPU resource contention during feature computation.
Training Scenario – Storage Challenge: Multi‑node, multi‑GPU training is combined with a custom CPU‑DRAM sparse‑parameter server to extend model capacity beyond GPU memory.
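The CPU‑DRAM sparse‑parameter‑server idea can be illustrated with a minimal sketch: embedding rows live in host memory keyed by feature ID, so total model capacity is bounded by DRAM rather than GPU HBM. The class name, pull/push interface, and lazy row initialization below are illustrative assumptions, not JD's actual implementation.

```python
import numpy as np

class SparseParamServer:
    """Minimal sketch of a CPU-DRAM sparse parameter server: embedding
    rows are stored in host memory keyed by feature ID."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.table = {}  # feature_id -> np.ndarray embedding row
        self.rng = np.random.default_rng(seed)

    def pull(self, ids):
        """Fetch rows for a batch of feature IDs, lazily initializing
        unseen IDs (common for hash-bucketed sparse features)."""
        rows = []
        for fid in ids:
            if fid not in self.table:
                self.table[fid] = self.rng.normal(
                    0, 0.01, self.dim).astype(np.float32)
            rows.append(self.table[fid])
        return np.stack(rows)

    def push(self, ids, grads, lr=0.01):
        """Apply SGD updates only to the rows touched by this batch."""
        for fid, g in zip(ids, grads):
            self.table[fid] -= lr * g
```

Only the rows referenced by a batch are pulled to the device and pushed back, which is what makes TB‑scale sparse models trainable with comparatively small GPU memory.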
Training Scenario – Compute Challenge: A heterogeneous pipeline splits CPU‑intensive feature computation from GPU‑intensive model training, deploying them on separate clusters and using distributed pipeline parallelism to balance load and improve GPU utilization.
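The pipeline split can be sketched with a two‑stage producer/consumer: a CPU stage extracts features for the next batch while the model stage consumes the previous one, so the two workloads overlap instead of contending. The function names and queue depth below are assumptions for illustration; the real system runs the stages on separate clusters.

```python
import queue
import threading

def run_pipeline(raw_batches, extract_fn, train_fn, depth=4):
    """Two-stage pipeline: extract_fn (CPU-intensive feature work)
    runs ahead of train_fn (model step) through a bounded queue."""
    q = queue.Queue(maxsize=depth)  # bounded buffer balances the stages
    SENTINEL = object()

    def producer():
        for batch in raw_batches:
            q.put(extract_fn(batch))   # stage 1: feature extraction
        q.put(SENTINEL)                # signal end of input

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(train_fn(item))  # stage 2: training step
    return results
```

The bounded queue is the load‑balancing knob: if the GPU stage stalls, the CPU stage blocks rather than buffering unboundedly, and vice versa.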
Training Scenario – I/O Challenge: Embedding I/O accounts for over 30% of training time. A GPU‑HBM parameter server is introduced as a first‑level cache, with the CPU parameter server acting as the second level. All‑to‑All communication is used for sparse parameters and All‑Reduce for dense parameters, and a fused Adam optimizer reduces memory accesses.
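The two‑level cache behaves like an LRU in front of a larger store: hot embedding rows stay in a bounded "HBM" cache, and misses fall through to the "CPU" tier. The class below is a sketch under that assumption, with hit/miss counters to make the caching effect observable; it is not JD's actual cache code.

```python
from collections import OrderedDict

class TwoLevelEmbeddingCache:
    """Sketch of a GPU-HBM L1 cache over a CPU-DRAM L2 store for
    embedding rows, with LRU eviction when the L1 is full."""

    def __init__(self, capacity, cpu_store):
        self.capacity = capacity
        self.hbm = OrderedDict()   # L1: bounded, LRU-ordered
        self.cpu = cpu_store       # L2: large host-memory store
        self.hits = 0
        self.misses = 0

    def lookup(self, fid):
        if fid in self.hbm:
            self.hits += 1
            self.hbm.move_to_end(fid)  # mark row as recently used
            return self.hbm[fid]
        self.misses += 1
        row = self.cpu[fid]            # fall through to the CPU tier
        self.hbm[fid] = row
        if len(self.hbm) > self.capacity:
            self.hbm.popitem(last=False)  # evict the coldest row
        return row
```

Because real sparse‑feature access is highly skewed, even a small HBM cache absorbs most lookups, which is what cuts the 30%+ embedding I/O cost.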
Inference Scenario – Low‑Latency Challenge: A custom TensorBatch method aggregates online requests to balance throughput and latency, while multi‑stream TensorFlow extensions (multiple CUDA streams, multiple CUDA contexts, and NVIDIA MPS) enable concurrent request processing.
Compiler Extensions: The deep‑learning compiler is enhanced with graph partitioning and bucket pre‑compilation to limit the number of compiled artifacts, and asynchronous compilation handles long‑tail traffic without blocking online inference.
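Bucket pre‑compilation bounds the number of compiled artifacts by rounding dynamic shapes up to a fixed set of bucket sizes and padding inputs to fit. The bucket list, the `compiled_graph` stand‑in, and the padding scheme below are illustrative assumptions, not the actual compiler extension.

```python
import bisect
import functools

# Assumed bucket sizes: any dynamic batch size is rounded up to one
# of these, so at most len(BUCKETS) graph variants ever get compiled.
BUCKETS = [1, 2, 4, 8, 16, 32, 64, 128]

def bucket_for(batch_size):
    """Round a dynamic batch size up to the nearest bucket."""
    i = bisect.bisect_left(BUCKETS, batch_size)
    if i == len(BUCKETS):
        raise ValueError("batch size exceeds the largest bucket")
    return BUCKETS[i]

@functools.lru_cache(maxsize=None)
def compiled_graph(bucket):
    # Stand-in for an expensive XLA-style compile of one bucket size;
    # lru_cache ensures each bucket is compiled at most once.
    return f"graph[{bucket}]"

def run(batch):
    """Pad the batch to its bucket size and dispatch to the cached
    pre-compiled graph for that shape."""
    b = bucket_for(len(batch))
    padded = batch + [0] * (b - len(batch))
    return compiled_graph(b), padded
```

Asynchronous compilation complements this: an unseen long‑tail shape can fall back to an interpreted or nearest‑bucket path while its graph compiles in the background, so online inference is never blocked.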
In summary, JD Advertising emphasizes I/O optimizations (kernel launch, host‑device transfers, and network communication) for massive sparse models and plans to further integrate dense‑sparse workloads, tensor and pipeline parallelism, and advanced GPU utilization in future systems.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.