
GPU Optimization Practices for Training and Inference in JD Advertising Recommendation Systems

The article details JD Advertising's technical challenges and solutions for large‑scale sparse recommendation models, describing GPU‑focused storage, compute and I/O optimizations for both training and low‑latency inference, including distributed pipelines, heterogeneous deployment, batch aggregation, multi‑stream execution, and compiler extensions.

JD Retail Technology

Li Jian, an architect at JD Advertising, presented the team's recommendation-system architecture at DataFun Summit 2024, focusing on GPU throughput and low-latency optimization for JD's advertising business.

The JD advertising scenario serves millions of users at million-QPS traffic levels and requires millisecond-level responses; models have evolved from shallow DNNs to Transformer-based networks, with parameters scaling from hundreds of gigabytes to terabytes and demanding tens of times more compute.

Key challenges identified were: (1) high sparsity of CTR models causing I/O bottlenecks, (2) massive sparse parameters exceeding GPU memory limits, and (3) heavy CPU usage for feature computation competing with GPU resources.

Training Optimizations addressed three fronts. Storage: multi-node, multi-GPU training with a CPU-based sparse parameter server extends capacity beyond GPU memory. Compute: a heterogeneous pipeline separates CPU-intensive feature networks from GPU-intensive model networks. I/O: a GPU-HBM parameter server acts as a first-level cache, and the data-loading, feature-extraction, and training steps are pipeline-parallelized.
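The pipeline-parallel idea can be illustrated with a minimal, framework-free sketch (not JD's actual implementation): each stage runs in its own thread and hands batches to the next stage through a bounded queue, so the trainer works on batch i while the extractor already processes batch i+1. The stage functions `load_fn`, `extract_fn`, and `train_fn` are hypothetical placeholders.

```python
import queue
import threading

def run_pipeline(batches, load_fn, extract_fn, train_fn, depth=4):
    """Overlap data loading, feature extraction, and training.

    Each stage runs in its own thread; bounded queues of size `depth`
    provide backpressure between stages. A `None` sentinel signals
    end-of-stream to the downstream stage.
    """
    q_load = queue.Queue(maxsize=depth)
    q_feat = queue.Queue(maxsize=depth)
    results = []

    def loader():
        for b in batches:
            q_load.put(load_fn(b))
        q_load.put(None)  # no more batches

    def extractor():
        while (item := q_load.get()) is not None:
            q_feat.put(extract_fn(item))
        q_feat.put(None)

    def trainer():
        while (item := q_feat.get()) is not None:
            results.append(train_fn(item))

    threads = [threading.Thread(target=f) for f in (loader, extractor, trainer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In the real system the trainer stage would issue GPU work, so CPU-bound loading and feature extraction overlap with GPU compute rather than serializing in front of it.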

Parameter updates use All‑to‑All for embeddings and NVLink/IB AllReduce for dense weights, while a fused Adam optimizer reduces memory accesses.
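The benefit of a fused optimizer is that the moment updates and the parameter write happen in one pass, instead of materializing intermediate tensors that each cost a full read/write of parameter-sized memory. Below is a plain-Python sketch of the standard Adam update written in that single-pass style; JD's version would be a fused CUDA kernel, and the function name and loop structure here are illustrative only.

```python
import math

def fused_adam_step(params, grads, m, v, t, lr=1e-3,
                    beta1=0.9, beta2=0.999, eps=1e-8):
    """Single-pass Adam step: for each parameter, update both moment
    estimates, apply bias correction, and write the new value in one
    traversal -- the access pattern a fused GPU kernel would use."""
    bc1 = 1.0 - beta1 ** t  # bias-correction terms
    bc2 = 1.0 - beta2 ** t
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1.0 - beta1) * g          # 1st moment
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g      # 2nd moment
        m_hat = m[i] / bc1
        v_hat = v[i] / bc2
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params
```

An unfused implementation would compute `m`, `v`, `m_hat`, `v_hat`, and the update as separate tensor operations, multiplying the number of global-memory round trips per step.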

Inference Optimizations tackled three problems: variable request queue lengths, low‑latency high‑concurrency demands, and complex multi‑behavior models. Solutions included the TensorBatch scheme that dynamically balances batch size and computation cost, multi‑stream execution by extending TensorFlow's device layer with multiple CUDA streams and contexts, and leveraging NVIDIA MPS to minimize context‑switch overhead.
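The batch-aggregation trade-off behind a scheme like TensorBatch can be sketched as a greedy packer: variable-length requests are merged into batches whose total work stays under a compute budget, so large batches amortize kernel-launch overhead without any single batch blowing past the latency target. This is a simplified stand-in, not the described TensorBatch algorithm; `max_items` is a hypothetical budget parameter.

```python
def tensor_batch(requests, max_items):
    """Greedily pack (request_id, num_items) pairs into batches whose
    total item count stays within `max_items`. A request larger than
    the budget still gets its own dedicated batch."""
    batches, current, load = [], [], 0
    for rid, n in requests:
        if current and load + n > max_items:
            batches.append(current)  # budget exceeded: flush batch
            current, load = [], 0
        current.append(rid)
        load += n
    if current:
        batches.append(current)
    return batches
```

A production scheduler would additionally weigh per-batch computation cost and queue wait time, rebalancing batch size dynamically as traffic shifts.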

To overcome limitations of existing deep‑learning compilers, JD extended compiler capabilities with graph‑partitioned pre‑compilation (bucketed sub‑graphs) and asynchronous compilation for long‑tail traffic, enabling fast, low‑latency inference despite dynamic input dimensions.
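The bucketed pre-compilation idea can be sketched as follows (a conceptual model, not JD's compiler extension): sub-graphs are compiled ahead of time for a fixed set of batch-size buckets, a dynamic request is padded up to the nearest bucket so a cached graph is reused, and sizes beyond the largest bucket fall back to on-demand compilation, which in the real system runs asynchronously to absorb long-tail traffic. `compile_fn` is a hypothetical placeholder for the actual compile step.

```python
import bisect

class BucketedCompiler:
    """Cache compiled sub-graphs keyed by bucketed batch size."""

    def __init__(self, buckets, compile_fn):
        self.buckets = sorted(buckets)
        self.compile_fn = compile_fn
        # Pre-compile one graph per bucket ahead of serving time.
        self.cache = {b: compile_fn(b) for b in self.buckets}

    def lookup(self, batch_size):
        """Return (effective_size, compiled_graph) for a request.

        The request is padded up to the smallest bucket that fits it;
        long-tail sizes beyond the largest bucket trigger on-demand
        compilation (asynchronous in a real deployment).
        """
        i = bisect.bisect_left(self.buckets, batch_size)
        if i < len(self.buckets):
            b = self.buckets[i]
            return b, self.cache[b]
        self.cache.setdefault(batch_size, self.compile_fn(batch_size))
        return batch_size, self.cache[batch_size]
```

Padding wastes a little compute on the padded rows but avoids a recompile on every new input shape, which is the dominant cost when input dimensions vary per request.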

The presentation concluded that future work will focus on further I/O reductions, tensor‑parallel and pipeline‑parallel inference, and continued integration of sparse and dense modeling to meet growing model scale and performance requirements.

Tags: distributed systems · TensorFlow · recommendation systems · GPU optimization · inference · training · sparse models
Written by JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.