How JD Retail Boosted Online Recommendation Inference with Distributed Heterogeneous Computing

This article details how JD Retail's ad‑tech team optimized online recommendation inference under high concurrency and tight latency budgets. It covers a distributed graph‑based heterogeneous computing framework and a set of GPU‑focused inference engine enhancements: TensorBatch request aggregation, bucket pre‑compilation and asynchronous compilation in the deep‑learning compiler, and multi‑stream GPU processing.

JD Cloud Developers

Mar 14, 2024

1. Introduction

To meet the growing compute demands of increasingly complex recommendation algorithms, JD Retail's advertising technology team explored heterogeneous computing frameworks and high‑performance GPU inference optimizations for high‑concurrency, low‑latency online inference scenarios.

2. Distributed Graph Heterogeneous Computing Framework

The solution splits models into sparse (CPU) and dense (GPU) sub‑graphs, deploying them on differentiated hardware. CPU clusters handle large‑scale sparse models, while GPU clusters accelerate dense models and ultra‑long user‑behavior sequences, enabling a scalable online learning pipeline.
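
As a rough illustration of the split described above, the following sketch assigns each operator of a toy graph to a CPU (sparse) or GPU (dense) sub‑graph and lists the edges that cross between them. The `Op` type, the `SPARSE_OPS` set, and the rule "embedding‑style ops go to CPU" are assumptions made for this example; the article does not describe the actual partitioning rule.

```python
from dataclasses import dataclass

# Hypothetical set of op kinds treated as "sparse"; the real rule is not given in the article.
SPARSE_OPS = {"embedding_lookup", "hash_bucket"}


@dataclass(frozen=True)
class Op:
    name: str
    kind: str
    inputs: tuple = ()


def partition(ops):
    """Split ops into CPU (sparse) and GPU (dense) name lists plus cross-device edges."""
    device = {op.name: ("cpu" if op.kind in SPARSE_OPS else "gpu") for op in ops}
    cpu = [op.name for op in ops if device[op.name] == "cpu"]
    gpu = [op.name for op in ops if device[op.name] == "gpu"]
    # An edge whose endpoints sit on different devices must be sent between clusters.
    cut = [
        (src, op.name)
        for op in ops
        for src in op.inputs
        if src in device and device[src] != device[op.name]
    ]
    return cpu, gpu, cut


# Toy model: two embedding lookups feeding a dense tower.
toy_model = [
    Op("user_emb", "embedding_lookup"),
    Op("item_emb", "embedding_lookup"),
    Op("concat", "concat", ("user_emb", "item_emb")),
    Op("mlp", "matmul", ("concat",)),
]
```

Calling `partition(toy_model)` places the two lookups on the CPU side and the dense tower on the GPU side; the cut edges are the tensors that would travel from the CPU cluster to the GPU cluster.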

3. High‑Performance Inference Engine

GPU kernel launch overhead was identified as a major bottleneck. By aggregating multiple inference requests into a single batch, the number of kernel launches can be dramatically reduced, improving throughput.

3.1 TensorBatch

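TensorBatch merges concurrent requests into one batch, so the model's kernels are launched once per batch rather than once per request. For example, three requests that would trigger 3,000 kernel launches when served separately need only about 1,000 once merged. The team reports that this doubled GPU utilization.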
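
The aggregation idea can be sketched in a few lines. This is not TensorBatch itself, only a minimal illustration of merging requests, running one forward pass, and splitting the results back out. The function names are hypothetical, and the figure of 1,000 kernels per forward pass is inferred from the article's 3,000‑to‑1,000 example.

```python
def merge_requests(requests):
    """Concatenate per-request rows into one batch and record each request's slice."""
    batch, slices, start = [], [], 0
    for rows in requests:
        batch.extend(rows)
        slices.append((start, start + len(rows)))
        start += len(rows)
    return batch, slices


def split_results(outputs, slices):
    """Cut the batched output back into one result list per original request."""
    return [outputs[begin:end] for begin, end in slices]


def run_batched(requests, model_fn):
    """Serve several requests with a single call to model_fn (one forward pass)."""
    batch, slices = merge_requests(requests)
    outputs = model_fn(batch)
    return split_results(outputs, slices)


def kernel_launches(forward_passes, kernels_per_pass):
    """Launch count grows with the number of forward passes, not with batch size."""
    return forward_passes * kernels_per_pass
```

Under the assumed 1,000 kernels per pass, `kernel_launches(3, 1000)` gives 3000 for three separate passes and `kernel_launches(1, 1000)` gives 1000 for one merged pass, which matches the example above.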

3.2 Deep Learning Compiler

Recommendation features vary in length, and XLA compiles a separate executable for each distinct input shape, so standard XLA compilation incurs excessive runtime compilation overhead and memory consumption. The team introduced a bucket‑based pre‑compilation technique that partitions XLA graphs, pads variable inputs, and pre‑compiles sub‑graphs, eliminating most runtime compilation.

3.2.1 Bucket Pre‑Compilation

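By dividing the model into XLA sub‑graphs and padding each variable‑length input up to the nearest bucket size, the number of distinct XLA runtimes is capped by the number of buckets instead of growing with every new input length, which addresses both the compilation‑time and memory issues.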
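
A minimal sketch of bucket padding, assuming a fixed, sorted list of bucket sizes: each input is padded up to the smallest bucket that fits it, so only one compiled shape per bucket is ever needed. The bucket sizes, the function names, and the use of a 0/1 mask are assumptions for illustration; the article does not list the team's actual buckets.

```python
import bisect

# Hypothetical bucket sizes; they must be sorted in ascending order.
BUCKETS = (8, 16, 32, 64)


def pick_bucket(length, buckets=BUCKETS):
    """Return the smallest bucket >= length, or None if the input is out of bucket."""
    index = bisect.bisect_left(buckets, length)
    return buckets[index] if index < len(buckets) else None


def pad_to_bucket(seq, buckets=BUCKETS, pad_value=0):
    """Pad seq to its bucket size and return (padded, mask); (None, None) if out of bucket."""
    bucket = pick_bucket(len(seq), buckets)
    if bucket is None:
        return None, None
    extra = bucket - len(seq)
    padded = list(seq) + [pad_value] * extra
    # The mask marks real positions with 1 so padded slots can be ignored downstream.
    mask = [1] * len(seq) + [0] * extra
    return padded, mask
```

With these buckets, inputs of length 9 through 16 all share the 16‑wide shape, so a single pre‑compiled sub‑graph serves all of them. Inputs longer than the largest bucket return `None`, which corresponds to the out‑of‑bucket case.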

3.2.2 Asynchronous Compilation

For rare out‑of‑bucket cases, the system falls back to the original graph while asynchronously compiling the needed XLA runtime for future requests.
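
The fallback‑then‑compile behaviour can be sketched with a background thread pool. This is an illustration under stated assumptions, not the team's implementation: `compile_fn` stands in for XLA compilation, `fallback_fn` stands in for running the original uncompiled graph, and the class and method names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


class AsyncCompileCache:
    """Serve unseen shapes via a fallback while compiling them in the background."""

    def __init__(self, compile_fn, fallback_fn, max_workers=1):
        self._compile_fn = compile_fn    # shape -> callable (stand-in for XLA compile)
        self._fallback_fn = fallback_fn  # inputs -> outputs (stand-in for original graph)
        self._compiled = {}
        self._pending = {}
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def _compile_and_store(self, shape):
        try:
            fn = self._compile_fn(shape)  # slow step, runs without holding the lock
            with self._lock:
                self._compiled[shape] = fn
        finally:
            with self._lock:
                self._pending.pop(shape, None)

    def run(self, shape, inputs):
        """Return (outputs, path) where path is 'compiled' or 'fallback'."""
        with self._lock:
            fn = self._compiled.get(shape)
            if fn is None and shape not in self._pending:
                # Schedule at most one background compilation per shape.
                self._pending[shape] = self._pool.submit(self._compile_and_store, shape)
        if fn is not None:
            return fn(inputs), "compiled"
        return self._fallback_fn(inputs), "fallback"

    def wait_idle(self):
        """Block until in-flight compilations finish (handy for tests)."""
        with self._lock:
            pending = list(self._pending.values())
        wait(pending)
```

The first request for an unseen shape is answered on the fallback path without waiting for compilation; later requests for the same shape take the compiled path once the background job has stored its result.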

3.3 Multi‑Stream Computing

By default, TensorFlow issues all compute kernels for a GPU on a single stream, so kernels from concurrent requests contend for scheduling and end up running one after another. The team built a multi‑stream architecture in which each stream has its own CUDA context, so kernels from multiple concurrent requests can execute in parallel.
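
The scheduling side of a multi‑stream design can be modelled as a pool of stream handles that requests lease and return. The sketch below covers only that bookkeeping: the handles are plain integers rather than real CUDA streams or contexts, and `StreamPool`, `lease`, and `handle_request` are hypothetical names.

```python
import queue
from contextlib import contextmanager


class StreamPool:
    """Hand out one free stream handle per in-flight request."""

    def __init__(self, num_streams):
        self._free = queue.Queue()
        for stream_id in range(num_streams):
            self._free.put(stream_id)

    @contextmanager
    def lease(self):
        stream_id = self._free.get()  # blocks while every stream is busy
        try:
            yield stream_id
        finally:
            self._free.put(stream_id)  # return the handle even if the request failed


def handle_request(pool, request, run_on_stream):
    """Run one request on whichever stream is free; run_on_stream is caller-supplied."""
    with pool.lease() as stream_id:
        return run_on_stream(stream_id, request)
```

Because each in‑flight request holds a distinct handle, two concurrent requests never share a stream, which is the property that lets their kernels overlap on the device.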

4. Conclusion

By combining a distributed graph‑based heterogeneous framework with GPU‑focused inference engine optimizations (TensorBatch, bucket pre‑compilation and asynchronous compilation in the deep‑learning compiler, and a multi‑stream GPU architecture), JD Retail achieved significant reductions in latency and cost, scaled recommendation models to billions of parameters, and delivered measurable CTR improvements across multiple business lines.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
