How JD Retail Boosted Online Recommendation Inference with Distributed Heterogeneous Computing
This article details the deep‑compute optimizations that JD Retail's ad‑tech team made to overcome the challenges of high‑concurrency, low‑latency online recommendation. They cover a distributed graph‑based heterogeneous framework and a set of GPU‑focused inference engine enhancements: TensorBatch request aggregation, deep‑learning compiler bucket pre‑compilation, asynchronous compilation, and multi‑stream GPU processing.
1. Introduction
To meet the growing compute demands of increasingly complex recommendation algorithms, JD Retail's advertising technology team explored heterogeneous computing frameworks and high‑performance GPU inference optimizations for high‑concurrency, low‑latency online inference scenarios.
2. Distributed Graph Heterogeneous Computing Framework
The solution splits models into sparse (CPU) and dense (GPU) sub‑graphs, deploying them on differentiated hardware. CPU clusters handle large‑scale sparse models, while GPU clusters accelerate dense models and ultra‑long user‑behavior sequences, enabling a scalable online learning pipeline.
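The sparse/dense split can be illustrated with a minimal sketch. The `Op` type, the `SPARSE_KINDS` set, and the `partition` function are hypothetical names chosen for this example, not part of JD's framework; the sketch only shows the idea of assigning each operator to CPU or GPU and recording which tensors cross the boundary between the two clusters.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    kind: str          # e.g. "embedding_lookup", "matmul"
    inputs: tuple = () # names of the ops this op consumes


# Assumption: which op kinds count as "sparse" is illustrative only.
SPARSE_KINDS = {"embedding_lookup", "hash", "feature_parse"}


def partition(ops):
    """Split an op list into a CPU (sparse) and a GPU (dense) sub-graph.

    Returns (cpu_ops, gpu_ops, boundary), where boundary lists the tensors
    produced on the CPU side and consumed on the GPU side.
    """
    placement = {
        op.name: "cpu" if op.kind in SPARSE_KINDS else "gpu" for op in ops
    }
    cpu_ops = [op for op in ops if placement[op.name] == "cpu"]
    gpu_ops = [op for op in ops if placement[op.name] == "gpu"]
    boundary = sorted(
        {
            src
            for op in gpu_ops
            for src in op.inputs
            if placement.get(src) == "cpu"
        }
    )
    return cpu_ops, gpu_ops, boundary


# A toy recommendation model: two embedding lookups feeding a dense tower.
model = [
    Op("user_id_emb", "embedding_lookup"),
    Op("item_id_emb", "embedding_lookup"),
    Op("concat", "concat", ("user_id_emb", "item_id_emb")),
    Op("mlp", "matmul", ("concat",)),
]
cpu_ops, gpu_ops, boundary = partition(model)
```

In this toy model the two embedding lookups land on the CPU side, the dense tower lands on the GPU side, and `boundary` names the embedding outputs that would have to be shipped between clusters.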
3. High‑Performance Inference Engine
GPU kernel launch overhead was identified as a major bottleneck. By aggregating multiple inference requests into a single batch, the number of kernel launches can be dramatically reduced, improving throughput.
3.1 TensorBatch
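TensorBatch merges concurrent requests into a single batch, which lowers kernel launch frequency and doubles GPU utilization. For example, if each request triggers 1000 kernel launches, three requests served separately need 3000 launches, while the merged batch needs only 1000.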
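The launch arithmetic can be sketched with a toy model. `FakeModel`, `run_separately`, and `run_batched` are hypothetical names, and the fixed cost of 1000 launches per run is an assumption taken from the example figure; this is not the TensorBatch implementation, only the merge, run once, split pattern it relies on.

```python
class FakeModel:
    """Stand-in for a GPU model: each run() costs a fixed number of launches."""

    KERNELS_PER_RUN = 1000  # illustrative, independent of batch size

    def __init__(self):
        self.kernel_launches = 0

    def run(self, rows):
        self.kernel_launches += self.KERNELS_PER_RUN
        return [sum(row) for row in rows]  # toy "score" per input row


def run_separately(model, requests):
    """One model run per request."""
    return [model.run(req) for req in requests]


def run_batched(model, requests):
    """Merge all requests into one batch, run once, split the output."""
    sizes = [len(req) for req in requests]
    merged = [row for req in requests for row in req]
    scores = model.run(merged)
    out, start = [], 0
    for n in sizes:
        out.append(scores[start:start + n])
        start += n
    return out


# Three concurrent requests with 2, 1, and 3 candidate rows.
requests = [[[1, 2], [3, 4]], [[5, 6]], [[7, 8], [9, 10], [11, 12]]]
model_a, model_b = FakeModel(), FakeModel()
separate = run_separately(model_a, requests)
batched = run_batched(model_b, requests)
```

Both paths return the same per-request scores; only the number of simulated launches differs (3000 versus 1000).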
3.2 Deep Learning Compiler
Standard XLA compilation suffers from excessive runtime overhead and memory consumption due to variable‑length recommendation features. The team introduced a bucket‑based pre‑compilation technique that partitions XLA graphs, pads variable inputs, and pre‑compiles sub‑graphs, eliminating most runtime compilation.
3.2.1 Bucket Pre‑Compilation
By dividing the model into XLA sub‑graphs and applying bucket padding, the number of distinct XLA runtimes is reduced, solving both compilation time and memory issues.
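A minimal sketch of bucket padding, assuming a hypothetical bucket list and padding ID (the article does not state the actual values). `compile_for` is a stand-in for XLA compilation rather than a real XLA call; the point is that every input length maps to one of a few fixed shapes, so only that many runtimes need to be compiled ahead of time.

```python
from bisect import bisect_left

BUCKETS = [16, 32, 64, 128]  # hypothetical bucket sizes, sorted ascending
PAD_ID = 0                   # hypothetical padding value


def pick_bucket(length, buckets=BUCKETS):
    """Return the smallest bucket >= length, or None if out of range."""
    i = bisect_left(buckets, length)
    return buckets[i] if i < len(buckets) else None


def pad_to_bucket(seq, buckets=BUCKETS, pad_id=PAD_ID):
    """Pad seq up to its bucket size; returns (padded, bucket)."""
    bucket = pick_bucket(len(seq), buckets)
    if bucket is None:
        return None, None  # out-of-bucket: caller must handle this case
    return seq + [pad_id] * (bucket - len(seq)), bucket


def compile_for(bucket):
    """Stand-in for compiling a sub-graph specialised to one input shape."""
    def runtime(padded):
        assert len(padded) == bucket, "runtime is shape-specialised"
        return sum(padded)  # padding with 0 leaves the toy result unchanged
    return runtime


# Pre-compile one runtime per bucket before serving traffic.
precompiled = {b: compile_for(b) for b in BUCKETS}

padded, bucket = pad_to_bucket([3, 1, 4])
result = precompiled[bucket](padded)
```

With these four buckets, all input lengths from 1 to 128 share four pre-compiled runtimes instead of needing one per distinct length.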
3.2.2 Asynchronous Compilation
For rare out‑of‑bucket cases, the system falls back to the original graph while asynchronously compiling the needed XLA runtime for future requests.
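The fallback-then-compile pattern might look like the sketch below. `AsyncCompileCache`, `compile_fn`, and `fallback_fn` are hypothetical names for illustration, and a thread pool stands in for whatever background mechanism the team actually uses: the first request for an unseen shape is served by the fallback path while compilation is scheduled, and later requests use the compiled runtime.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class AsyncCompileCache:
    def __init__(self, compile_fn, fallback_fn):
        self._compile_fn = compile_fn    # slow: builds a runtime for a shape
        self._fallback_fn = fallback_fn  # always available, uncompiled path
        self._compiled = {}              # shape -> runtime
        self._pending = {}               # shape -> Future
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=1)

    def _compile(self, shape):
        runtime = self._compile_fn(shape)
        with self._lock:
            self._compiled[shape] = runtime
            self._pending.pop(shape, None)

    def run(self, shape, inputs):
        """Serve a request; returns (result, path_used)."""
        with self._lock:
            runtime = self._compiled.get(shape)
            if runtime is None and shape not in self._pending:
                # Schedule compilation once; do not block this request.
                self._pending[shape] = self._pool.submit(self._compile, shape)
        if runtime is not None:
            return runtime(inputs), "compiled"
        return self._fallback_fn(inputs), "fallback"

    def wait_idle(self):
        """Block until all scheduled compilations finish (for demos/tests)."""
        with self._lock:
            futures = list(self._pending.values())
        for f in futures:
            f.result()


cache = AsyncCompileCache(
    compile_fn=lambda shape: (lambda xs: sum(xs)),  # toy "compiled" runtime
    fallback_fn=lambda xs: sum(xs),                 # toy original graph
)
first = cache.run(300, [1, 2, 3])   # unseen shape: served by fallback
cache.wait_idle()
second = cache.run(300, [1, 2, 3])  # same shape: now served by the runtime
```

The request that triggers compilation never waits for it, which keeps tail latency bounded by the fallback path rather than by compile time.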
3.3 Multi‑Stream Computing
By default, TensorFlow issues GPU kernels through a single compute stream, so kernels from concurrent requests contend for scheduling. The team built a multi‑stream architecture in which each stream has its own CUDA context, so that kernels from multiple concurrent requests can execute in parallel.
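The dispatch side of such a design can be simulated in plain Python. This sketch calls no CUDA or TensorFlow API: integer stream IDs stand in for the per-stream contexts, and `StreamPool` and `serve` are hypothetical names. It only shows how concurrent requests could each borrow a dedicated stream, with at most `num_streams` requests in flight at once.

```python
import queue
import threading
from contextlib import contextmanager


class StreamPool:
    """Hands out stream IDs; a request holds one stream for its whole run."""

    def __init__(self, num_streams):
        self._free = queue.Queue()
        for stream_id in range(num_streams):
            self._free.put(stream_id)
        self._lock = threading.Lock()
        self._in_flight = 0
        self.peak_in_flight = 0

    @contextmanager
    def acquire(self):
        stream_id = self._free.get()  # blocks while all streams are busy
        with self._lock:
            self._in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self._in_flight)
        try:
            yield stream_id
        finally:
            with self._lock:
                self._in_flight -= 1
            self._free.put(stream_id)


def serve(pool, num_requests, work):
    """Run num_requests concurrently, each on a borrowed stream."""
    results = [None] * num_requests

    def handle(i):
        with pool.acquire() as stream_id:
            results[i] = (stream_id, work(i))

    threads = [
        threading.Thread(target=handle, args=(i,)) for i in range(num_requests)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


pool = StreamPool(num_streams=4)
results = serve(pool, num_requests=16, work=lambda i: i * i)
```

Each entry of `results` records which simulated stream served the request, and `pool.peak_in_flight` never exceeds the number of streams.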
4. Conclusion
By combining a distributed graph‑based heterogeneous framework with GPU‑focused inference engine optimizations—TensorBatch, deep‑learning compiler bucket pre‑compilation and async compilation, and a multi‑stream GPU architecture—JD Retail achieved significant reductions in latency and cost, scaling recommendation models to billions of parameters and delivering measurable CTR improvements across multiple business lines.
JD Cloud Developers
JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.
