How JD's 9N‑LLM Engine Powers Generative Recommendation at Massive Scale
This article details JD Retail's 9N‑LLM unified training framework, which tackles the massive data volumes, hardware heterogeneity, and algorithmic challenges of generative recommendation by integrating TensorFlow and PyTorch, supporting both GPU and NPU, and delivering high‑throughput sample processing, sparse/dense optimization, and flexible reinforcement‑learning capabilities.
Introduction
Generative recommendation is an emerging paradigm that recasts recommendation as a sequence‑generation task. It offers greater diversity and breaks through the performance ceilings of traditional models, but it also imposes new requirements on training infrastructure.
Background and Challenges
Traditional deep‑learning recommenders are limited by hand‑crafted feature engineering, weak user‑intent modeling, error amplification across cascade architectures, and low compute utilization. Scaling up LLM‑style generative models introduces its own challenges: massive sample volumes (terabytes to petabytes), heterogeneous hardware (GPU/NPU), cross‑framework integration (TensorFlow vs. PyTorch), and complex reinforcement‑learning pipelines.
9N‑LLM Unified Training Engine
The JD Retail 9N‑LLM engine unifies TensorFlow and PyTorch, supports GPU and NPU, and handles both traditional and generative recommendation scenarios. Core components include a large‑scale sparse embedding engine, a custom UniAttention library, and a Ray‑based RL training framework, enabling end‑to‑end training of models with up to 10 TB of sparse parameters and 10 billion dense parameters.
Efficient Sample Engine
Built on Ray Data, the sample engine decouples data preprocessing from model computation. It uses column‑store Parquet files, vectorized pipelines (dataset.map().prefetch().take()), and dynamic feature stitching via HBase to reduce storage and improve I/O throughput, and it provides row‑level checkpointing for lossless resumption during elastic scaling.
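The sketch below illustrates this decoupled pattern with plain Ray Data; the Parquet path, column names, and the feature‑lookup helper are hypothetical stand‑ins rather than the 9N‑LLM implementation.

```python
import numpy as np
import ray

def lookup_profile_store(user_ids):
    """Placeholder for the dynamic feature stitch (HBase in the article)."""
    return np.zeros((len(user_ids), 8), dtype=np.float32)

ray.init()

# Column-store Parquet: only the columns the model needs are scanned.
ds = ray.data.read_parquet(
    "hdfs://samples/day=2024-06-01/",            # hypothetical path
    columns=["user_id", "item_seq", "label"],    # hypothetical schema
)

def stitch_features(batch):
    # Late-bound features are joined at read time instead of being
    # materialized into every stored sample, which keeps storage small.
    batch["user_profile"] = lookup_profile_store(batch["user_id"])
    return batch

ds = ds.map_batches(stitch_features, batch_format="numpy")

# Vectorized, prefetched iteration: CPU workers prepare upcoming batches
# while the trainer consumes the current one.
for batch in ds.iter_batches(batch_size=4096, prefetch_batches=4):
    pass  # hand the ready batch to the training step here
```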
Large‑Scale Sparse Distributed Engine
A five‑stage pipeline (Data Prefetch → Data H2D → Input Dist → Embedding Lookup → Fwd/Bwd/Opt), combined with a multi‑level MEM‑HBM cache and All‑to‑All communication, enables 10 TB‑scale sparse embedding training at 1.14‑2.44× the performance of open‑source state‑of‑the‑art systems. The engine also leverages GPU Direct RDMA, symmetric memory, and custom attention kernels that achieve speedups of up to 70% to 3× over FlexAttention/Compile.
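As a rough illustration of how adjacent pipeline stages overlap, the sketch below runs the Data H2D stage on a dedicated CUDA copy stream in PyTorch while the default stream handles forward/backward; the five‑stage scheduler, MEM‑HBM cache, and All‑to‑All exchange of the real engine are not reproduced, and the function names and batch keys are assumptions.

```python
import torch

# Side stream so host-to-device copies overlap with compute.
copy_stream = torch.cuda.Stream()

def stage_h2d(batch_cpu):
    """Data H2D stage: async copy; tensors should sit in pinned host memory."""
    with torch.cuda.stream(copy_stream):
        return {k: v.to("cuda", non_blocking=True) for k, v in batch_cpu.items()}

def stage_fwd_bwd_opt(model, optimizer, batch_gpu):
    """Fwd/Bwd/Opt stage: wait for the copy, then compute on the default stream."""
    torch.cuda.current_stream().wait_stream(copy_stream)
    loss = model(batch_gpu["dense"]).mean()   # "dense" is a hypothetical key
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```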
Large‑Scale Dense Compute Engine
Dense parameters are synchronized via AllReduce while sparse parameters use All‑to‑All. The MEM‑HBM hierarchical KV store stores shards of embeddings in host memory and caches frequently accessed vectors in device memory, reducing latency and enabling efficient scaling.
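A minimal sketch of these two synchronization patterns with plain torch.distributed collectives follows; the process‑group setup (e.g. torchrun with NCCL), tensor shapes, and the equal‑split assumption are illustrative rather than taken from 9N‑LLM.

```python
import torch
import torch.distributed as dist

def sync_dense_grads(model):
    """Dense parameters: average gradients across ranks with AllReduce."""
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world)

def exchange_sparse_rows(local_rows: torch.Tensor) -> torch.Tensor:
    """Sparse embeddings: each rank owns a shard of the table, so looked-up
    rows are redistributed with All-to-All to the ranks that need them.
    Assumes equal-sized send/receive splits per rank for simplicity."""
    out = torch.empty_like(local_rows)
    dist.all_to_all_single(out, local_rows)
    return out
```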
UniAttention Acceleration Library
Built with Triton, Cutlass, and Tilelang, UniAttention supports the mixed‑mask attention patterns typical of generative recommendation (multiple segment masks, custom beam search). It employs register‑level optimizations and a compute/mask dual‑interval scheduler, delivering a 70%–300% speedup over FlexAttention/Compile.
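UniAttention itself is not public, but the kind of mixed‑mask pattern it accelerates (per‑segment causal attention over concatenated user sequences) can be expressed with the open‑source FlexAttention baseline mentioned above; the segment layout below is an illustrative assumption.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 4, 1024, 64
# Assumed layout: four concatenated segments of 256 tokens each.
segment_ids = torch.arange(S, device="cuda") // 256

def mixed_mask(b, h, q_idx, kv_idx):
    # Attend causally, and only within the same segment (no cross-request leakage).
    return (segment_ids[q_idx] == segment_ids[kv_idx]) & (q_idx >= kv_idx)

block_mask = create_block_mask(mixed_mask, B=None, H=None, Q_LEN=S, KV_LEN=S)

q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
           for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)
```

Per the article, UniAttention's advantage comes not from the mask definition itself but from how its kernels schedule compute and mask intervals at the register level.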
Reinforcement Learning Training
The RL stage follows a pre‑train → supervised fine‑tune → RL pipeline. 9N‑LLM uses Ray to orchestrate actors in collocated or disaggregated modes and supports dynamic sharding, custom reward models, and lossless checkpointing, handling TB‑scale sparse parameters and complex reward calculations.
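The sketch below shows the orchestration idea with plain Ray actors in a disaggregated layout (policy and reward model on separate workers); the class names, GPU placement, and reward logic are placeholders, not the 9N‑LLM implementation.

```python
import ray

ray.init()

@ray.remote(num_gpus=1)   # disaggregated mode: each role requests its own GPUs
class PolicyWorker:
    def generate(self, prompts):
        # Roll out the generative recommender to produce candidate item sequences.
        return [p + " -> <item_1, item_2>" for p in prompts]   # placeholder rollout

    def update(self, rollouts, rewards):
        # Apply a policy-gradient style update from the scored rollouts.
        return {"loss": 0.0}                                    # placeholder metrics

@ray.remote(num_gpus=1)
class RewardWorker:
    def score(self, rollouts):
        # Custom reward model or rule-based reward calculation.
        return [float(len(r)) for r in rollouts]                # placeholder scores

policy, reward = PolicyWorker.remote(), RewardWorker.remote()
rollouts = ray.get(policy.generate.remote(["user_1 history", "user_2 history"]))
rewards = ray.get(reward.score.remote(rollouts))
metrics = ray.get(policy.update.remote(rollouts, rewards))
```

In collocated mode, the same roles would share a worker and its accelerators rather than being placed on separate ones.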
Conclusion
9N‑LLM demonstrates that a unified, hardware‑agnostic training stack can meet the demanding requirements of industrial‑scale generative recommendation, offering high‑throughput sample processing, efficient sparse/dense computation, and flexible RL capabilities.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.