How JD’s 9N‑LLM Engine Powers Scalable Generative Recommendation at Billion‑Scale
This article details JD Retail’s 9N‑LLM unified training engine, explaining the background of generative recommendation, the challenges of massive sparse and dense parameters, and the multi‑framework, multi‑hardware solutions—including efficient sample processing, large‑scale sparse embedding, dense scaling, UniAttention acceleration, and reinforcement‑learning integration—that enable industrial‑scale deployment.
1. Background and Challenges
Traditional deep-learning recommendation models have hit ceilings in feature engineering, user-intent modeling, cascade error amplification, and compute utilization, capping further performance gains. The rise of large language models (LLMs) has sparked interest in generative recommendation (GR), which reframes recommendation as a sequence-generation task and offers a path past these bottlenecks. However, GR brings new demands: massive sample storage (terabytes to petabytes), heterogeneous hardware (GPU/NPU), and complex training pipelines spanning pre-training, supervised fine-tuning, and reinforcement learning.
2. 9N‑LLM Generative Recommendation Training Framework
JD’s 9N‑LLM is a unified training engine that supports TensorFlow and PyTorch, GPU and NPU, and both traditional and generative recommendation scenarios. It integrates a large‑scale sparse embedding engine, a custom UniAttention acceleration library, and a Ray‑based reinforcement‑learning (RL) framework, enabling end‑to‑end training of models with up to 10 TB of sparse parameters and billions of dense parameters.
3. Efficient Sample Engine
The sample engine is built on Ray Data, decoupling data loading from model computation so that data-processing nodes scale horizontally and independently of trainers. It vectorizes preprocessing operators and stores tokenized sequences in a KV system to cut storage costs. Samples are tracked at row granularity, enabling lossless checkpoint and resume.
Key techniques include:
Parquet columnar storage for high compression and SIMD‑accelerated decompression.
Chainable vectorized consumption: dataset.map().prefetch().take() over Arrow RecordBatch for fast batch processing (see the sketch after this list).
Vectorized DataLoader: batch‑level processing replaces per‑sample handling, reducing inter‑process data transfer.
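As a concrete illustration of this pattern, here is a minimal sketch of chained, vectorized consumption with Ray Data over local Parquet files. The ./samples/ path, the "token_ids" column, and the batch sizes are illustrative assumptions, not 9N-LLM's real schema or configuration.

```python
# A minimal sketch of chained, vectorized consumption, assuming Ray Data
# over local Parquet files. Path and column names are illustrative.
import numpy as np
import ray

ray.init(ignore_reinit_error=True)

# Columnar read: Parquet pages decode into Arrow buffers.
ds = ray.data.read_parquet("./samples/")

def pad_batch(batch: dict) -> dict:
    # Vectorized operator: pads a whole batch of variable-length token
    # sequences at once instead of touching samples one by one.
    seqs = batch["token_ids"]
    max_len = max(len(s) for s in seqs)
    batch["token_ids"] = np.stack(
        [np.pad(np.asarray(s), (0, max_len - len(s))) for s in seqs]
    )
    return batch

ds = ds.map_batches(pad_batch, batch_format="numpy", batch_size=1024)

# prefetch_batches overlaps decode/transform with downstream compute.
for batch in ds.iter_batches(batch_size=1024, prefetch_batches=2):
    pass  # hand the batch to the trainer here
```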
4. Large‑Scale Sparse Distributed Engine
To train TB-scale sparse embeddings, 9N-LLM employs a multi-level cache and a five-stage pipeline. Sparse parameters are sharded across nodes in a host-memory KV store, while dense parameters are replicated and synchronized via AllReduce. Embedding lookup, gradient aggregation, and updates use GPU Direct RDMA and a symmetric memory model, achieving 1.14-2.44× the throughput of open-source SOTA solutions.
The pipeline stages are:
Data Prefetch – sample read and parse.
Data H2D – copy parsed data to device memory.
Input Dist – bucket keys and distribute them via All2All (sketched after this list).
Emb Lookup – query embeddings on host, copy to device.
Fwd/Bwd/Opt – All2All sync of embeddings, forward/backward pass, gradient All2All, and write‑back to host.
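The sketch below shows the Input Dist, Emb Lookup, and All2All exchange steps in plain PyTorch, assuming torch.distributed is already initialized (with NCCL, all tensors live on the GPU) and a simple modulo sharding where key k is owned by rank k % world_size at row k // world_size. It illustrates the communication pattern only; it is not 9N-LLM's engine code.

```python
# Sketch of Input Dist -> Emb Lookup -> All2All, assuming modulo sharding.
import torch
import torch.distributed as dist

def sharded_lookup(keys: torch.Tensor, local_table: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()

    # Input Dist: bucket keys by owning rank, sort them into rank order.
    owner = keys % world_size
    order = torch.argsort(owner)
    keys_sorted = keys[order].contiguous()
    send_counts = torch.bincount(owner, minlength=world_size)

    # Exchange per-rank counts so every rank knows its receive sizes.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # All2All the query keys to their owning ranks.
    recv_keys = keys_sorted.new_empty(int(recv_counts.sum()))
    dist.all_to_all_single(
        recv_keys, keys_sorted,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )

    # Emb Lookup: answer the queries from the local shard.
    answers = local_table[recv_keys // world_size]

    # All2All the embedding rows back, then undo the sort.
    out_sorted = answers.new_empty(keys.numel(), local_table.shape[1])
    dist.all_to_all_single(
        out_sorted, answers,
        output_split_sizes=send_counts.tolist(),
        input_split_sizes=recv_counts.tolist(),
    )
    out = torch.empty_like(out_sorted)
    out[order] = out_sorted
    return out
```

In the real engine, the lookup step is answered from the host-memory KV store over GPU Direct RDMA; here a plain device tensor stands in for the local shard.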
5. Dense Scaling Engine
Leveraging PyTorch and LLM-style attention, 9N-LLM reuses proven LLM training optimizations, such as mixed-precision training and gradient accumulation, for its dense parameters. The accompanying benchmark table (omitted here) reports up to a 70% speed-up over FlexAttention and a 3× gain over a torch.compile baseline.
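For readers unfamiliar with the two named optimizations, here is a generic PyTorch loop combining mixed precision and gradient accumulation on a toy model. The model, data, and hyperparameters are stand-ins, not 9N-LLM code.

```python
# Generic mixed-precision + gradient-accumulation loop (toy model).
import torch

model = torch.nn.Linear(256, 256).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()   # loss scaling for fp16 stability
accum = 4                              # micro-batches per optimizer step

for step in range(16):
    x = torch.randn(32, 256, device="cuda")
    with torch.autocast("cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean() / accum  # pre-scale for accumulation
    scaler.scale(loss).backward()              # grads accumulate in fp32
    if (step + 1) % accum == 0:
        scaler.step(opt)
        scaler.update()
        opt.zero_grad(set_to_none=True)
```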
6. UniAttention Acceleration Library
Standard FlashAttention cannot handle the irregular masks that generative recommendation produces. UniAttention, built with Triton, CUTLASS, and TileLang, introduces Compute/Mask dual-interval scheduling and exploits Hopper's asynchronous wgmma instructions, yielding 70%-300% performance gains on complex mask patterns.
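To make "irregular mask" concrete, the sketch below builds a block-causal session mask in plain PyTorch, the kind of reference path a library like UniAttention is built to accelerate. The session layout is made up for illustration, not a 9N-LLM mask definition.

```python
# Illustrative only: an "irregular" GR-style mask in plain PyTorch.
import torch
import torch.nn.functional as F

seq_len, d = 8, 16
# Tokens attend causally, but only within their own user-session block.
session = torch.tensor([0, 0, 0, 1, 1, 1, 1, 1])   # two sessions
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
same_session = session.unsqueeze(0) == session.unsqueeze(1)
mask = causal & same_session   # block-causal: neither dense nor triangular

q, k, v = (torch.randn(1, 1, seq_len, d) for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

Masks like this are neither fully dense nor simply causal, which is exactly what breaks the assumptions baked into standard fused attention kernels.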
7. Reinforcement Learning Capability
After pre‑training and supervised fine‑tuning, RL is introduced to let the model explore better recommendation policies. 9N‑LLM builds the RL pipeline on Ray, providing:
Multi‑scenario compatibility (LLM, multimodal, generative recommendation, multi‑agent).
Flexible resource scheduling with both collocated and disaggregated modes (see the sketch after this list).
Highly customizable workers for plugging in bespoke data pipelines, models, and reward services.
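Here is a minimal Ray sketch of the disaggregated mode, with rollout generation and reward scoring as separate, independently scheduled actors. The class and method names are hypothetical, not 9N-LLM's worker API.

```python
# Disaggregated RL workers as Ray actors; names are hypothetical.
import ray

ray.init(ignore_reinit_error=True)

@ray.remote            # in production this actor would reserve GPUs
class RolloutWorker:
    def generate(self, user_context: str) -> str:
        return user_context + " -> candidate_items"  # stand-in for model.generate

@ray.remote            # reward service can live on CPU-only nodes
class RewardWorker:
    def score(self, rollout: str) -> float:
        return float(len(rollout))                   # stand-in reward model

rollout_actor = RolloutWorker.remote()
reward_actor = RewardWorker.remote()

ref = rollout_actor.generate.remote("user_history")
print(ray.get(reward_actor.score.remote(ref)))  # Ray resolves ref in place
```

In collocated mode, the same roles would instead be packed onto shared actors or a placement group to share GPUs; disaggregation trades that locality for independent scaling of each role.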
8. Conclusion
Generative recommendation opens new opportunities but demands infrastructure that scales linearly with compute and data. JD’s 9N‑LLM addresses sample efficiency, cross‑framework compatibility, massive sparse‑dense parameter handling, and RL integration, positioning it as a production‑ready solution for billion‑scale recommendation tasks.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.