How JD’s 9N‑LLM Engine Powers Scalable Generative Recommendation at Billion‑Scale

This article details JD Retail’s 9N‑LLM unified training engine, explaining the background of generative recommendation, the challenges of massive sparse and dense parameters, and the multi‑framework, multi‑hardware solutions—including efficient sample processing, large‑scale sparse embedding, dense scaling, UniAttention acceleration, and reinforcement‑learning integration—that enable industrial‑scale deployment.


1. Background and Challenges

Traditional deep‑learning recommendation models face bottlenecks in feature engineering, user‑intent modeling, and compute utilization, and suffer error amplification across cascaded stages, limiting further performance gains. The rise of large language models (LLMs) has sparked interest in generative recommendation (GR), which reframes recommendation as a sequence‑generation task and offers a path past these bottlenecks. However, GR introduces new demands: massive sample storage (TB to PB), heterogeneous hardware (GPU/NPU), and complex training pipelines (pre‑training, supervised fine‑tuning, reinforcement learning).

2. 9N‑LLM Generative Recommendation Training Framework

JD’s 9N‑LLM is a unified training engine that supports TensorFlow and PyTorch, GPU and NPU, and both traditional and generative recommendation scenarios. It integrates a large‑scale sparse embedding engine, a custom UniAttention acceleration library, and a Ray‑based reinforcement‑learning (RL) framework, enabling end‑to‑end training of models with up to 10 TB of sparse parameters and billions of dense parameters.

Traditional vs. Generative Recommendation Paradigm

3. Efficient Sample Engine

The sample engine is built on Ray Data to decouple data loading from model computation, letting loader nodes scale out horizontally. It vectorizes preprocessing operators and stores tokenized sequences in a KV system to cut storage costs. Sample handling is row‑granular, enabling lossless checkpoint‑and‑resume.

Key techniques include:

Parquet columnar storage for high compression and SIMD‑accelerated decompression.

Chainable vectorized consumption: dataset.map().prefetch().take() with Arrow RecordBatch for fast processing (see the sketch after this list).

Vectorized DataLoader: batch‑level processing replaces per‑sample handling, reducing inter‑process data transfer.
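
To make the list concrete, here is a minimal sketch of such a pipeline using public Ray Data primitives (read_parquet, map_batches over Arrow tables, prefetched batch iteration), assuming a recent Ray version. The bucket path, column name, and the toy "tokenizer" are illustrative assumptions, and the chained dataset.map().prefetch().take() interface quoted above may map onto these primitives differently inside 9N‑LLM.

```python
# A minimal sketch of a decoupled, vectorized sample pipeline on Ray Data.
# The bucket path, column name, and the stand-in "tokenizer" below are
# illustrative assumptions, not 9N-LLM's actual operators.
import pyarrow as pa
import pyarrow.compute as pc
import ray

def tokenize_batch(batch: pa.Table) -> pa.Table:
    # Vectorized preprocessing: operate on whole Arrow columns, never per row.
    lengths = pc.utf8_length(batch["item_sequence"])  # stand-in for tokenization
    return batch.append_column("seq_len", lengths)

ds = (
    ray.data.read_parquet("s3://bucket/samples/")      # columnar, SIMD-friendly reads
    .map_batches(tokenize_batch, batch_format="pyarrow")
)

# Batch-granular, prefetched iteration overlaps I/O with model computation.
for batch in ds.iter_batches(batch_size=4096, prefetch_batches=2):
    pass  # hand the batch to the trainer
```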

4. Large‑Scale Sparse Distributed Engine

To train TB‑scale sparse embeddings, 9N‑LLM employs a multi‑level cache and a five‑stage pipeline. Sparse parameters are sharded across nodes (Host Memory KV store) while dense parameters are replicated via AllReduce. Embedding lookup, gradient aggregation, and updates use GPU Direct RDMA and a symmetric memory model, achieving 1.14‑2.44× the performance of open‑source SOTA solutions.

Distributed Hierarchical KV Storage
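
To make the hierarchy concrete, below is a minimal sketch of a two‑tier embedding store: an LRU "GPU" cache in front of a host‑memory KV shard. The class, its capacity model, and the write‑back‑on‑evict policy are illustrative assumptions; the production engine adds sharding, asynchrony, and the GPU Direct RDMA path described above.

```python
# A minimal two-tier embedding store: LRU "GPU" cache over a host-memory KV
# shard. Capacity counting and eviction policy are illustrative assumptions.
from collections import OrderedDict
import numpy as np

class TieredEmbeddingStore:
    def __init__(self, dim: int, gpu_capacity: int):
        self.dim = dim
        self.gpu_capacity = gpu_capacity
        self.gpu_cache = OrderedDict()   # hot rows; device-resident in practice
        self.host_kv = {}                # cold rows; one shard of the host KV store

    def lookup(self, key: int) -> np.ndarray:
        if key in self.gpu_cache:
            self.gpu_cache.move_to_end(key)          # hit: mark most recently used
            return self.gpu_cache[key]
        # Miss: fetch from the host tier (or initialize a fresh embedding row).
        vec = self.host_kv.pop(key, None)
        if vec is None:
            vec = np.zeros(self.dim, dtype=np.float32)
        self.gpu_cache[key] = vec
        if len(self.gpu_cache) > self.gpu_capacity:
            evicted_key, evicted_vec = self.gpu_cache.popitem(last=False)  # evict LRU
            self.host_kv[evicted_key] = evicted_vec   # write back to the host tier
        return vec
```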

The pipeline stages are as follows (a schematic sketch appears after the list):

Data Prefetch – sample read and parse.

Data H2D – copy parsed data to device memory.

Input Dist – bucket keys and distribute them across ranks via All2All.

Emb Lookup – query embeddings on host, copy to device.

Fwd/Bwd/Opt – All2All sync of embeddings, forward/backward pass, gradient All2All, and write‑back to host.
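
The sketch below shows the shape of this pipeline with thread‑based overlap and stub stage functions. It is schematic only: the real engine overlaps stages on CUDA streams and moves data with All2All collectives and GPU Direct RDMA, none of which appear here.

```python
# Schematic five-stage pipeline with thread-based overlap; all five stage
# functions are stubs standing in for the real CUDA-stream implementation.
import queue
import threading

def parse(sample):            return {"keys": sample}   # stage 1: Data Prefetch
def h2d_copy(parsed):         return parsed             # stage 2: Data H2D
def bucket_and_all2all(dev):  return dev                # stage 3: Input Dist
def embedding_lookup(bkt):    return bkt                # stage 4: Emb Lookup
def fwd_bwd_opt(emb):         pass                      # stage 5: Fwd/Bwd/Opt

def run_pipeline(samples):
    q = queue.Queue(maxsize=4)          # bounded queue provides backpressure
    def prefetch():                     # stage 1 runs ahead of compute
        for s in samples:
            q.put(parse(s))
        q.put(None)                     # sentinel: end of stream
    threading.Thread(target=prefetch, daemon=True).start()
    while (item := q.get()) is not None:
        emb = embedding_lookup(bucket_and_all2all(h2d_copy(item)))
        fwd_bwd_opt(emb)                # overlaps with the next batch's prefetch

run_pipeline(range(8))
```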

5. Dense Scaling Engine

Leveraging PyTorch and LLM‑style attention, 9N‑LLM reuses proven LLM training optimizations for its dense parameters, such as mixed precision and gradient accumulation. Benchmark data (table omitted here) show up to a 70% speed‑up over FlexAttention and a 3× gain over Compile.
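
For reference, the two named optimizations look roughly like this in plain PyTorch; the model, data, and hyperparameters are placeholders, not 9N‑LLM's dense tower, and the snippet assumes a CUDA device.

```python
# Mixed precision (bfloat16 autocast) plus gradient accumulation in plain
# PyTorch; model, data, and hyperparameters are placeholders. Assumes CUDA.
import torch

model = torch.nn.Linear(512, 512).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
ACCUM_STEPS = 8   # effective batch = micro-batch x ACCUM_STEPS

for step in range(64):
    x = torch.randn(32, 512, device="cuda")
    with torch.autocast("cuda", dtype=torch.bfloat16):   # mixed precision
        loss = model(x).pow(2).mean() / ACCUM_STEPS      # scale per micro-batch
    loss.backward()                                      # grads accumulate in fp32
    if (step + 1) % ACCUM_STEPS == 0:
        opt.step()                                       # one update per 8 micro-batches
        opt.zero_grad(set_to_none=True)
```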

6. UniAttention Acceleration Library

Standard FlashAttention cannot handle the irregular masks that arise in generative recommendation. UniAttention, built with Triton, Cutlass, and TileLang, introduces Compute/Mask dual‑interval scheduling and exploits Hopper's asynchronous wgmma instructions, yielding 70%‑300% performance gains on complex mask patterns.
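
To convey the interval idea, here is a conceptual PyTorch sketch in which each query's valid keys form an interval [lo, hi): key blocks entirely outside the interval are skipped (the compute interval), and only boundary columns receive per‑row masking (the mask interval). This is a reference illustration of the scheduling concept under an interval‑mask assumption, not the Triton/Cutlass kernels themselves.

```python
# Conceptual dual-interval attention: skip key blocks outside each query
# block's interval, mask only boundary columns. Assumes hi > lo per row.
import torch

def interval_attention(q, k, v, lo, hi, block=64):
    """q, k, v: [T, d]; lo, hi: [T] int tensors giving each query's valid keys."""
    T, d = q.shape
    out = torch.empty_like(q)
    for qs in range(0, T, block):
        qe = min(qs + block, T)
        blo = int(lo[qs:qe].min())     # compute interval for this query block:
        bhi = int(hi[qs:qe].max())     # keys outside [blo, bhi) are never touched
        s = q[qs:qe] @ k[blo:bhi].T / d ** 0.5
        cols = torch.arange(blo, bhi)
        keep = (cols >= lo[qs:qe, None]) & (cols < hi[qs:qe, None])
        s = s.masked_fill(~keep, float("-inf"))   # mask interval: boundary columns
        out[qs:qe] = torch.softmax(s, dim=-1) @ v[blo:bhi]
    return out

T, d = 256, 64
q, k, v = (torch.randn(T, d) for _ in range(3))
lo = torch.zeros(T, dtype=torch.long)
hi = torch.arange(1, T + 1)            # a causal mask expressed as intervals
y = interval_attention(q, k, v, lo, hi)
```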

UniAttention Performance

7. Reinforcement Learning Capability

After pre‑training and supervised fine‑tuning, RL is introduced so the model can explore better recommendation policies. 9N‑LLM builds its RL pipeline on Ray, providing the following (a minimal worker sketch appears after the list):

Multi‑scenario compatibility (LLM, multimodal, generative recommendation, multi‑agent).

Flexible resource scheduling with collocated and disaggregated modes.

Highly customizable workers for custom data, models, and reward services.
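
The sketch below shows the pluggable‑worker idea on Ray. The RolloutWorker and RewardService actors and their interfaces are illustrative assumptions, not 9N‑LLM's actual worker API; the actors here run disaggregated as separate processes, whereas a collocated deployment would pack them onto shared resources (e.g., via Ray placement groups).

```python
# Pluggable RL workers as Ray actors; the actor names, stub policy, and
# stub reward are illustrative assumptions, not 9N-LLM's worker API.
import random
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class RewardService:                      # swappable: rule-based, model-based, ...
    def score(self, response: str) -> float:
        return len(response) / 100.0      # placeholder reward signal

@ray.remote
class RolloutWorker:                      # generates candidate recommendations
    def __init__(self, reward):
        self.reward = reward              # handle to a remote reward service
    def rollout(self, prompt: str):
        response = f"{prompt}-item{random.randint(0, 9)}"   # stub policy
        return response, ray.get(self.reward.score.remote(response))

reward = RewardService.remote()
workers = [RolloutWorker.remote(reward) for _ in range(4)]  # disaggregated actors
results = ray.get([w.rollout.remote("user42") for w in workers])
```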

Ray‑Based RL Framework

8. Conclusion

Generative recommendation opens new opportunities but demands infrastructure that scales linearly with compute and data. JD’s 9N‑LLM addresses sample efficiency, cross‑framework compatibility, massive sparse‑dense parameter handling, and RL integration, positioning it as a production‑ready solution for billion‑scale recommendation tasks.

Tags: AI Infrastructure · Sparse Embedding · Generative Recommendation · Large-Scale Training
Written by JD Tech Talk, the official JD Tech public account delivering best practices and technology innovation.
