Scaling Generative Recommendation: Inside JD’s 9N-LLM Multi‑Framework Training Engine

This article details JD Retail’s 9N-LLM unified training engine, which integrates TensorFlow and PyTorch across GPU and NPU hardware to tackle the massive data, model-size, and reinforcement-learning complexities of generative recommendation. It covers concrete components, performance benchmarks, and future directions.

JD Cloud Developers

Background and Challenges

Traditional deep-learning recommendation models have reached limits in feature engineering, user-intent modeling, cascade error amplification, and compute utilization, creating bottlenecks for further gains. The rise of large-language-model (LLM) generative techniques recasts recommendation as a sequence-generation problem, promising to break these limits. However, the paradigm brings its own demands: massive multi-modal samples, TB-scale sparse embeddings, billion-scale dense parameters, and a complex pretrain–finetune–RL pipeline that stresses both hardware and software stacks.

9N-LLM Generative Recommendation Training Framework

JD’s 9N‑LLM engine unifies TensorFlow and PyTorch, supports both GPU and NPU, and addresses the above challenges through a set of core components:

Sample Engine: Built on Ray Data, it decouples data preprocessing from model computation, vectorizes operators, stores tokenized sequences in a KV system, and provides row-level checkpointing for lossless recovery.

Large-Scale Distributed Sparse Embedding Engine: Implements a multi-level cache (Device/Host Memory), a five-stage pipeline, and All2All communication to support 10 TB-level sparse parameters at up to 40% MFU.

RL Training Module: Uses Ray-based distributed actors, SingleController and DistributeWorker abstractions, and flexible collocated/disaggregated deployment to handle custom reward logic and multi-objective optimization.

Hardware Adaptation: Switches between GPU and NPU with near-zero code changes.

Efficient Sample Engine

The engine reads columnar Parquet files, applies SIMD-based decompression, and offers a chainable API such as dataset.map().prefetch().take() for vectorized preprocessing. It also shifts the DataLoader from per-row to batch-level operation, moving collator work into map(fn) to cut inter-process traffic.
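
The chainable style described above can be sketched with a minimal pure-Python mimic. This is an illustrative toy, not the 9N-LLM or Ray Data API: class and method names are assumptions chosen to mirror the dataset.map().prefetch().take() chain.

```python
from collections import deque
from itertools import islice

class Dataset:
    """Toy chainable dataset mimicking the map/prefetch/take style
    described above (illustrative only, not the 9N-LLM API)."""

    def __init__(self, source):
        self._source = source  # any iterable of records/batches

    def map(self, fn):
        # Lazily apply a (vectorized) preprocessing function to each item.
        return Dataset(fn(item) for item in self._source)

    def prefetch(self, n):
        # Keep up to n items materialized ahead of the consumer.
        def gen():
            buf = deque()
            for item in self._source:
                buf.append(item)
                if len(buf) > n:
                    yield buf.popleft()
            while buf:
                yield buf.popleft()
        return Dataset(gen())

    def take(self, n):
        # Materialize only the first n items of the pipeline.
        return list(islice(self._source, n))

batches = Dataset(range(10)).map(lambda x: x * 2).prefetch(2).take(3)
print(batches)  # [0, 2, 4]
```

Every stage stays lazy until take() pulls from the chain, which is what lets a real engine overlap preprocessing with model computation.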

Figure 1: Traditional vs. Generative Recommendation

Dynamic Sample Flexibility

Dynamic feature concatenation via HBase allows on‑the‑fly merging of policy‑generated samples with item features, eliminating costly offline joins. Fixed‑Period User Event compression aggregates multiple exposures into a single feature, reducing storage and computation while preventing feature leakage.
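
The fixed-period compression idea can be illustrated with a small sketch. The bucketing key, field names, and period unit are assumptions for illustration; the production feature schema is not described in the article.

```python
from collections import defaultdict

def compress_exposures(events, period):
    """Aggregate repeated exposures of the same item within a fixed
    period into one feature (count + latest timestamp). Illustrative
    sketch of the compression idea; field names are assumptions."""
    buckets = defaultdict(lambda: {"count": 0, "last_ts": 0})
    for item_id, ts in events:
        key = (item_id, ts // period)  # bucket exposures by fixed period
        b = buckets[key]
        b["count"] += 1
        b["last_ts"] = max(b["last_ts"], ts)
    return dict(buckets)

# Three exposures of item 42 collapse into two period buckets.
events = [(42, 10), (42, 40), (42, 150), (7, 60)]
features = compress_exposures(events, period=100)
```

Because each bucket only aggregates events inside its own period, a label computed after a bucket closes never sees future exposures, which is how the compression avoids feature leakage.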

Figure 2: Dynamic Feature Concatenation

Lossless Checkpointing and Elastic Scaling

Row‑level identifiers track sample consumption, enabling fast skip of already processed data after failures. Dynamic sharding redistributes remaining samples across workers during elastic scaling, and sample checkpoints are synchronized with model checkpoints for seamless recovery.
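
A minimal sketch of the recovery path, assuming the checkpoint stores a set of consumed row identifiers (the real engine keeps these in its KV system alongside model checkpoints):

```python
def resume_stream(rows, consumed_ids):
    """Skip rows already recorded in the sample checkpoint, yielding
    only unprocessed rows after a failure (illustrative sketch)."""
    for row_id, payload in rows:
        if row_id in consumed_ids:
            continue  # already trained on before the crash
        yield row_id, payload

def reshard(remaining, num_workers):
    """Round-robin the remaining rows across workers, a stand-in for
    the dynamic sharding used during elastic scaling."""
    shards = [[] for _ in range(num_workers)]
    for i, row in enumerate(remaining):
        shards[i % num_workers].append(row)
    return shards

rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
remaining = list(resume_stream(rows, consumed_ids={1, 2}))
shards = reshard(remaining, num_workers=2)
print(remaining)  # [(3, 'c'), (4, 'd')]
```

Synchronizing the consumed-id set with model checkpoints is what makes the recovery lossless: neither a row is dropped nor trained on twice.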

Figure 3: Checkpointing Workflow

Large‑Scale Sparse Distributed Engine

Parameters are partitioned across hosts (Host Memory) and cached in Device Memory (HBM) using a multi‑level KV store. All2All communication distributes keys, retrieves embeddings, and synchronizes gradients, achieving 1.14‑2.44× speedup over open‑source baselines. Benchmarks show embedding lookup latency reductions from 13 ms to 5 ms for small batches and from 54 ms to 32 ms for larger configurations.
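
The device/host tiering can be sketched as an LRU cache in front of a host-memory store. Capacities, the eviction policy, and the class name are illustrative assumptions; the production engine adds sharding and All2All exchange on top of this lookup path.

```python
from collections import OrderedDict

class TwoTierEmbeddingCache:
    """Device-memory LRU cache backed by a host-memory store,
    sketching the multi-level KV lookup described above."""

    def __init__(self, host_store, device_capacity):
        self.host = host_store        # full parameter shard (host memory)
        self.device = OrderedDict()   # hot embeddings (HBM)
        self.capacity = device_capacity

    def lookup(self, key):
        if key in self.device:               # device hit: no host traffic
            self.device.move_to_end(key)
            return self.device[key]
        emb = self.host[key]                 # miss: fetch from host tier
        self.device[key] = emb
        if len(self.device) > self.capacity:
            self.device.popitem(last=False)  # evict least-recently-used
        return emb

host = {k: [float(k)] * 4 for k in range(100)}
cache = TwoTierEmbeddingCache(host, device_capacity=2)
vecs = [cache.lookup(k) for k in (5, 6, 5, 7)]  # second lookup of 5 is a device hit
```

Keeping the hottest keys in HBM is what turns most lookups into local reads, so only cold keys pay the host-memory (and, across hosts, All2All) cost.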

Figure 4: Distributed Hierarchical KV Architecture

UniAttention Acceleration Library

Standard FlashAttention kernels cannot handle the heterogeneous masks of generative recommendation. JD developed UniAttention using Triton, Cutlass, and TileLang, applying register-level optimizations and a compute/mask dual-interval scheduler. This yields 70%–300% performance gains over FlexAttention/torch.compile and fully exploits Hopper’s wgmma instructions.
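
The dual-interval idea can be illustrated by classifying attention tiles against a mask: fully masked tiles are skipped entirely, fully visible tiles run without any mask predicate, and only boundary tiles pay for per-element masking. This pure-Python loop is a conceptual stand-in; the real kernels do this with Triton/Cutlass tiles, not element loops.

```python
def classify_tiles(seq_len, tile, mask_fn):
    """Classify (q, k) tiles of an attention matrix as 'skip' (fully
    masked), 'full' (no mask needed), or 'partial' (mask per element)."""
    kinds = {}
    for q0 in range(0, seq_len, tile):
        for k0 in range(0, seq_len, tile):
            vals = [mask_fn(q, k)
                    for q in range(q0, min(q0 + tile, seq_len))
                    for k in range(k0, min(k0 + tile, seq_len))]
            if not any(vals):
                kinds[(q0, k0)] = "skip"     # no compute issued at all
            elif all(vals):
                kinds[(q0, k0)] = "full"     # compute without mask checks
            else:
                kinds[(q0, k0)] = "partial"  # compute with mask applied
    return kinds

def causal(q, k):
    return k <= q  # simple causal mask; real masks are heterogeneous

kinds = classify_tiles(seq_len=8, tile=4, mask_fn=causal)
```

Swapping in heterogeneous, per-sample masks only changes mask_fn; the scheduling structure that separates compute intervals from mask intervals stays the same.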

Figure 5: UniAttention Performance

Reinforcement Learning Training

The RL stage follows a pre‑train‑fine‑tune‑RL flow, but differs from LLM RL in sample format (sparse ID/SID), custom beam‑search for SID generation, and heavy sparse‑parameter synchronization. JD’s Ray‑based RL framework provides multi‑scene compatibility, flexible resource scheduling (collocated vs. disaggregated), and extensible worker pipelines to integrate custom reward services and feature‑concatenation modules.
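
The SingleController/DistributeWorker split can be sketched with plain objects. In the real framework the workers are Ray actors placed collocated or disaggregated across resources; the class names follow the article, but every method body here is a hypothetical stand-in.

```python
class RolloutWorker:
    """Generates candidate SIDs for a batch shard (stand-in for the
    custom beam search over semantic IDs)."""
    def rollout(self, batch):
        return [f"sid-{x}" for x in batch]

class RewardWorker:
    """Scores candidates via custom reward logic (toy reward)."""
    def score(self, candidates):
        return {c: len(c) for c in candidates}

class SingleController:
    """Drives one RL step across worker groups, mirroring the
    SingleController/DistributeWorker abstraction described above."""
    def __init__(self, rollout_workers, reward_workers):
        self.rollout_workers = rollout_workers
        self.reward_workers = reward_workers

    def step(self, batch):
        # Shard the batch across rollout workers, gather candidates,
        # then fan out to reward workers for scoring.
        n = len(self.rollout_workers)
        shards = [batch[i::n] for i in range(n)]
        candidates = [c for w, s in zip(self.rollout_workers, shards)
                      for c in w.rollout(s)]
        rewards = {}
        for w in self.reward_workers:
            rewards.update(w.score(candidates))
        return rewards

ctrl = SingleController([RolloutWorker(), RolloutWorker()], [RewardWorker()])
rewards = ctrl.step([1, 2, 3, 4])
```

Because the controller only sees worker interfaces, swapping a collocated deployment for a disaggregated one (or plugging in a reward service or feature-concatenation module) changes the placement, not the control flow.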

Figure 6: RL Training Pipeline

Conclusion

Generative recommendation reshapes the recommendation landscape, demanding AI‑infra that scales linearly with compute growth. JD’s 9N‑LLM demonstrates a holistic solution—spanning efficient data pipelines, multi‑framework and multi‑hardware support, high‑performance sparse/dense engines, and adaptable RL training—positioning it for future expansions in hardware scale, communication bandwidth, and intelligent reward modeling.

Written by

JD Cloud Developers

JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.
