Scaling Generative Recommendation: Inside JD’s 9N-LLM Multi‑Framework Training Engine
This article details JD Retail’s 9N-LLM unified training engine, which integrates TensorFlow and PyTorch across GPU and NPU hardware to tackle the massive data volumes, model sizes, and reinforcement-learning complexity of generative recommendation. It walks through the engine's core components, performance benchmarks, and future directions.
Background and Challenges
Traditional deep-learning recommendation models are hitting their limits: feature engineering has plateaued, user intent modeling remains shallow, cascaded ranking stages amplify errors, and compute utilization is poor, creating bottlenecks for further gains. The rise of large-language-model (LLM) generative techniques recasts recommendation as a sequence-generation problem, promising to break through these limits. However, the new paradigm introduces new requirements: massive multi-modal samples, TB-scale sparse embeddings, billion-scale dense parameters, and a complex pre-train, fine-tune, and RL pipeline that stresses both hardware and software stacks.
9N-LLM Generative Recommendation Training Framework
JD’s 9N‑LLM engine unifies TensorFlow and PyTorch, supports both GPU and NPU, and addresses the above challenges through a set of core components:
Sample Engine: Built on Ray Data, it decouples data preprocessing from model computation, vectorizes operators, stores tokenized sequences in a KV system, and provides row-level checkpointing for lossless recovery.
Large-Scale Distributed Sparse Embedding Engine: Implements a multi-level cache (Device/Host memory), a five-stage pipeline, and All2All communication to support 10 TB-level sparse parameters at up to 40% MFU.
RL Training Module: Uses Ray-based distributed actors, SingleController and DistributeWorker abstractions, and flexible collocated/disaggregated deployment to handle custom reward logic and multi-objective optimization.
Hardware Adaptation: Seamlessly switches between GPU and NPU with near-zero code changes; a minimal sketch follows this list.
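As a concrete illustration of the hardware-adaptation idea, here is a minimal device-selection sketch. It assumes Ascend's torch_npu plugin for the NPU path; 9N-LLM's actual abstraction layer is not public, so this shows the pattern rather than the engine's API.

```python
import torch

# Hypothetical device-selection helper. Assumes Ascend's torch_npu adapter,
# which registers a torch.npu namespace when imported. Package and API names
# are illustrative, not the 9N-LLM abstraction layer.
try:
    import torch_npu  # noqa: F401  (optional Ascend NPU plugin)
    _HAS_NPU = hasattr(torch, "npu") and torch.npu.is_available()
except ImportError:
    _HAS_NPU = False


def pick_device() -> torch.device:
    """Return the best available accelerator without touching model code."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if _HAS_NPU:
        return torch.device("npu")
    return torch.device("cpu")


# Model code stays identical across hardware; only the device string differs.
device = pick_device()
model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device)
y = model(x)
```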
Efficient Sample Engine
The engine reads columnar Parquet files, applies SIMD-based decompression, and exposes a chainable API such as dataset.map().prefetch().take() for vectorized preprocessing. It also shifts the DataLoader to batch-level operation, moving collator work into map(fn) to cut inter-process traffic.
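As a rough illustration of such a pipeline, the following Ray Data sketch reads Parquet, applies a vectorized, batch-level tokenization step via map_batches, and prefetches batches for the trainer. The column names, tokenization logic, and input path are made up for the example.

```python
import ray
import ray.data

ray.init(ignore_reinit_error=True)


def tokenize_batch(batch):
    # Vectorized, batch-level preprocessing (the "collator inside map(fn)" idea):
    # operate on whole columns instead of per-row Python loops.
    batch["token_ids"] = [seq.split(",") for seq in batch["behavior_seq"]]
    return batch


ds = (
    ray.data.read_parquet("s3://bucket/samples/")   # illustrative path; columnar Parquet input
    .map_batches(tokenize_batch, batch_format="pandas")
)

# Training side: batches are prefetched asynchronously, decoupling data
# preprocessing from model computation.
for batch in ds.iter_batches(batch_size=4096, prefetch_batches=4):
    pass  # feed the batch to the trainer
```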
Dynamic Sample Flexibility
Dynamic feature concatenation via HBase allows on‑the‑fly merging of policy‑generated samples with item features, eliminating costly offline joins. Fixed‑Period User Event compression aggregates multiple exposures into a single feature, reducing storage and computation while preventing feature leakage.
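The sketch below shows the shape of dynamic feature concatenation: policy-generated samples are joined with item-side features fetched from a key-value store at training time. ItemFeatureStore is a hypothetical stand-in for the HBase-backed service; the real client, schema, and batching are not public.

```python
from typing import Dict, List


class ItemFeatureStore:
    """Hypothetical KV client: item_id -> item feature dict."""

    def __init__(self, table: Dict[str, Dict]):
        self._table = table

    def multi_get(self, item_ids: List[str]) -> List[Dict]:
        return [self._table.get(i, {}) for i in item_ids]


def concat_features(policy_samples: List[Dict], store: ItemFeatureStore) -> List[Dict]:
    """Merge freshly generated policy samples with item-side features on the fly,
    instead of running a costly offline join."""
    item_ids = [s["item_id"] for s in policy_samples]
    item_feats = store.multi_get(item_ids)
    return [{**s, **f} for s, f in zip(policy_samples, item_feats)]


store = ItemFeatureStore({"sku_1": {"category": 12, "price_bucket": 3}})
samples = [{"user_id": "u_9", "item_id": "sku_1", "reward": 1.0}]
print(concat_features(samples, store))
```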
Lossless Checkpointing and Elastic Scaling
Row‑level identifiers track sample consumption, enabling fast skip of already processed data after failures. Dynamic sharding redistributes remaining samples across workers during elastic scaling, and sample checkpoints are synchronized with model checkpoints for seamless recovery.
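As an illustration of the row-level idea, the sketch below tracks consumed row IDs in a small JSON checkpoint and skips them on restart. The file layout, helper names, and checkpoint cadence are assumptions; 9N-LLM's actual checkpoint format and its coupling to model checkpoints are not public.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator

CKPT = Path("sample_ckpt.json")  # illustrative location


def load_consumed() -> set:
    return set(json.loads(CKPT.read_text())) if CKPT.exists() else set()


def save_consumed(consumed: set) -> None:
    # In practice this would be written atomically alongside the model
    # checkpoint so data progress and model state stay in sync.
    CKPT.write_text(json.dumps(sorted(consumed)))


def resumable_stream(samples: Iterable[Dict]) -> Iterator[Dict]:
    consumed = load_consumed()
    for sample in samples:
        if sample["row_id"] in consumed:
            continue  # fast-skip rows already trained on before the failure
        yield sample
        consumed.add(sample["row_id"])
        if len(consumed) % 10_000 == 0:
            save_consumed(consumed)
```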
Large‑Scale Sparse Distributed Engine
Parameters are partitioned across hosts (Host Memory) and cached in Device Memory (HBM) using a multi‑level KV store. All2All communication distributes keys, retrieves embeddings, and synchronizes gradients, achieving 1.14‑2.44× speedup over open‑source baselines. Benchmarks show embedding lookup latency reductions from 13 ms to 5 ms for small batches and from 54 ms to 32 ms for larger configurations.
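The sketch below illustrates the All2All key/embedding exchange on top of torch.distributed, assuming an initialized NCCL process group and a table sharded by key % world_size. It is a simplified stand-in for the engine: the multi-level HBM/host cache, the five-stage pipeline, and the gradient synchronization path are omitted.

```python
import torch
import torch.distributed as dist


def sharded_lookup(keys: torch.Tensor, local_table: torch.nn.Embedding, dim: int) -> torch.Tensor:
    """Look up embeddings for `keys` when the table is sharded across ranks."""
    world = dist.get_world_size()

    # 1) Bucket local keys by the rank that owns them (owner = key % world).
    owner = keys % world
    order = torch.argsort(owner)
    keys_sorted = keys[order]
    send_counts = torch.bincount(owner, minlength=world)

    # 2) Exchange counts, then the keys themselves (All2All #1).
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)
    recv_keys = torch.empty(int(recv_counts.sum()), dtype=keys.dtype, device=keys.device)
    dist.all_to_all_single(recv_keys, keys_sorted,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())

    # 3) Owners look up their shard (local row = global key // world in this toy layout).
    local_emb = local_table(recv_keys // world)

    # 4) Return embeddings to the requesting ranks (All2All #2).
    out = torch.empty(keys.numel(), dim, device=keys.device)
    dist.all_to_all_single(out, local_emb,
                           output_split_sizes=send_counts.tolist(),
                           input_split_sizes=recv_counts.tolist())

    # 5) Undo the sort so results line up with the original key order.
    result = torch.empty_like(out)
    result[order] = out
    return result
```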
UniAttention Acceleration Library
Standard FlashAttention cannot handle the heterogeneous masks of generative recommendation. JD developed UniAttention using Triton, Cutlass, and TileLang, applying register-level optimizations and a compute/mask dual-interval scheduler. This yields 70%–300% performance gains over FlexAttention/Compile and fully exploits Hopper's wgmma instructions.
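To see why a plain causal kernel is insufficient, the following sketch builds one plausible heterogeneous mask, a prefix-LM pattern in which user-history tokens are fully visible while generated SID tokens attend causally, and runs it through PyTorch's reference scaled_dot_product_attention. The mask family is an assumption for illustration; UniAttention's actual mask set and kernels are not reproduced here.

```python
import torch
import torch.nn.functional as F


def prefix_lm_mask(prefix_len: int, total_len: int, device="cpu") -> torch.Tensor:
    """Boolean mask, True = the key position may be attended to."""
    causal = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool, device=device))
    causal[:, :prefix_len] = True  # every query sees the full user-history prefix
    return causal


B, H, L, D, P = 2, 4, 16, 32, 6   # batch, heads, seq len, head dim, prefix length
q = k = v = torch.randn(B, H, L, D)
mask = prefix_lm_mask(P, L)        # (L, L), broadcasts over batch and heads
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```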
Reinforcement Learning Training
The RL stage follows a pre‑train‑fine‑tune‑RL flow, but differs from LLM RL in sample format (sparse ID/SID), custom beam‑search for SID generation, and heavy sparse‑parameter synchronization. JD’s Ray‑based RL framework provides multi‑scene compatibility, flexible resource scheduling (collocated vs. disaggregated), and extensible worker pipelines to integrate custom reward services and feature‑concatenation modules.
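A toy version of the single-controller / distributed-worker pattern on Ray is sketched below. RolloutWorker, its reward placeholder, and the controller's step loop are hypothetical stand-ins; they only illustrate how a driver process can fan rollouts out to actors and gather trajectories back, not 9N-LLM's SingleController/DistributeWorker implementation.

```python
import ray

ray.init(ignore_reinit_error=True)


@ray.remote
class RolloutWorker:
    """Generates candidate SID sequences and scores them with a custom reward."""

    def __init__(self, worker_id: int):
        self.worker_id = worker_id

    def rollout(self, user_batch):
        # Placeholder for beam-search SID generation plus a reward-service call.
        return [{"user": u, "sids": [1, 2, 3], "reward": 0.5} for u in user_batch]


class SingleController:
    """Driver process that owns the training loop and farms out rollouts."""

    def __init__(self, num_workers: int = 4):
        self.workers = [RolloutWorker.remote(i) for i in range(num_workers)]

    def step(self, user_batches):
        futures = [w.rollout.remote(b) for w, b in zip(self.workers, user_batches)]
        trajectories = [t for batch in ray.get(futures) for t in batch]
        # ...compute advantages, update the policy, sync sparse parameters...
        return trajectories


controller = SingleController(num_workers=2)
print(len(controller.step([["u1", "u2"], ["u3"]])))
```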
Conclusion
Generative recommendation reshapes the recommendation landscape, demanding AI infrastructure that scales linearly with compute growth. JD's 9N-LLM demonstrates a holistic solution, spanning efficient data pipelines, multi-framework and multi-hardware support, high-performance sparse/dense engines, and adaptable RL training, positioning it for future expansion in hardware scale, communication bandwidth, and intelligent reward modeling.
JD Cloud Developers
JD Cloud Developers (the developer account of JD Technology) is JD Technology Group's platform for technical sharing and exchange among AI, cloud computing, IoT, and related developers. It publishes technical information on JD products, industry content, and tech event news, embracing technology and partnering with developers to envision the future.