How JD’s 9N‑LLM Engine Powers Scalable Generative Recommendation at Industrial Scale
This article details JD Retail's 9N‑LLM unified training engine, which supports TensorFlow and PyTorch, runs on both GPUs and NPUs, and serves traditional as well as generative recommendation scenarios. It walks through the engine's architecture: a high‑throughput sample engine, a distributed sparse embedding system, a five‑stage training pipeline, the UniAttention accelerator, and reinforcement‑learning capabilities that together enable TB‑scale data, B‑scale dense parameters, and efficient RL training for production recommendation services.
Generative recommendation (GR) has become a hot research direction because it can model user intent with large language model (LLM)‑style sequence generation, improving diversity and breaking the performance ceilings of traditional recommenders. However, GR introduces new challenges: massive multi‑modal samples, TB‑scale sparse embeddings, B‑scale dense parameters, and complex multi‑stage pipelines spanning pre‑training, fine‑tuning, and RL.
Background and Motivation
Traditional deep‑learning recommenders face bottlenecks such as saturated feature engineering, difficult user‑intent modeling, error amplification in cascade architectures, and low compute utilization. The rise of LLM‑based GR offers a new paradigm, but scaling it industrially requires a training engine that can handle heterogeneous frameworks, hardware, and workloads.
9N‑LLM Unified Training Engine
JD Retail’s Nine‑Number (9N) algorithm platform built the Oxygen 9N‑LLM engine, a unified training framework that:
Supports both TensorFlow and PyTorch in the same process.
Runs seamlessly on GPU and NPU with near‑zero migration cost.
Handles traditional and generative recommendation scenarios.
Integrates a large‑scale sparse embedding engine, a custom UniAttention library, and a Ray‑based RL training stack.
Accelerates end‑to‑end training for models with up to 10 TB of sparse parameters and 10 B dense parameters, achieving over 40% model FLOPs utilization (MFU).
Efficient Sample Engine
The sample engine is built on Ray Data and provides:
Vectorized preprocessing and Parquet columnar storage to reduce I/O bandwidth and storage cost.
Chainable APIs such as dataset.map().prefetch().take() that operate on Arrow RecordBatches for fast in‑memory processing (a Ray Data sketch follows this list).
Row‑level checkpointing for lossless fault recovery and elastic scaling.
Dynamic feature concatenation via HBase, enabling on‑the‑fly item‑feature lookup without offline joins.
Fixed‑Period User Event compression to aggregate repeated exposures, cutting redundant computation and storage.
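For concreteness, here is a minimal sketch of such a chainable pipeline on Ray Data. The Parquet path, column names, and the transform are illustrative assumptions rather than 9N‑LLM internals; only the Ray Data calls (read_parquet, map_batches, take, iter_batches) are real API.

```python
import ray

ray.init(ignore_reinit_error=True)

# Parquet is read into Arrow-backed blocks, so columnar scans stay cheap.
ds = ray.data.read_parquet("s3://bucket/rec-samples/")  # hypothetical path

def add_label(batch):
    # Vectorized, batch-level preprocessing: no per-row Python loop.
    batch["label"] = (batch["clicks"] > 0).astype("int8")  # assumed column
    return batch

ds = ds.map_batches(add_label, batch_format="pandas")

print(ds.take(2))  # peek at a few processed rows

# Stream batches into the trainer; prefetching overlaps storage I/O
# with the training computation.
for batch in ds.iter_batches(batch_size=4096, prefetch_batches=4):
    pass  # feed one training step per batch
```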
Large‑Scale Distributed Sparse Engine
The engine bridges the TensorFlow sparse ecosystem and PyTorch dense ecosystem through a custom distributed training kernel:
An HBM/host‑memory hierarchical cache and a five‑stage pipeline (Data Prefetch → Data H2D → Input Dist → Embedding Lookup → Fwd/Bwd/Opt) that overlaps I/O, communication, and computation; the overlap idea is sketched after this list.
Device/Host KV store with lock‑free queues, enabling 10 TB‑scale sparse embedding sharding across nodes.
GPU‑Direct RDMA and symmetric memory for All‑to‑All embedding synchronization, delivering 5%–30% higher throughput than open‑source baselines; a torch.distributed sketch of the exchange also follows the list.
Optimized attention via the UniAttention library (Cutlass/Triton/Tilelang) that handles heterogeneous masks and multi‑segment sequences, outperforming FlexAttention/Compile by 70%–300%.
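The essence of the five‑stage pipeline is keeping the copy engine and the compute engine busy at the same time. The double‑buffered PyTorch sketch below shows the overlap idea for just two of the stages (Data H2D and Fwd/Bwd/Opt); `loader`, `model`, and `opt` are assumed to exist, the remaining stages are elided, and production code must also handle allocator stream ownership (e.g. record_stream).

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream for host-to-device copies

def to_device_async(cpu_batch):
    # Pinned memory + non_blocking=True lets the H2D copy run on copy_stream
    # while the default stream keeps computing on the previous batch.
    with torch.cuda.stream(copy_stream):
        return {k: v.pin_memory().to("cuda", non_blocking=True)
                for k, v in cpu_batch.items()}

it = iter(loader)
next_batch = to_device_async(next(it))  # prefetch batch 0
for cpu_batch in it:
    # Make the default stream wait for the finished copy of next_batch.
    torch.cuda.current_stream().wait_stream(copy_stream)
    batch, next_batch = next_batch, to_device_async(cpu_batch)  # prefetch i+1
    loss = model(batch)          # Fwd (assume the model returns the loss)
    loss.backward()              # Bwd
    opt.step(); opt.zero_grad()  # Opt
# (handling of the final prefetched batch is elided)
```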
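For the All‑to‑All embedding exchange itself, the hedged torch.distributed sketch below shows the two‑phase pattern on a row‑sharded table: ranks first exchange the IDs they need, then exchange the looked‑up rows back. Equal per‑peer splits and shard‑local indices are simplifying assumptions, and the GPU‑Direct RDMA / symmetric‑memory optimizations live below this API level.

```python
import torch
import torch.distributed as dist

# Launched via torchrun, one process per GPU.
dist.init_process_group("nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)

emb_dim, per_peer = 64, 1024
local_shard = torch.randn(100_000, emb_dim, device="cuda")  # this rank's rows

# Phase 1: tell each owner rank which rows we need from it. `wanted` holds
# per_peer shard-local indices destined for each peer, in rank order.
wanted = torch.randint(0, 100_000, (world * per_peer,), device="cuda")
requested = torch.empty_like(wanted)
dist.all_to_all_single(requested, wanted)

# Phase 2: serve lookups from the local shard and send the rows back.
served = local_shard[requested]            # (world * per_peer, emb_dim)
gathered = torch.empty_like(served)
dist.all_to_all_single(gathered, served)
# `gathered` now holds, block by peer, the embeddings this rank asked for.
```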
Reinforcement‑Learning (RL) Training Stack
RL is essential for fine‑tuning GR models. 9N‑LLM provides a Ray‑based RL framework with:
SingleController and DistributedWorker abstractions for both collocated and disaggregated deployment (see the sketch after this list).
IPC‑based parameter sync for dense parts and row‑level checkpoint sync for sparse parts.
Customizable pipelines that incorporate feature concatenation, reward services, and online inference.
Support for massive sparse parameter synchronization, which is a key bottleneck compared with dense‑only LLM RL.
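A minimal sketch of the single‑controller pattern on Ray is below. The class names mirror the abstractions above, but the bodies (rollout generation, reward scoring, the update step) are illustrative placeholders rather than 9N‑LLM's implementation.

```python
import ray

ray.init()

@ray.remote  # add num_gpus=1 per worker on a GPU cluster
class DistributedWorker:
    def __init__(self, rank: int):
        self.rank = rank  # a real worker would build its model shard here

    def rollout(self, prompts):
        # Placeholder: generate candidate recommendations per user context.
        return [f"worker{self.rank}:{p}" for p in prompts]

    def update(self, rewards):
        # Placeholder: one policy update; return a scalar training metric.
        return sum(rewards) / max(len(rewards), 1)

class SingleController:
    """One driver process orchestrating all remote workers."""

    def __init__(self, num_workers: int = 4):
        self.workers = [DistributedWorker.remote(i) for i in range(num_workers)]

    def train_step(self, prompts, reward_fn):
        # Fan out rollouts, score them, fan out updates, gather metrics.
        outs = ray.get([w.rollout.remote(prompts) for w in self.workers])
        rewards = [[reward_fn(o) for o in out] for out in outs]
        return ray.get([w.update.remote(r)
                        for w, r in zip(self.workers, rewards)])

ctrl = SingleController(num_workers=2)
print(ctrl.train_step(["user_ctx_1", "user_ctx_2"], reward_fn=len))
```

Collocated versus disaggregated deployment then becomes largely a placement decision: the same actors can be scheduled onto the training GPUs or onto a separate inference pool.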
Optimization Techniques
Beyond the core engine, 9N‑LLM reuses proven LLM training tricks (gradient accumulation, mixed‑precision quantization, dynamic learning‑rate schedules) and adds sparse‑specific strategies:
Feature admission filters (CountFilter, ProbabilityFilter, ShowClickFilter) and eviction policies (CountShrink, L2WeightShrink, TimeFreqCombineShrink, ShowClickShrink) to keep the dynamic vocabulary compact; a count‑based sketch follows this list.
Custom sparse optimizers (Lazy Adam, RAdaGrad, AdamDecay, AdamW) and learning‑rate schedulers (linear, cosine, exponential).
Shard‑wise checkpoint save/load with support for FP16/INT8/INT4 quantization and Kafka streaming for online deployment.
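As an illustration of how admission and eviction interact, here is a count‑based sketch in the spirit of CountFilter and CountShrink; the thresholds and the dict‑backed table are readability assumptions, not the engine's actual storage layout.

```python
import numpy as np

class DynamicEmbeddingTable:
    def __init__(self, dim: int, admit_after: int = 5):
        self.dim = dim
        self.admit_after = admit_after      # CountFilter-style threshold
        self.counts: dict[int, int] = {}    # feature ID -> observation count
        self.table: dict[int, np.ndarray] = {}

    def lookup(self, fid: int) -> np.ndarray:
        # Count every occurrence, but only materialize an embedding row once
        # the feature has been seen often enough; cold IDs map to zeros.
        count = self.counts.get(fid, 0) + 1
        self.counts[fid] = count
        if fid not in self.table and count >= self.admit_after:
            self.table[fid] = np.random.normal(0, 0.01, self.dim).astype(np.float32)
        return self.table.get(fid, np.zeros(self.dim, dtype=np.float32))

    def shrink(self, min_count: int) -> None:
        # CountShrink-style eviction: drop rows for features that stayed rare.
        for fid in [f for f in self.table if self.counts.get(f, 0) < min_count]:
            del self.table[fid]
```

The other filters and shrink policies vary the signal (show/click statistics, L2 weight norms, combined time and frequency) but follow the same admit‑then‑evict lifecycle.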
Future Directions
The authors emphasize two research fronts: (1) upgrading communication‑compute models and unified memory architectures to sustain hardware scaling, and (2) building next‑generation reward models that integrate multimodal signals and long‑term user value through online RL.
Overall, the 9N‑LLM framework demonstrates how a tightly integrated, multi‑framework, multi‑hardware system can overcome the data, model, and workflow challenges of generative recommendation at JD’s industrial scale.