
Accelerating Generative Recommendation with NVIDIA TensorRT‑LLM in JD Advertising

JD Advertising accelerates its generative‑recall recommendation system by integrating NVIDIA TensorRT‑LLM, which simplifies the pipeline, injects LLM knowledge, and scales to billions of parameters, delivering more than 5× throughput at roughly one‑fifth the cost, along with significant CTR improvements in both recommendation and search.

JD Retail Technology

This article introduces how JD Advertising tackles the new challenges posed by large language models (LLMs) in advertising scenarios by adopting the NVIDIA TensorRT‑LLM inference engine to accelerate generative‑recall inference.

01 Generative Recommendation System Advantages

Traditional recommendation systems retrieve candidates through multiple recall modules and rank them with coarse‑ and fine‑ranking models. Generative recommendation, powered by LLMs, offers three main benefits:

1) Simplified workflow: shifts from a multi‑stage discriminative architecture to a single‑stage generative architecture, directly generating recommendation results and reducing system complexity.

2) Knowledge integration: LLMs bring world knowledge and reasoning ability, alleviating data sparsity in new‑user, new‑item, and new‑domain scenarios and thus improving cold‑start performance and transferability.

3) Scaling law: unlike sparse CTR models, which see diminishing returns with scale, LLM performance continues to improve as model size grows, enabling better recommendation quality.

Figure 1: Comparison of traditional and LLM‑based generative recommendation systems (source: arXiv:2309.01157)

02 Generative Recall Scheme Introduction

Using JD Advertising as a case study, this section details how LLMs are applied in recommendation.

2.1 Generative Recall Algorithm and Implementation Overview

The generative pipeline involves two grounding steps: linking items to natural language and linking user behavior to target items. The process includes:

1) Item representation: items are encoded as short text sequences (semantic IDs) derived from high‑click titles and categories, transformed into vectors by an encoder, and quantized with RQ‑VAE.

2) User profile & behavior modeling: prompts are constructed to convert user profiles and historical actions into textual sequences, e.g., “User clicked these items in order: … What item will the user click next?”

3) Model training: the generative model is trained on a next‑token prediction task, using the user representation as input and the item semantic ID as output.

4) Model inference: after training, the model predicts semantic IDs for given user contexts, which are then mapped back to actual item IDs. A detailed algorithmic description is available in the linked overview.
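The residual‑quantization idea behind semantic IDs (step 1 above) can be sketched with a toy example. This is a minimal illustration using random, untrained codebooks as stand‑ins, not JD's actual RQ‑VAE; all names and sizes here are hypothetical:

```python
import numpy as np

def residual_quantize(vec, codebooks):
    """Map an item embedding to a semantic ID: at each level, pick the
    nearest codeword, then quantize the remaining residual at the next level."""
    ids, residual = [], vec.astype(float)
    for codebook in codebooks:  # one codebook per quantization level
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))
        ids.append(idx)
        residual = residual - codebook[idx]
    return tuple(ids)  # e.g. (12, 3, 47), serialized as tokens like "<a_12><b_3><c_47>"

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(64, 8)) for _ in range(3)]  # 3 levels, 64 codewords each
item_embedding = rng.normal(size=8)  # would come from the text encoder in practice
semantic_id = residual_quantize(item_embedding, codebooks)
```

Because each level quantizes only the residual left over from the previous one, a short tuple of small codebook indices can address a very large catalog, which is what lets the LLM generate items as token sequences.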

2.2 Engineering Adaptation for LLM Deployment

Traditional deep‑learning recall models contain tens of thousands to millions of parameters, mainly embeddings. Generative recall models based on LLMs scale to 0.5 B–7 B parameters, dramatically increasing compute requirements—often tens to hundreds of times higher than classic models. Deploying such large models with millisecond‑level latency while controlling costs demands extreme performance optimization of the online inference stack.

Figure 2: Online inference architecture

2.3 LLM Construction Optimization with TensorRT‑LLM and System Deployment

• Modeling layer: TensorRT‑LLM is used to build and optimize the LLM, integrating it with existing Python and TensorFlow pipelines, and leveraging custom TensorFlow operators for user‑behavior features.

• Inference optimization layer: techniques such as in‑flight batching, constrained sampling, Flash Attention, and Paged Attention are applied to maximize per‑GPU throughput and minimize latency.

• System deployment: Generative recall runs in parallel with traditional multi‑branch recall modules, consuming fewer resources and delivering superior recall performance.
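The constrained sampling mentioned above can be illustrated with a toy trie over valid semantic‑ID sequences, so that decoding can only emit token sequences that map back to real items. This is a schematic sketch of the idea, not the TensorRT‑LLM implementation; all names are hypothetical:

```python
def build_trie(valid_ids):
    """Nested-dict trie over the catalog's valid semantic-ID token sequences."""
    root = {}
    for seq in valid_ids:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def allowed_next_tokens(trie, prefix):
    """Tokens that extend `prefix` toward at least one real item."""
    node = trie
    for tok in prefix:
        node = node.get(tok, {})
    return set(node.keys())

def constrained_argmax(logits, allowed):
    """Greedy pick, but only among tokens the trie allows."""
    return max(allowed, key=lambda t: logits[t])

catalog = [(1, 4, 2), (1, 4, 7), (3, 0, 5)]  # toy semantic IDs
trie = build_trie(catalog)
logits = [0.1, 0.9, 0.3, 0.8, 0.2, 0.5, 0.0, 0.6]  # one model step (toy values)
first = constrained_argmax(logits, allowed_next_tokens(trie, ()))        # 1 or 3 only
second = constrained_argmax(logits, allowed_next_tokens(trie, (first,)))
```

Masking logits this way guarantees every generated sequence decodes to an actual item ID, avoiding wasted beam slots on hallucinated IDs.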

Figure 3: Parallel deployment of generative and traditional recall

2.4 Application Effects in Recommendation and Search

Generative recall has been deployed in JD Advertising’s recommendation and search pipelines. A/B tests show significant lifts in click‑through rate (CTR) and conversion. In search, the semantic understanding of LLMs improves long‑tail query coverage and overall relevance.

03 Inference Optimization and Acceleration

Using NVIDIA TensorRT‑LLM with custom business‑specific optimizations, the online inference meets latency requirements while achieving more than a 5× throughput increase compared to the baseline, effectively reducing deployment cost to one‑fifth.

Figure 4: TensorRT‑LLM vs. baseline (Qwen2‑1.5B, beam 5, vocab 150k, input 150, output 4)

The beam width configuration critically influences retrieval accuracy; larger beam widths increase candidate diversity but require more GPU memory. Collaboration with NVIDIA DevTech enabled custom high‑performance GPU kernels, allowing wider beams without sacrificing latency.
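The beam‑width trade‑off can be made concrete with a toy beam search over per‑step token log‑probabilities: wider beams keep more candidate semantic IDs alive, at the cost of memory and compute. This is a didactic sketch, not the optimized GPU kernel:

```python
import math

def beam_search(step_logprobs, beam_width):
    """step_logprobs: one dict {token: logprob} per decoding step.
    Keeps the beam_width best prefixes by cumulative log-probability."""
    beams = [((), 0.0)]  # (prefix, cumulative logprob)
    for logprobs in step_logprobs:
        candidates = [
            (prefix + (tok,), score + lp)
            for prefix, score in beams
            for tok, lp in logprobs.items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # memory grows with beam_width
    return beams

steps = [
    {0: math.log(0.6), 1: math.log(0.4)},
    {0: math.log(0.5), 1: math.log(0.3), 2: math.log(0.2)},
]
narrow = beam_search(steps, beam_width=1)  # greedy: a single candidate item
wide = beam_search(steps, beam_width=5)    # more diverse candidate items
```

In a recall setting, each surviving beam is a candidate semantic ID, so the beam width directly bounds how many items a single generation pass can retrieve.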

04 Continuous Optimization for Model Efficiency

Future work focuses on three directions:

1) Scaling model size: explore pruning, quantization, and distributed inference to support 0.5 B–6 B parameter models in real time.

2) Extending user‑behavior input: compress longer behavior sequences and cache long‑term features to balance effectiveness and compute cost.

3) Fusing sparse and dense models: combine traditional CTR models (sparse) with dense LLMs for joint inference, leveraging the strengths of both.
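As a rough sketch, direction 3 might take the form of a simple score fusion at serving time. The blending weight and all names here are hypothetical illustrations, not JD's design:

```python
import math

def fused_score(ctr_prob, llm_logprob, alpha=0.7):
    """Blend a sparse CTR model's click probability with a dense LLM's
    sequence log-probability (mapped back to a probability)."""
    return alpha * ctr_prob + (1 - alpha) * math.exp(llm_logprob)

# rank candidates by the blended score: (item, ctr_prob, llm_logprob)
candidates = [("item_a", 0.12, math.log(0.30)),
              ("item_b", 0.20, math.log(0.05))]
ranked = sorted(candidates, key=lambda c: fused_score(c[1], c[2]), reverse=True)
```

Linear blending is only the simplest option; joint training or feature-level fusion would couple the two models more tightly.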

For more technical practices, see the following resources:

JD Retail Advertising R&D: Next‑Generation Advertising System in the Era of Large Models

ECCV 2024 | JD Retail Advertising Creative: Trustworthy Image Generation Based on Human Feedback

Generative Recommendation System and JD Alliance Advertising Practice

Tags: LLM, inference optimization, recommendation systems, TensorRT-LLM
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
