How Generative Recommendation Systems Transform E‑Commerce with LLMs

This article explains how large language models reshape recommendation systems by simplifying pipelines, integrating world knowledge, and leveraging scaling laws, and details the engineering steps for deploying generative recall models—including product encoding, user prompting, model training, TensorRT‑LLM optimization, and continuous performance improvements.

JD Cloud Developers

Generative Recommendation System Advantages

Recommendation systems aim to predict user interests from historical behavior and suggest relevant items. Traditional systems use multiple recall modules (popular items, personalized recall, deep recall) to retrieve many candidates, then apply a simple coarse‑ranking model followed by fine‑ranking to produce the final list.

With large language models (LLMs) applied to recommendation, generative recommendation systems show three main advantages:

1) Simplified workflow: The architecture shifts from multi‑stage discriminative filtering to a single‑stage generative approach that directly produces recommendation results, reducing system complexity.

2) Knowledge integration: LLMs bring extensive world knowledge and reasoning ability, overcoming data limitations of traditional e‑commerce platforms. They improve cold‑start performance for new users, new items, and new domains, offering better transferability.

3) Scaling law: Unlike sparse click‑through‑rate (CTR) models whose performance plateaus as they grow, LLMs exhibit scaling‑law behavior where larger models continuously improve, enabling breakthroughs beyond conventional performance limits.

Comparison of traditional and LLM‑based generative recommendation systems

Generative Recall Solution Overview

1. Generative Recall Algorithm and Implementation

Generative recommendation involves two grounding steps: linking items to natural language and linking user behavior to target items. The process includes:

Product representation: Generating full item descriptions directly is impractical, so each item is represented by a short token sequence called a semantic ID. The titles, categories, and other attributes of frequently clicked items are encoded into vectors, which are then residual-quantized with RQ-VAE to obtain IDs such as <a_99><b_225><c_67><d_242>.
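To make the residual-quantization step concrete, here is a minimal sketch of how an item vector could be turned into a four-level semantic ID. It is not the authors' RQ-VAE: the codebooks are random rather than learned, and the sizes (4 levels, 256 codewords per level, 8 dimensions) and all function names are assumptions chosen for illustration.

```python
import random

def nearest_codeword(codebook, vec):
    """Index of the codeword with the smallest squared Euclidean distance to vec."""
    def sqdist(codeword):
        return sum((c - v) ** 2 for c, v in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))

def residual_quantize(vec, codebooks):
    """Quantize vec level by level; each level encodes what the previous ones missed."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest_codeword(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword so the next level sees only the remainder.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def to_semantic_id(codes, prefixes="abcd"):
    """Render level indices as a token string such as <a_99><b_225><c_67><d_242>."""
    return "".join(f"<{p}_{c}>" for p, c in zip(prefixes, codes))

# Hypothetical setup: random codebooks stand in for the learned RQ-VAE codebooks.
rng = random.Random(0)
codebooks = [
    [[rng.gauss(0.0, 1.0 / (level + 1)) for _ in range(8)] for _ in range(256)]
    for level in range(4)
]
item_embedding = [rng.gauss(0.0, 1.0) for _ in range(8)]  # stand-in for an encoded item
codes = residual_quantize(item_embedding, codebooks)
semantic_id = to_semantic_id(codes)
print(semantic_id)
```

Because every level quantizes only the residual left by the previous level, the early tokens capture coarse structure and the later ones refine it, so items with similar content tend to share ID prefixes.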

User profile & behavior modeling: Prompts are constructed to turn user information into text sequences, for example:

The user clicked these items in chronological order: <a_112><b_160><c_67><d_138>,<a_71><b_30><c_228><d_128>,<a_20><b_251><c_30><d_178> Which item do you predict the user is likely to click next?
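A prompt of this shape can be assembled with a few lines of string handling. The sketch below uses an English rendering of the template; the function name and the `max_items` truncation parameter are hypothetical and not taken from the article.

```python
def build_prompt(clicked_semantic_ids, max_items=50):
    """Format a click history (oldest first) as a next-item prediction prompt."""
    # Keep only the most recent items so the prompt length stays bounded.
    recent = clicked_semantic_ids[-max_items:]
    history = ",".join(recent)
    return (
        "The user clicked these items in chronological order: "
        f"{history} Which item do you predict the user is likely to click next?"
    )

history = [
    "<a_112><b_160><c_67><d_138>",
    "<a_71><b_30><c_228><d_128>",
    "<a_20><b_251><c_30><d_178>",
]
prompt = build_prompt(history)
print(prompt)
```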

Model training: With user representations as input and product semantic IDs as output, the model is trained on a next‑token prediction task.
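The following sketch shows one way such a training example could be laid out for next-token prediction. It makes several assumptions that the article does not state: only semantic-ID tokens are modeled (the natural-language part of the prompt is omitted), the loss is computed on the target item only, and the tiny vocabulary is built from the example itself. The value -100 follows the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import math
import re

IGNORE_INDEX = -100  # positions with this label do not contribute to the loss

def split_tokens(semantic_id):
    """Split '<a_1><b_2>...' into its individual level tokens."""
    return re.findall(r"<[a-z]_\d+>", semantic_id)

def build_example(history, target, vocab):
    """Concatenate history and target; only target positions carry a label."""
    prompt = [vocab[t] for sid in history for t in split_tokens(sid)]
    answer = [vocab[t] for t in split_tokens(target)]
    return prompt + answer, [IGNORE_INDEX] * len(prompt) + answer

def next_token_loss(logits, labels):
    """Mean cross-entropy, where logits[t] scores the token at position t + 1."""
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue
        row = logits[t]
        log_z = math.log(sum(math.exp(x) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count

history = [
    "<a_112><b_160><c_67><d_138>",
    "<a_71><b_30><c_228><d_128>",
    "<a_20><b_251><c_30><d_178>",
]
target = "<a_99><b_225><c_67><d_242>"
tokens = sorted({t for sid in history + [target] for t in split_tokens(sid)})
vocab = {tok: i for i, tok in enumerate(tokens)}
input_ids, labels = build_example(history, target, vocab)

# All-zero logits imitate an untrained model that predicts uniformly.
logits = [[0.0] * len(vocab) for _ in input_ids]
loss = next_token_loss(logits, labels)
print(len(input_ids), round(loss, 4))
```

With uniform predictions the loss equals the logarithm of the vocabulary size, which gives a convenient sanity check before real training starts.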

Model inference: After training, the generative model receives user information and predicts product semantic IDs, which map back to actual item IDs in the catalog.
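Mapping generated semantic IDs back to catalog items can be sketched with a dictionary lookup, plus a prefix tree that lists which tokens lead to a real item at each step. The catalog entries, SKU names, and function names below are invented for illustration; the sketch also assumes that several items may share one semantic ID.

```python
def build_lookup(catalog):
    """Build an exact-match index and a prefix tree over the catalog's semantic IDs."""
    index, trie = {}, {}
    for item_id, tokens in catalog.items():
        index.setdefault(tuple(tokens), []).append(item_id)
        node = trie
        for tok in tokens:
            node = node.setdefault(tok, {})
    return index, trie

def allowed_next_tokens(trie, prefix):
    """Tokens that extend prefix toward at least one catalog item."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return sorted(node)

# Hypothetical catalog: two items collide on the same semantic ID.
catalog = {
    "sku_1001": ("<a_99>", "<b_225>", "<c_67>", "<d_242>"),
    "sku_1002": ("<a_99>", "<b_225>", "<c_67>", "<d_242>"),
    "sku_2001": ("<a_99>", "<b_30>", "<c_228>", "<d_128>"),
}
index, trie = build_lookup(catalog)

generated = ("<a_99>", "<b_225>", "<c_67>", "<d_242>")
matched_items = index.get(generated, [])
next_options = allowed_next_tokens(trie, ("<a_99>",))
print(matched_items)
print(next_options)
```

Restricting each decoding step to `allowed_next_tokens` is one way to guarantee that every generated ID resolves to an existing item; an unconstrained model could otherwise emit token combinations that match nothing in the catalog.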

2. Engineering Adaptation for LLM Deployment

Traditional deep‑learning recall models have tens of thousands to millions of parameters and rely mainly on embedding layers. Generative recall models built on LLMs range from 0.5 B to 7 B parameters, primarily dense networks, demanding significantly more compute—often tens to hundreds of times higher than conventional models. Deploying such large models online with millisecond‑level latency while controlling costs requires extreme inference‑engine optimization.

Online inference architecture

3. Optimization and Deployment with TensorRT‑LLM

Using TensorRT-LLM, the LLM is built and optimized at the modeling layer, then integrated into the existing ecosystem via Python and TensorFlow APIs. Inference is accelerated with techniques such as in-flight batching, constrained sampling, Flash Attention, and Paged Attention, which together raise per-GPU throughput and reduce latency.
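The benefit of in-flight batching can be illustrated with a toy step counter. This is a simplified simulation, not TensorRT-LLM's scheduler: each request is reduced to a hypothetical number of remaining decode steps, and a batch slot is refilled as soon as its request finishes instead of waiting for the whole batch to drain.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes before the next batch starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def inflight_batching_steps(lengths, batch_size):
    """Finished requests free their slot immediately for the next waiting request."""
    waiting = deque(lengths)
    active = []
    steps = 0
    while waiting or active:
        # Fill any free slots before running the next decode step.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Every active request advances by one step; completed ones drop out.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps

# Hypothetical workload: decode steps needed per request, all at least 1.
lengths = [4, 1, 1, 4, 1, 1]
static_steps = static_batching_steps(lengths, batch_size=2)
inflight_steps = inflight_batching_steps(lengths, batch_size=2)
print(static_steps, inflight_steps)
```

In this workload the short requests no longer wait for the long ones sharing their batch, so the same GPU slots complete all requests in fewer decode steps.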

Generative recall is deployed in parallel with traditional multi‑branch recall, consuming fewer resources and achieving superior recall performance.

Generative recall parallel with traditional multi‑branch recall

4. Application Results in Recommendation and Search

Generative recall has been successfully applied in JD advertising recommendation and search scenarios. In recommendation, A/B tests show significant lifts in click-through rate (CTR) and conversion. In search, the LLM's semantic understanding improves query-item matching, especially for long-tail queries, yielding noticeable gains in fill rate, CTR, and revenue.

Inference Optimization: Reducing Latency and Boosting Throughput

Online inference built on NVIDIA TensorRT-LLM, combined with custom business-specific optimizations, meets real-time latency requirements while substantially increasing throughput. In GPU tests under a 100 ms latency constraint, TensorRT-LLM delivered more than five times the throughput of the baseline, cutting deployment cost to roughly one fifth.
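The link between throughput and cost is simple arithmetic: at a fixed traffic level, the number of GPUs required scales inversely with per-GPU throughput. The figures below (target traffic and baseline per-GPU throughput) are hypothetical placeholders, not measurements from the article; only the fivefold factor comes from the text.

```python
import math

def gpus_needed(target_qps, per_gpu_qps):
    """Smallest GPU count whose combined throughput covers the target traffic."""
    return math.ceil(target_qps / per_gpu_qps)

target_qps = 10_000          # hypothetical peak traffic
baseline_per_gpu_qps = 40.0  # hypothetical baseline throughput within the latency budget
speedup = 5.0                # fivefold improvement reported for TensorRT-LLM

baseline_gpus = gpus_needed(target_qps, baseline_per_gpu_qps)
optimized_gpus = gpus_needed(target_qps, baseline_per_gpu_qps * speedup)
cost_ratio = optimized_gpus / baseline_gpus
print(baseline_gpus, optimized_gpus, cost_ratio)
```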

TensorRT‑LLM vs baseline performance (Qwen2‑1.5B)

Appropriate beam width configuration is crucial; higher beam width increases candidate items and improves retrieval accuracy, but also raises computational cost.
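The trade-off can be seen in a small beam search over a two-level toy model. The probability table is invented so that the greedy choice (beam width 1) misses the most likely sequence; the function also counts how many candidate extensions were scored, as a rough proxy for compute. None of these names or numbers come from the production system.

```python
import math

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TABLE = {
    (): {"<a_1>": 0.6, "<a_2>": 0.4},
    ("<a_1>",): {"<b_1>": 0.5, "<b_2>": 0.5},
    ("<a_2>",): {"<b_1>": 0.9, "<b_2>": 0.1},
}

def toy_model(prefix):
    """Return log-probabilities for the next token given the prefix."""
    return {tok: math.log(p) for tok, p in TABLE[prefix].items()}

def beam_search(step_logprobs, num_levels, beam_width):
    """Keep the beam_width best partial sequences at every level."""
    beams = [((), 0.0)]
    scored = 0  # number of candidate extensions evaluated
    for _ in range(num_levels):
        candidates = []
        for prefix, score in beams:
            for tok, logprob in step_logprobs(prefix).items():
                candidates.append((prefix + (tok,), score + logprob))
                scored += 1
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams, scored

top1, cost1 = beam_search(toy_model, num_levels=2, beam_width=1)
top2, cost2 = beam_search(toy_model, num_levels=2, beam_width=2)
print(top1[0][0], cost1)
print(top2[0][0], cost2)
```

With a width of 1 the search commits to `<a_1>` and never explores `<a_2>`, whose best continuation is more probable overall. A width of 2 recovers it and returns two candidates instead of one, at the price of scoring more extensions per level.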

In collaboration with NVIDIA DevTech, the team developed custom high-performance GPU operators that support larger beam widths and meet online demand.

Continuous Optimization for Model Efficiency

Future work focuses on three areas:

1) Scaling model size for real-time inference: Compute, latency, and cost constraints currently limit generative recommendation models to 0.5 B–6 B parameters. Offline experiments indicate that larger models would further improve online performance, which motivates research into model pruning, quantization, efficient sampling, and distributed inference architectures.

2) Extending user behavior input: Longer user history improves recommendation quality but increases resource consumption. Solutions include token sequence compression and KV‑cache reuse for long‑term behavior, balancing effectiveness and efficiency.

3) Merging sparse and dense models: Combining traditional sparse CTR models with dense LLMs can leverage the high‑dimensional efficiency of sparse features and the deep semantic understanding of dense models, creating a hybrid system that is both fast and accurate.
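The KV-cache reuse idea in item 2 can be sketched as a bookkeeping exercise: if the key/value states for a user's long-term history are already cached, a new request only needs to process the recent behavior. The class below is a hypothetical stand-in that counts tokens rather than storing real tensors, and the history sizes are made up.

```python
class PrefixKVCache:
    """Toy stand-in for a KV cache keyed by a user's long-term behavior tokens."""

    def __init__(self):
        self._entries = {}

    def tokens_to_prefill(self, user_id, long_term, recent):
        """Return how many tokens must be run through the model for this request."""
        key = (user_id, tuple(long_term))
        if key in self._entries:
            # Cache hit: the long-term prefix is reused, only recent tokens are computed.
            return len(recent)
        # Cache miss: compute everything and remember the long-term prefix.
        self._entries[key] = len(long_term)  # a real cache would hold KV tensors here
        return len(long_term) + len(recent)

cache = PrefixKVCache()
long_term = [f"<a_{i}>" for i in range(400)]   # hypothetical long-term history tokens
recent = [f"<b_{i}>" for i in range(12)]       # hypothetical recent-session tokens

first_request = cache.tokens_to_prefill("user_42", long_term, recent)
second_request = cache.tokens_to_prefill("user_42", long_term, recent)
print(first_request, second_request)
```

The cached prefix is only valid while the long-term history stays unchanged, which is why the key includes the token sequence itself and not just the user ID.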
