
Advantages and Engineering Implementation of Generative Recommendation Systems Using Large Language Models

This article explains how generative recommendation systems powered by large language models simplify the recommendation pipeline, integrate world knowledge, and benefit from scaling laws. It also covers the specialized engineering work, including TensorRT‑LLM deployment, inference acceleration, and hybrid model strategies, needed to achieve low latency and high throughput in real‑world e‑commerce scenarios.

JD Tech Talk

Traditional recommendation systems predict user interests from historical behavior by invoking multiple recall modules (hot items, personalized, deep recall, etc.), applying a coarse ranking model to filter candidates, and finally using fine‑ranking and re‑ranking models to produce the final list.
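The multi‑stage flow above can be sketched roughly as follows. All function bodies here are toy stand‑ins for the real recall and ranking models, invented for illustration:

```python
# Toy sketch of the traditional multi-stage pipeline: several recall branches,
# a cheap coarse ranker to trim the pool, then an expensive fine ranker.
# Every scoring function is an illustrative placeholder, not a real model.

def recall_hot(user):
    return {"item_a", "item_b"}

def recall_personalized(user):
    return {"item_b", "item_c"}

def recall_deep(user):
    return {"item_d"}

def coarse_score(user, item):
    # Cheap model used only to shrink the candidate pool.
    return len(item)

def fine_score(user, item):
    # Expensive model applied only to the coarse-ranked shortlist.
    return hash((user, item)) % 100

def recommend(user, top_k=2):
    # Stage 1: union of several recall branches.
    candidates = recall_hot(user) | recall_personalized(user) | recall_deep(user)
    # Stage 2: coarse ranking keeps only the best candidates.
    shortlist = sorted(candidates, key=lambda i: coarse_score(user, i),
                       reverse=True)[:3]
    # Stage 3: fine ranking (re-ranking omitted for brevity) gives the final list.
    return sorted(shortlist, key=lambda i: fine_score(user, i),
                  reverse=True)[:top_k]
```

The point of the sketch is the shape of the system: each stage is a separate model with its own training and serving stack, which is exactly the complexity the generative approach collapses.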

With the introduction of large language models (LLMs), generative recommendation systems offer three main advantages: (1) simplifying the workflow by moving from a multi‑stage discriminative architecture to a single‑stage generative architecture that directly produces recommendations; (2) integrating extensive world knowledge and reasoning capabilities of LLMs to overcome data sparsity, especially for cold‑start users, new items, and new domains; and (3) leveraging scaling laws, where increasing model size continuously improves performance, unlike traditional CTR‑sparse models that suffer diminishing returns.

An example from JD.com’s advertising scenario demonstrates how LLM‑based generative recall is applied in practice.

The generative recall solution consists of two grounding steps: linking products to natural language, and linking user behavior to target products. Products are represented as short text sequences called semantic IDs, e.g., <a_99><b_225><c_67><d_242>. User profiles and histories are converted into prompt texts, such as "User clicked these items in order: <a_112><b_160><c_67><d_138>, ... What is the next likely item?". The model is trained on a next‑token prediction task; after training, it generates product semantic IDs from user inputs, which are then mapped back to actual product IDs.
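A minimal sketch of the two mappings, reusing the semantic IDs quoted above. The lookup tables and SKU names are invented for illustration; in practice the codes are learned (e.g., by quantizing item embeddings) rather than hand‑assigned:

```python
# Toy codebook: product ID <-> semantic ID string. Only the two semantic IDs
# come from the article; the SKU names are made up.
ITEM_TO_SID = {
    "sku_1001": "<a_112><b_160><c_67><d_138>",
    "sku_1002": "<a_99><b_225><c_67><d_242>",
}
SID_TO_ITEM = {sid: item for item, sid in ITEM_TO_SID.items()}

def build_prompt(clicked_items):
    """Turn a click history into the next-item prompt fed to the LLM."""
    sids = ", ".join(ITEM_TO_SID[i] for i in clicked_items)
    return f"User clicked these items in order: {sids}. What is the next likely item?"

def decode(generated_sid):
    """Map a generated semantic ID back to an actual product ID (None if unknown)."""
    return SID_TO_ITEM.get(generated_sid)
```

During training, prompts built this way are paired with the semantic ID of the next clicked item, so next‑token prediction directly learns the recall task.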

Deploying such LLMs poses engineering challenges: model parameters range from 0.5 B to 7 B, dramatically increasing compute requirements compared to traditional embedding‑based recall models. Achieving millisecond‑level online inference while controlling resource costs demands extreme performance optimization of the inference stack.

Optimization is performed with NVIDIA TensorRT‑LLM. At the modeling layer, TensorRT‑LLM builds and optimizes the LLM and integrates it with the existing Python and TensorFlow pipelines, including custom operators for user behavior features. At the inference layer, techniques such as in‑flight batching, constrained sampling, FlashAttention, and paged attention raise per‑GPU throughput and reduce latency. The system is deployed in parallel with the traditional multi‑branch recall, consuming fewer resources and running in less time.
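Constrained sampling deserves a closer look: the generator must never emit a token sequence that is not a valid semantic ID. Engines like TensorRT‑LLM support this through logits post‑processing; the pure‑Python sketch below (trie contents and token names invented) only illustrates the idea:

```python
import math

def build_trie(valid_sequences):
    """Prefix trie over the token sequences of all valid semantic IDs."""
    trie = {}
    for seq in valid_sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie

def constrained_greedy(step_logits, trie):
    """At each step, pick the highest-logit token among those the trie allows,
    guaranteeing the output is the prefix of some known semantic ID."""
    node, out = trie, []
    for logits in step_logits:
        tok = max(node, key=lambda t: logits.get(t, -math.inf))
        out.append(tok)
        node = node[tok]
    return out

valid = [("<a_99>", "<b_225>"), ("<a_112>", "<b_160>")]
trie = build_trie(valid)
# Unconstrained greedy decoding would pick "<a_7>" (highest logit), but it is
# not a valid first token, so the mask restricts the choice to <a_99>/<a_112>.
logits = [{"<a_7>": 9.0, "<a_99>": 2.0, "<a_112>": 1.0},
          {"<b_160>": 5.0, "<b_225>": 4.0}]
print(constrained_greedy(logits, trie))  # ['<a_99>', '<b_225>']
```

Because invalid continuations are masked out before sampling, every generated sequence maps back to a real product, which removes a whole class of decoding failures from the online path.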

Benchmarking on NVIDIA GPUs shows that, under a 100 ms latency budget for advertising scenarios, TensorRT‑LLM with custom optimizations achieves more than five‑fold higher throughput than the baseline, effectively reducing deployment cost to one‑fifth.

Future work focuses on three directions: (1) scaling models larger (up to 6 B parameters) while maintaining real‑time inference through model pruning, quantization, and distributed inference; (2) extending user behavior inputs via token sequence compression and KV‑cache reuse to balance effectiveness and efficiency; and (3) combining sparse CTR models with dense LLMs for hybrid inference that exploits both high‑dimensional sparse features and deep semantic understanding.
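For direction (3), one simple form of hybrid inference is late fusion of the two model scores. The blend weight and both scorers below are assumptions for illustration, not JD's disclosed method:

```python
def hybrid_rank(candidates, ctr_scores, llm_scores, alpha=0.7, top_k=2):
    """Blend a sparse CTR model's score with a dense LLM relevance score
    (late fusion) and return the top-k items. alpha is a tunable weight."""
    blended = {c: alpha * ctr_scores[c] + (1 - alpha) * llm_scores[c]
               for c in candidates}
    return sorted(candidates, key=blended.get, reverse=True)[:top_k]

# Toy scores: sku_1 looks best to the CTR model, sku_2 to the LLM.
ctr = {"sku_1": 0.9, "sku_2": 0.2, "sku_3": 0.5}
llm = {"sku_1": 0.1, "sku_2": 0.9, "sku_3": 0.8}
print(hybrid_rank(ctr.keys(), ctr, llm))  # ['sku_1', 'sku_3']
```

The appeal of this shape is that the cheap sparse model and the expensive dense model can be served independently and fused at the end, so neither stack constrains the other's latency budget.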

Tags: AI, LLM, recommendation system, inference optimization, generative recommendation, TensorRT-LLM
Written by

JD Tech Talk

Official JD Tech public account delivering best practices and technology innovation.
