How Large Language Models Are Revolutionizing Generative Recommendation Systems
Over the past year, generative recommendation has made substantial progress by leveraging large language models' powerful sequence modeling and reasoning abilities, introducing a new paradigm that replaces complex handcrafted features and addresses traditional recommendation bottlenecks. This article outlines the evolution, core technologies, engineering challenges, and future directions of LLM‑based recommendation systems.
Generative Recommendations Powered by Large Language Models
In the last year, generative recommendation (GR) has achieved significant advances, especially by using the strong sequence‑modeling and reasoning capabilities of large language models (LLMs) to improve overall recommendation performance. GRs form a new paradigm distinct from discriminative recommendation, offering the potential to replace traditional systems that rely heavily on handcrafted features.
1. Introduction: Traditional Recommendation Challenges and the LLM Breakthrough
Recommendation systems have evolved through three technical paradigms: Machine‑Learning‑based Recommendation (MLR), Deep‑Learning‑based Recommendation (DLR), and Generative Recommendations (GRs).
1.1 Bottlenecks of Traditional Paradigms
MLR depends on explicit feature engineering and collaborative or content‑based filtering.
DLR uses deep neural networks to automatically learn complex non‑linear representations from raw or sparse features, but model complexity has reached diminishing returns.
Figure 1 shows the increasing complexity of DLRM models, from early DWE to DIN and SIM, highlighting the “model‑size vs. marginal gain” problem.
Key issues include:
Feature‑engineering dependence: mature feature sources have been largely mined out, and handcrafted features are increasingly costly to build and generalize poorly.
Model‑engineering ceiling: current architectures cannot effectively model world knowledge or reason about user intent across modalities.
Cascade‑architecture error amplification: multi‑stage pipelines (recall → coarse ranking → ranking → re‑ranking) split objectives across teams, causing goal fragmentation and error propagation.
Resource waste in the cascade: inter‑stage communication and caching account for more than 50% of online serving resources.
Low GPU utilization: traditional dense recommendation models achieve only 4.6% (training) and 11.2% (inference) model FLOPs utilization (MFU), while LLMs on H100 GPUs can reach 40–50% MFU (see the quick calculation below).
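As a sanity check on what MFU means here, a back‑of‑the‑envelope calculation; the throughput and FLOP figures below are illustrative assumptions, not numbers from this article:

```python
# Back-of-the-envelope MFU (model FLOPs utilization) calculation.
# All numbers below are illustrative assumptions.

def mfu(tokens_per_second: float, flops_per_token: float, peak_flops: float) -> float:
    """MFU = achieved FLOPs per second / hardware peak FLOPs per second."""
    return tokens_per_second * flops_per_token / peak_flops

# Example: a model that costs ~2 GFLOPs per token, serving 50k tokens/s
# on a GPU with ~1 PFLOP/s of peak dense throughput.
print(f"MFU: {mfu(5e4, 2e9, 1e15):.1%}")   # -> 10.0%
```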
1.2 Disruptive Potential of LLMs
LLMs and vision‑language models (VLMs) have achieved breakthroughs driven by scaling laws and advances in reinforcement learning. Their chain‑of‑thought reasoning enables a paradigm shift in recommendation:
Long‑sequence modeling: treat user behavior as a time series and capture deep dependencies via autoregressive prediction (a minimal sketch appears at the end of this subsection).
World‑knowledge injection: pretrained LLM/VLM embeddings carry cross‑domain, multimodal knowledge, alleviating cold‑start for new users and items.
End‑to‑end generation: a single model directly outputs a ranked list, eliminating cascade errors.
The shift is from “predicting similarity” to “reasoning user needs”.
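To make the autoregressive view concrete, here is a minimal sketch: a small causal Transformer trained to predict the next item ID from the interaction history. The dimensions and the toy data are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class NextItemModel(nn.Module):
    """Toy causal Transformer that predicts the next item in a behavior sequence."""
    def __init__(self, num_items: int, dim: int = 64, heads: int = 4, layers: int = 2):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, num_items)

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        seq_len = item_ids.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.encoder(self.item_emb(item_ids), mask=causal_mask)
        return self.head(h)  # logits over the item vocabulary at every position

# Train on shifted sequences: position t predicts the item at position t + 1.
model = NextItemModel(num_items=1000)
seqs = torch.randint(0, 1000, (8, 20))          # toy batch of behavior sequences
logits = model(seqs[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), seqs[:, 1:].reshape(-1))
loss.backward()
```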
1.3 Why Now?
GRs are exploding in 2025 because LLM maturity aligns with industrial recommendation demands. Key drivers:
LLM ecosystem maturity: improved training (distributed data/model/pipeline parallelism, mixed precision, SFT, RLHF) and inference (FlashAttention, continuous batching) reduce cost and latency.
Industrial validation: scaling‑law experiments on recommendation tasks have broken the DLRM performance ceiling; companies such as Meta, Meituan, Baidu, ByteDance, and Kuaishou have reported online gains.
2. Technical Evolution: From Modular to End‑to‑End Generative Architectures
2.1 LLM4Rec: Early Explorations
Early work explored three patterns:
LLM Embeddings + RS: use LLMs to generate item/user embeddings for a downstream recommendation system.
LLM Tokens + RS: the LLM produces token identifiers that are fed to the recommendation system.
LLM as RS: the LLM directly generates the recommendation list (still largely academic).
Most practical impact lies in offline preprocessing; the “LLM as RS” paradigm remains costly for production.
2.2 Online Generative Recommendation Paradigms
Recent online deployments fall into two categories:
Collaborate with or replace modules in traditional cascade pipelines (e.g., Google TIGER for recall, Meta GR for ranking).
End‑to‑end generation where a single model produces the final list, removing cascade inconsistencies (e.g., Kuaishou OneRec).
2.3 Core Technical Points
2.3.1 From Discriminative to Generative
Discriminative recommendation predicts a probability that a user likes an item from a predefined candidate set.
Generative recommendation uses a generative model to directly produce likely items without an explicit candidate pool.
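The contrast in interfaces can be sketched as follows; the `score_model` and `generator.decode` names are hypothetical placeholders, not APIs from any system mentioned here:

```python
# Discriminative: score every (user, candidate) pair from a predefined set, then sort.
def discriminative_recommend(user, candidates, score_model, k=10):
    scored = [(item, score_model(user, item)) for item in candidates]
    return [item for item, _ in sorted(scored, key=lambda x: x[1], reverse=True)[:k]]

# Generative: decode the recommendation list directly, no explicit candidate pool.
def generative_recommend(user_history, generator, k=10):
    return generator.decode(user_history, max_items=k)  # e.g., beam search over item/semantic IDs
```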
2.3.2 Semantic ID Compression
Semantic IDs compress an item vocabulary of billions of raw IDs into short code sequences drawn from codebooks with only a few thousand entries each, shrinking embedding tables and over‑fitting risk while enabling efficient autoregressive generation.
Two quantization methods are common (a minimal residual‑quantization sketch follows this list):
RQ‑VAE: a residual‑quantized VAE with multi‑layer codebooks.
RQ‑Kmeans: K‑means‑derived residual codebooks, without a VAE.
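The residual‑quantization idea behind RQ‑Kmeans fits in a few lines: each codebook level is fit on the residuals left by the previous level, so every item embedding reduces to a short coarse‑to‑fine tuple of code indices. The embedding source, dimensions, and codebook size below are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_residual_codebooks(embeddings: np.ndarray, levels: int = 3, codebook_size: int = 256):
    """Fit one K-means codebook per level on the residuals of the previous level."""
    codebooks, codes, residual = [], [], embeddings.copy()
    for _ in range(levels):
        km = KMeans(n_clusters=codebook_size, n_init=10).fit(residual)
        codebooks.append(km.cluster_centers_)
        codes.append(km.labels_)                          # one code index per item at this level
        residual = residual - km.cluster_centers_[km.labels_]
    return codebooks, np.stack(codes, axis=1)             # (num_items, levels) semantic IDs

item_embeddings = np.random.randn(10_000, 64).astype(np.float32)   # e.g., from an LLM/VLM encoder
codebooks, semantic_ids = fit_residual_codebooks(item_embeddings)
print(semantic_ids[0])   # e.g., [137  52 201] -> this item's coarse-to-fine semantic ID
```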
During inference, beam‑search generates sequences of semantic IDs, which are then mapped back to real items (with validity filtering to avoid hallucinations).
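The validity filtering can be implemented, for example, by keeping only beam hypotheses whose code prefix matches at least one real item. A minimal sketch, independent of any particular system's decoder:

```python
def build_prefix_index(semantic_ids):
    """Collect every valid prefix of every item's semantic-ID tuple."""
    valid_prefixes = set()
    for codes in semantic_ids:
        for i in range(1, len(codes) + 1):
            valid_prefixes.add(tuple(codes[:i]))
    return valid_prefixes

def filter_beams(beams, valid_prefixes):
    """Keep only beam hypotheses whose code prefix can still map to a real item."""
    return [(prefix, score) for prefix, score in beams if tuple(prefix) in valid_prefixes]

# Usage: after each beam-search step, drop hypotheses that cannot become a real item,
# preventing the model from "hallucinating" non-existent semantic-ID sequences.
```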
2.3.3 Sparse Features Remain Crucial
Pure generative models that drop DLRM's rich sparse features struggle to reproduce its accuracy. Incorporating all original DLRM features (as done in Meituan's MTGR) yields large online gains.
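A minimal sketch of the general idea of feeding sparse features alongside the behavior sequence; the feature names, dimensions, and concatenation‑based fusion are illustrative assumptions, not MTGR's actual design:

```python
import torch
import torch.nn as nn

class FusedInput(nn.Module):
    """Concatenate item-sequence embeddings with embeddings of DLRM-style sparse features."""
    def __init__(self, num_items: int, sparse_vocab_sizes: dict, dim: int = 64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        self.sparse_embs = nn.ModuleDict(
            {name: nn.Embedding(size, dim) for name, size in sparse_vocab_sizes.items()}
        )
        self.proj = nn.Linear(dim * (1 + len(sparse_vocab_sizes)), dim)

    def forward(self, item_ids: torch.Tensor, sparse_feats: dict) -> torch.Tensor:
        parts = [self.item_emb(item_ids)]
        for name, emb in self.sparse_embs.items():
            parts.append(emb(sparse_feats[name]))      # per-token sparse-feature embeddings
        return self.proj(torch.cat(parts, dim=-1))     # fused tokens fed to the generative model

fuser = FusedInput(1000, {"category": 50, "brand": 200})
tokens = fuser(torch.randint(0, 1000, (4, 20)),
               {"category": torch.randint(0, 50, (4, 20)),
                "brand": torch.randint(0, 200, (4, 20))})
```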
2.3.4 Encoder‑Decoder vs. Decoder‑Only
Industrial GRs favor encoder‑decoder architectures (e.g., Google TIGER, Kuaishou OneRec) for long‑sequence encoding, while decoder‑only LLMs excel at pure language modeling. Encoder‑decoder designs achieve lower computational complexity for cross‑attention between user interests and candidate items.
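A rough sketch of the encoder‑decoder argument: the long behavior sequence is encoded once (and can be cached), and a small set of decoder queries cross‑attends to it, so the per‑candidate or per‑step cost grows with sequence length times query count rather than with the square of their combined length. The shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim, user_len, num_queries = 64, 2000, 16
user_seq = torch.randn(1, user_len, dim)        # long embedded behavior sequence
queries = torch.randn(1, num_queries, dim)      # e.g., candidate / generation-step queries

encoder = nn.TransformerEncoderLayer(dim, 4, dim * 4, batch_first=True)
cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)

memory = encoder(user_seq)                       # O(user_len^2), computed once and reusable
fused, _ = cross_attn(queries, memory, memory)   # O(num_queries * user_len) per use
```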
3. Engineering Challenges
3.1 Model Evolution Drives Infra Upgrades
GRs combine sparse embedding handling from DLRM with dense generation from LLMs, creating unique resource and complexity demands.
3.2 Training Stack Transition
Moving from TensorFlow‑based DLRM to PyTorch‑based LLM stacks enables mixed‑precision, FlashAttention, and advanced parallelism, but requires building sparse‑embedding parameter servers, feature‑gate mechanisms, and native graph export for online inference.
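As one concrete example of what the PyTorch stack brings, a standard mixed‑precision training step; this is generic PyTorch usage on a CUDA device, not the actual training code of any system mentioned here:

```python
import torch

model = torch.nn.Linear(128, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():              # run the forward pass in mixed precision
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()                # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```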
3.3 Multi‑Stage Training & Reinforcement Learning
Training progresses from single‑stage (recall or ranking) to multi‑stage pipelines (pre‑training + fine‑tuning) and incorporates RL‑based reward optimization, e.g., GRPO (Group Relative Policy Optimization), for multi‑objective business goals.
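A minimal sketch of the group‑relative advantage computation at the core of GRPO‑style training: several recommendation lists are sampled per request, scored by a business reward, and each sample's advantage is its reward standardized within its own group. The reward values and normalization details are illustrative assumptions.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_requests, samples_per_request) business rewards for sampled lists.
    Each sample's advantage is its reward standardized within its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 requests, 4 sampled recommendation lists each, scored by a business reward.
rewards = torch.tensor([[0.1, 0.4, 0.2, 0.3],
                        [0.0, 0.0, 1.0, 0.5]])
advantages = group_relative_advantages(rewards)
# The advantages then weight the log-likelihood of each sampled list in the policy-gradient loss.
```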
3.4 Inference Performance Bottlenecks
Online latency must stay within hundreds of milliseconds at tens of thousands of QPS. Key optimizations include:
High‑performance kernels for self‑/cross‑attention.
Sequence representation compression to shorten the effective sequence length (a simple sketch follows this list).
End‑to‑end pipeline optimizations (CPU/GPU overlap, efficient beam search, early‑stop filtering).
Model architecture innovations (sparse activations, linear‑time attention).
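As an illustration of the sequence‑compression point above, one simple option is to mean‑pool older behavior into a handful of summary vectors and keep only recent interactions at full resolution; the split point and pooling choice are assumptions for illustration:

```python
import torch

def compress_history(seq_emb: torch.Tensor, keep_recent: int = 128, num_summaries: int = 8):
    """seq_emb: (seq_len, dim) embedded behavior sequence, oldest first.
    Mean-pool the older part into a few summary vectors; keep recent items as-is."""
    old, recent = seq_emb[:-keep_recent], seq_emb[-keep_recent:]
    if len(old) == 0:
        return recent
    chunks = torch.chunk(old, num_summaries)               # split old history into segments
    summaries = torch.stack([c.mean(dim=0) for c in chunks])
    return torch.cat([summaries, recent])                   # much shorter effective sequence

compressed = compress_history(torch.randn(5000, 64))
print(compressed.shape)    # torch.Size([136, 64]) instead of 5000 tokens
```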
4. Future Directions
From generation to deep reasoning: enable models to infer user intents beyond immediate item similarity.
Advanced reward mechanisms that capture long‑term satisfaction, diversity, and fairness.
True multimodal alignment of user behavior with text, image, and video.
Parallel generation techniques such as Multi‑Token Prediction (MTP) and diffusion‑based decoding.
Full‑stack end‑to‑end optimization across homepage, recommendation, checkout, and after‑sale stages.
5. Conclusion: A Technological Turning Point
Generative recommendation represents a cognitive leap for recommender systems, breaking performance ceilings, leveraging world knowledge to solve cold‑start, and eliminating cascade errors, thereby redefining the connection between people, products, and contexts for the next decade.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.
