How Large Language Models Are Revolutionizing Generative Recommendation Systems
Over the past year, generative recommendation has made substantial progress by leveraging large language models' powerful sequence modeling and reasoning abilities, introducing a new paradigm that replaces complex handcrafted features and addresses traditional recommendation bottlenecks. This article outlines the evolution, core technologies, engineering challenges, and future directions of LLM‑based recommendation systems.
Generative Recommendations Powered by Large Language Models
In the last year, generative recommendation (GR) has achieved significant advances, especially by using the strong sequence‑modeling and reasoning capabilities of large language models (LLMs) to improve overall recommendation performance. GRs form a new paradigm distinct from discriminative recommendation, offering the potential to replace traditional systems that rely heavily on handcrafted features.
1. Introduction: Traditional Recommendation Challenges and the LLM Breakthrough
Recommendation systems have evolved through three technical paradigms: Machine‑Learning‑based Recommendation (MLR), Deep‑Learning‑based Recommendation (DLR), and Generative Recommendations (GRs).
1.1 Bottlenecks of Traditional Paradigms
MLR depends on explicit feature engineering and collaborative or content‑based filtering.
DLR uses deep neural networks to automatically learn complex non‑linear representations from raw or sparse features, but model complexity has reached diminishing returns.
Figure 1 shows the increasing complexity of DLRM models, from early DWE to DIN and SIM, highlighting the “model‑size vs. marginal gain” problem.
Key issues include:
Feature‑engineering dependence: mature feature sources have been largely mined out, and handcrafted features are increasingly costly to build and generalize poorly.
Model‑engineering ceiling: current architectures cannot effectively model world knowledge or reason about user intent across modalities.
Cascade‑architecture error amplification: multi‑stage pipelines (recall → coarse ranking → ranking → re‑ranking) split objectives across teams, causing goal fragmentation and error propagation.
Resource waste in the cascade: inter‑stage communication and caching account for more than 50% of online serving resources.
Low GPU utilization: traditional dense recommendation models achieve only 4.6% (training) and 11.2% (inference) model FLOPs utilization (MFU), while LLMs on H100 GPUs can reach 40–50% MFU (see the quick calculation below).
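As a sanity check on what MFU means here, a back‑of‑the‑envelope calculation; the throughput and FLOP figures below are illustrative assumptions, not numbers from this article:

```python
# Back-of-the-envelope MFU (model FLOPs utilization) calculation.
# All numbers below are illustrative assumptions.

def mfu(tokens_per_second: float, flops_per_token: float, peak_flops: float) -> float:
    """MFU = achieved FLOPs per second / hardware peak FLOPs per second."""
    return tokens_per_second * flops_per_token / peak_flops

# Example: a model that costs ~2 GFLOPs per token, serving 50k tokens/s
# on a GPU with ~1 PFLOP/s of peak dense throughput.
print(f"MFU: {mfu(5e4, 2e9, 1e15):.1%}")   # -> 10.0%
```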
1.2 Disruptive Potential of LLMs
LLMs and vision‑language models (VLMs) have achieved breakthroughs driven by scaling laws and advances in reinforcement learning. Their chain‑of‑thought reasoning enables a paradigm shift in recommendation:
Long‑sequence modeling: treat user behavior as a time series and capture deep dependencies via autoregressive prediction (a minimal sketch appears at the end of this subsection).
World‑knowledge injection: pretrained LLM/VLM embeddings carry cross‑domain, multimodal knowledge, alleviating cold‑start for new users and items.
End‑to‑end generation: a single model directly outputs a ranked list, eliminating cascade errors.
The shift is from “predicting similarity” to “reasoning user needs”.
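To make the autoregressive view concrete, here is a minimal sketch: a small causal Transformer trained to predict the next item ID from the interaction history. The dimensions and the toy data are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class NextItemModel(nn.Module):
    """Toy causal Transformer that predicts the next item in a behavior sequence."""
    def __init__(self, num_items: int, dim: int = 64, heads: int = 4, layers: int = 2):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, num_items)

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        seq_len = item_ids.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.encoder(self.item_emb(item_ids), mask=causal_mask)
        return self.head(h)  # logits over the item vocabulary at every position

# Train on shifted sequences: position t predicts the item at position t + 1.
model = NextItemModel(num_items=1000)
seqs = torch.randint(0, 1000, (8, 20))          # toy batch of behavior sequences
logits = model(seqs[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), seqs[:, 1:].reshape(-1))
loss.backward()
```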
1.3 Why Now?
GRs are exploding in 2025 because LLM maturity aligns with industrial recommendation demands. Key drivers:
LLM ecosystem maturity: improved training (distributed data/model/pipeline parallelism, mixed precision, SFT, RLHF) and inference (FlashAttention, continuous batching) reduce cost and latency.
Industrial validation: scaling‑law experiments on recommendation tasks have broken the DLRM performance ceiling; companies such as Meta, Meituan, Baidu, ByteDance, and Kuaishou have reported online gains.
2. Technical Evolution: From Modular to End‑to‑End Generative Architectures
2.1 LLM4Rec: Early Explorations
Early work explored three patterns:
LLM Embeddings + RS: use LLMs to generate item/user embeddings for a downstream recommendation system.
LLM Tokens + RS: the LLM produces token identifiers that are fed to the recommendation system.
LLM as RS: the LLM directly generates the recommendation list (still largely academic).
Most practical impact lies in offline preprocessing; the “LLM as RS” paradigm remains costly for production.
2.2 Online Generative Recommendation Paradigms
Recent online deployments fall into two categories:
Collaborate with or replace modules in traditional cascade pipelines (e.g., Google TIGER for recall, Meta GR for ranking).
End‑to‑end generation where a single model produces the final list, removing cascade inconsistencies (e.g., Kuaishou OneRec).
2.3 Core Technical Points
2.3.1 From Discriminative to Generative
Discriminative recommendation predicts a probability that a user likes an item from a predefined candidate set.
Generative recommendation uses a generative model to directly produce likely items without an explicit candidate pool.
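The contrast in interfaces can be sketched as follows; the `score_model` and `generator.decode` names are hypothetical placeholders, not APIs from any system mentioned here:

```python
# Discriminative: score every (user, candidate) pair from a predefined set, then sort.
def discriminative_recommend(user, candidates, score_model, k=10):
    scored = [(item, score_model(user, item)) for item in candidates]
    return [item for item, _ in sorted(scored, key=lambda x: x[1], reverse=True)[:k]]

# Generative: decode the recommendation list directly, no explicit candidate pool.
def generative_recommend(user_history, generator, k=10):
    return generator.decode(user_history, max_items=k)  # e.g., beam search over item/semantic IDs
```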
2.3.2 Semantic ID Compression
Semantic IDs compress an item vocabulary of billions of raw IDs into short code sequences drawn from codebooks with only a few thousand entries each, shrinking embedding tables and over‑fitting risk while enabling efficient autoregressive generation.
Two quantization methods are common (a minimal residual‑quantization sketch follows this list):
RQ‑VAE: a residual‑quantized VAE with multi‑layer codebooks.
RQ‑Kmeans: K‑means‑derived residual codebooks, without a VAE.
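The residual‑quantization idea behind RQ‑Kmeans fits in a few lines: each codebook level is fit on the residuals left by the previous level, so every item embedding reduces to a short coarse‑to‑fine tuple of code indices. The embedding source, dimensions, and codebook size below are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_residual_codebooks(embeddings: np.ndarray, levels: int = 3, codebook_size: int = 256):
    """Fit one K-means codebook per level on the residuals of the previous level."""
    codebooks, codes, residual = [], [], embeddings.copy()
    for _ in range(levels):
        km = KMeans(n_clusters=codebook_size, n_init=10).fit(residual)
        codebooks.append(km.cluster_centers_)
        codes.append(km.labels_)                          # one code index per item at this level
        residual = residual - km.cluster_centers_[km.labels_]
    return codebooks, np.stack(codes, axis=1)             # (num_items, levels) semantic IDs

item_embeddings = np.random.randn(10_000, 64).astype(np.float32)   # e.g., from an LLM/VLM encoder
codebooks, semantic_ids = fit_residual_codebooks(item_embeddings)
print(semantic_ids[0])   # e.g., [137  52 201] -> this item's coarse-to-fine semantic ID
```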
During inference, beam‑search generates sequences of semantic IDs, which are then mapped back to real items (with validity filtering to avoid hallucinations).
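The validity filtering can be implemented, for example, by keeping only beam hypotheses whose code prefix matches at least one real item. A minimal sketch, independent of any particular system's decoder:

```python
def build_prefix_index(semantic_ids):
    """Collect every valid prefix of every item's semantic-ID tuple."""
    valid_prefixes = set()
    for codes in semantic_ids:
        for i in range(1, len(codes) + 1):
            valid_prefixes.add(tuple(codes[:i]))
    return valid_prefixes

def filter_beams(beams, valid_prefixes):
    """Keep only beam hypotheses whose code prefix can still map to a real item."""
    return [(prefix, score) for prefix, score in beams if tuple(prefix) in valid_prefixes]

# Usage: after each beam-search step, drop hypotheses that cannot become a real item,
# preventing the model from "hallucinating" non-existent semantic-ID sequences.
```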
2.3.3 Sparse Features Remain Crucial
Pure generative models that drop DLRM's rich sparse features struggle to reproduce its accuracy. Incorporating all original DLRM features (as done in Meituan's MTGR) yields large online gains.
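A minimal sketch of the general idea of feeding sparse features alongside the behavior sequence; the feature names, dimensions, and concatenation‑based fusion are illustrative assumptions, not MTGR's actual design:

```python
import torch
import torch.nn as nn

class FusedInput(nn.Module):
    """Concatenate item-sequence embeddings with embeddings of DLRM-style sparse features."""
    def __init__(self, num_items: int, sparse_vocab_sizes: dict, dim: int = 64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        self.sparse_embs = nn.ModuleDict(
            {name: nn.Embedding(size, dim) for name, size in sparse_vocab_sizes.items()}
        )
        self.proj = nn.Linear(dim * (1 + len(sparse_vocab_sizes)), dim)

    def forward(self, item_ids: torch.Tensor, sparse_feats: dict) -> torch.Tensor:
        parts = [self.item_emb(item_ids)]
        for name, emb in self.sparse_embs.items():
            parts.append(emb(sparse_feats[name]))      # per-token sparse-feature embeddings
        return self.proj(torch.cat(parts, dim=-1))     # fused tokens fed to the generative model

fuser = FusedInput(1000, {"category": 50, "brand": 200})
tokens = fuser(torch.randint(0, 1000, (4, 20)),
               {"category": torch.randint(0, 50, (4, 20)),
                "brand": torch.randint(0, 200, (4, 20))})
```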
2.3.4 Encoder‑Decoder vs. Decoder‑Only
Industrial GRs favor encoder‑decoder architectures (e.g., Google TIGER, Kuaishou OneRec) for long‑sequence encoding, while decoder‑only LLMs excel at pure language modeling. Encoder‑decoder designs achieve lower computational complexity for cross‑attention between user interests and candidate items.
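A rough sketch of the encoder‑decoder argument: the long behavior sequence is encoded once (and can be cached), and a small set of decoder queries cross‑attends to it, so the per‑candidate or per‑step cost grows with sequence length times query count rather than with the square of their combined length. The shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim, user_len, num_queries = 64, 2000, 16
user_seq = torch.randn(1, user_len, dim)        # long embedded behavior sequence
queries = torch.randn(1, num_queries, dim)      # e.g., candidate / generation-step queries

encoder = nn.TransformerEncoderLayer(dim, 4, dim * 4, batch_first=True)
cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)

memory = encoder(user_seq)                       # O(user_len^2), computed once and reusable
fused, _ = cross_attn(queries, memory, memory)   # O(num_queries * user_len) per use
```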
3. Engineering Challenges
3.1 Model Evolution Drives Infra Upgrades
GRs combine sparse embedding handling from DLRM with dense generation from LLMs, creating unique resource and complexity demands.
3.2 Training Stack Transition
Moving from TensorFlow‑based DLRM to PyTorch‑based LLM stacks enables mixed‑precision, FlashAttention, and advanced parallelism, but requires building sparse‑embedding parameter servers, feature‑gate mechanisms, and native graph export for online inference.
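As one concrete example of what the PyTorch stack brings, a standard mixed‑precision training step; this is generic PyTorch usage on a CUDA device, not the actual training code of any system mentioned here:

```python
import torch

model = torch.nn.Linear(128, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():              # run the forward pass in mixed precision
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()                # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```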
3.3 Multi‑Stage Training & Reinforcement Learning
Training progresses from single‑stage (recall or ranking) to multi‑stage pipelines (pre‑training + fine‑tuning) and incorporates RL‑based reward optimization, e.g., GRPO (Group Relative Policy Optimization), for multi‑objective business goals.
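A minimal sketch of the group‑relative advantage computation at the core of GRPO‑style training: several recommendation lists are sampled per request, scored by a business reward, and each sample's advantage is its reward standardized within its own group. The reward values and normalization details are illustrative assumptions.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_requests, samples_per_request) business rewards for sampled lists.
    Each sample's advantage is its reward standardized within its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 requests, 4 sampled recommendation lists each, scored by a business reward.
rewards = torch.tensor([[0.1, 0.4, 0.2, 0.3],
                        [0.0, 0.0, 1.0, 0.5]])
advantages = group_relative_advantages(rewards)
# The advantages then weight the log-likelihood of each sampled list in the policy-gradient loss.
```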
3.4 Inference Performance Bottlenecks
Online latency must stay within hundreds of milliseconds at tens of thousands of QPS. Key optimizations include:
High‑performance kernels for self‑/cross‑attention.
Sequence representation compression to shorten the effective sequence length (a simple sketch follows this list).
End‑to‑end pipeline optimizations (CPU/GPU overlap, efficient beam search, early‑stop filtering).
Model architecture innovations (sparse activations, linear‑time attention).
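As an illustration of the sequence‑compression point above, one simple option is to mean‑pool older behavior into a handful of summary vectors and keep only recent interactions at full resolution; the split point and pooling choice are assumptions for illustration:

```python
import torch

def compress_history(seq_emb: torch.Tensor, keep_recent: int = 128, num_summaries: int = 8):
    """seq_emb: (seq_len, dim) embedded behavior sequence, oldest first.
    Mean-pool the older part into a few summary vectors; keep recent items as-is."""
    old, recent = seq_emb[:-keep_recent], seq_emb[-keep_recent:]
    if len(old) == 0:
        return recent
    chunks = torch.chunk(old, num_summaries)               # split old history into segments
    summaries = torch.stack([c.mean(dim=0) for c in chunks])
    return torch.cat([summaries, recent])                   # much shorter effective sequence

compressed = compress_history(torch.randn(5000, 64))
print(compressed.shape)    # torch.Size([136, 64]) instead of 5000 tokens
```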
4. Future Directions
From generation to deep reasoning: enable models to infer user intents beyond immediate item similarity.
Advanced reward mechanisms that capture long‑term satisfaction, diversity, and fairness.
True multimodal alignment of user behavior with text, image, and video.
Parallel generation techniques such as Multi‑Token Prediction (MTP) and diffusion‑based decoding.
Full‑stack end‑to‑end optimization across homepage, recommendation, checkout, and after‑sale stages.
5. Conclusion: A Technological Turning Point
Generative recommendation represents a cognitive leap for recommender systems, breaking performance ceilings, leveraging world knowledge to solve cold‑start, and eliminating cascade errors, thereby redefining the connection between people, products, and contexts for the next decade.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.
