How LLMs Are Transforming Long-Term Cross-Domain Interest Modeling for Recommendations
This DataFun Summit 2025 talk by JD algorithm engineer Tian Mingyang explains how generative AI is driving a paradigm shift in recommendation systems, covering the limits of traditional models, a new dynamic cross-domain inference chain technique, joint engineering-algorithm optimizations, and the challenges that remain before wider deployment.
Background
Large language models (LLMs) enable a paradigm shift in recommendation systems: instead of static matching, they generate multimodal content, perform cross‑domain intent reasoning, and support scenario‑aware inference.
Limitations of traditional modeling
User‑side bottlenecks
Long-tail interests are missed when a user's interest trajectory breaks, especially low-frequency but persistent ones.
Switching scenarios (e.g., from e-commerce browsing to search) breaks interest continuity, causing inaccurate recommendations.
Periodic repeat purchases (monthly or quarterly) are not recognized automatically, forcing users to search for them manually.
Algorithm‑side bottlenecks
Vanishing gradients in RNNs and the limited memory of gated units (LSTM/GRU) restrict dependency capture beyond roughly 100 steps.
Cold‑start and long‑tail prediction suffer from data sparsity and head‑dominant bias.
Attention weights lack semantic interpretability; cross‑domain interest mapping is fragmented.
Dynamic cross‑domain inference chain
LLMs are used to generate user portraits through a fast‑push chain, reducing information loss compared with traditional two‑stage SIM pipelines.
Architectural advantages
LLMs output natural‑language descriptions or dense embeddings directly, preserving semantic richness of low‑frequency behaviors.
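A minimal sketch of this dual output, assuming a generic chat-completion client and an off-the-shelf text encoder (the prompt wording and model choices are illustrative, not from the talk):

```python
# Sketch: turn a raw behavior sequence into a natural-language portrait plus a
# dense embedding. `call_llm` is a placeholder for any LLM inference client.
from sentence_transformers import SentenceTransformer  # real library; model choice is an assumption


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM client (e.g., an internal inference service)."""
    raise NotImplementedError


def build_portrait(behaviors: list[str]) -> dict:
    prompt = (
        "Summarize this user's long-term interests in two or three sentences, "
        "keeping low-frequency but persistent interests:\n" + "\n".join(behaviors)
    )
    portrait_text = call_llm(prompt)                     # natural-language portrait
    encoder = SentenceTransformer("all-MiniLM-L6-v2")    # any sentence encoder works here
    portrait_vec = encoder.encode(portrait_text)         # dense embedding for retrieval
    return {"text": portrait_text, "embedding": portrait_vec}
```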
Dynamic update efficiency
LoRA fine-tuning updates less than 1% of model parameters, enabling minute-level adaptation (e.g., down-weighting behaviors from three days ago or injecting holiday-promotion signals) without full retraining.
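A hedged sketch of such a LoRA setup with Hugging Face peft; the base model and hyperparameters are illustrative assumptions, not JD's configuration:

```python
# Sketch of the "<1% of parameters" point with peft; model and ranks are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B")  # any causal LM
lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```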
Explainability and cross‑domain capability
LLMs can produce textual explanations and multimodal embeddings (via CLIP/BLIP), aligning text, image and video features for seamless cross‑domain fusion.
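As an illustration of that shared space, here is a minimal CLIP sketch with Hugging Face transformers; the checkpoint and image path are assumptions, and BLIP-style encoders plug in the same way:

```python
# Sketch: embed a product title and its image into CLIP's shared space.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(
    text=["mirrorless camera with 50mm lens"],
    images=Image.open("item.jpg"),  # illustrative local file
    return_tensors="pt", padding=True,
)
outputs = model(**inputs)
text_vec = outputs.text_embeds    # (1, 512) normalized text embedding
image_vec = outputs.image_embeds  # (1, 512) image embedding in the same space
similarity = (text_vec @ image_vec.T).item()  # cosine similarity across modalities
```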
Modeling workflow
Data integration & feature engineering
Multimodal inputs (text, images, video) are aligned by contrastive learning (CLIP, BLIP-2) into a shared embedding space.
LLMs convert textual signals (product titles, comments) into high‑dimensional semantic vectors.
Chain‑of‑thought (CoT) prompts generate user portraits in three modes: unsupervised, optimal supervised, and sub‑optimal supervised.
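One plausible reading of the three prompting modes, sketched as templates (the wording and demonstrations are assumptions; the talk does not publish its prompts):

```python
# Sketch of the three CoT prompting modes for portrait generation.
BEHAVIOR_LOG = "bought camera strap; searched 'hiking boots'; browsed tripods ..."

# Unsupervised: reason step by step with no demonstrations.
UNSUPERVISED = (
    "Think step by step about what this user cares about, then output a short "
    f"interest portrait.\nBehavior log: {BEHAVIOR_LOG}"
)

# Supervised variants prepend demonstrations; "optimal" uses carefully curated
# examples, "sub-optimal" uses cheaper, noisier ones.
OPTIMAL_DEMO = "Behavior log: ... -> Portrait: outdoor photography enthusiast who hikes monthly"
SUBOPTIMAL_DEMO = "Behavior log: ... -> Portrait: likes cameras"

SUPERVISED = (
    "Follow the example. Think step by step, then output a portrait.\n"
    f"{OPTIMAL_DEMO}\nBehavior log: {BEHAVIOR_LOG}"
)
```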
Profile modeling & optimization
Behavior‑sequence modeling distinguishes short‑term (recent) and long‑term (high‑frequency) interests.
Domain-specific knowledge bases (e.g., ChineseEcomQA) are used to fine-tune the model under structured label constraints.
Human review corrects LLM bias (e.g., gender stereotypes) and builds hierarchical interest models.
Storage & expression
Key‑value structured portraits (e.g., {"interests": ["photography", "hiking"]}) enable efficient retrieval.
Natural‑language portraits provide narrative descriptions for downstream business logic.
Layered modeling separates short‑term, long‑term, and potential interests.
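A minimal sketch of such a layered, key-value portrait; the field names are illustrative assumptions:

```python
# Sketch of a layered key-value portrait plus a narrative description.
portrait = {
    "user_id": "u_12345",
    "short_term": ["wireless earbuds", "phone cases"],  # recent sessions
    "long_term": ["photography", "hiking"],             # high-frequency, stable interests
    "potential": ["drones"],                            # inferred, not yet acted on
    "narrative": "Outdoor photography enthusiast who replaces accessories quarterly.",
}

def is_interested(p: dict, topic: str) -> bool:
    # Key-value structure keeps downstream lookups cheap.
    return topic in p["short_term"] or topic in p["long_term"]
```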
Joint engineering‑algorithm optimization
To meet real-time latency budgets (under 50 ms) and high throughput, the system combines three layers (a minimal merge sketch follows below):
Offline layer: processes 180-day historical data and stores long-term embeddings in Redis.
Near-line layer: consumes Kafka streams for 15-minute incremental updates.
Online layer: merges offline and near-line outputs into the real-time portrait service.
Item tiering separates head items (real‑time updated embeddings) from tail items (static semantic IDs).
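A minimal sketch of the online merge, assuming the offline embedding lives in Redis and near-line increments arrive as parsed Kafka events; key names and the weighting rule are assumptions:

```python
# Sketch: merge the offline long-term embedding with near-line increments.
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)

def get_portrait(user_id: str, nearline_events: list[dict]) -> np.ndarray:
    # Offline layer: 180-day long-term embedding precomputed in batch.
    long_term = np.frombuffer(r.get(f"portrait:offline:{user_id}"), dtype=np.float32)

    # Near-line layer: 15-minute incremental embeddings consumed from Kafka.
    if not nearline_events:
        return long_term
    recent = np.mean(
        [np.asarray(e["embedding"], dtype=np.float32) for e in nearline_events], axis=0
    )

    # Online layer: simple convex combination; the real merge policy is not given in the talk.
    return 0.8 * long_term + 0.2 * recent
```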
Model optimizations
TokenID → SemanticID pipeline reduces retrieval cost and improves generation quality (a sketch follows this list).
Joint representation modeling combines user IDs (encoded by a GNN) and behavior IDs (aligned by contrastive learning) in a dual‑channel network.
Dynamic batching, KV‑cache paging, and kernel fusion lower memory and compute overhead.
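The talk does not spell out how SemanticIDs are built; one common recipe is residual quantization of item embeddings, sketched here with scikit-learn KMeans codebooks:

```python
# Sketch: compress item embeddings into short SemanticID tuples via residual quantization.
import numpy as np
from sklearn.cluster import KMeans

def build_semantic_ids(item_embs: np.ndarray, levels: int = 3, codebook: int = 256) -> np.ndarray:
    residual = item_embs.copy()
    codes = []
    for _ in range(levels):
        km = KMeans(n_clusters=codebook, n_init=10).fit(residual)
        codes.append(km.labels_)                                # one code per item at this level
        residual = residual - km.cluster_centers_[km.labels_]   # pass the quantization error down
    # Each item becomes a short tuple such as (17, 203, 5) instead of a raw TokenID.
    return np.stack(codes, axis=1)
```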
Model compression & inference
Layer‑wise distillation and mixed‑precision quantization (FP16 for attention, INT4 for feed‑forward) preserve accuracy.
Block‑wise quantization and quantization‑aware training (QAT) further reduce model size.
Head pruning removes attention heads with contribution < 0.01.
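A hedged sketch of the contribution-threshold rule; how "contribution" is measured is an assumption here (normalized mean activation per head), since the talk only states the 0.01 cutoff:

```python
# Sketch: score attention heads on a calibration batch and keep those above the threshold.
import torch

def heads_to_keep(per_head_outputs: torch.Tensor, threshold: float = 0.01) -> torch.Tensor:
    # per_head_outputs: (batch, heads, seq, head_dim), collected on calibration data.
    contribution = per_head_outputs.abs().mean(dim=(0, 2, 3))  # one raw score per head
    contribution = contribution / contribution.sum()           # normalize so scores sum to 1
    return contribution >= threshold                           # boolean mask; low heads get pruned
```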
Industry progress and outlook
Current trends favor modular upgrades: generative components are embedded into cascade architectures to replace specific modules, while end-to-end generative pipelines remain experimental.
Key future challenges:
Compute and memory bottlenecks for ultra-long sequences (O(n²) attention, >80 GB VRAM); a back-of-envelope estimate follows this list.
Position‑encoding limits and context fragmentation for multi‑month histories.
Ethical and privacy compliance (data protection laws, homomorphic encryption, transparency).
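A back-of-envelope estimate of why naively materialized attention over multi-month sequences exceeds an 80 GB card; the sequence length, head count, and layer count are illustrative assumptions:

```python
# Rough memory estimate for materialized attention scores in FP16.
seq_len, heads, layers, bytes_fp16 = 65_536, 32, 32, 2

score_matrix = seq_len ** 2 * bytes_fp16    # one head, one layer: an n x n score matrix (8 GiB)
per_layer = score_matrix * heads            # ~256 GiB per layer
total_gib = per_layer * layers / 1024 ** 3  # ~8,192 GiB across 32 layers
print(f"{total_gib:,.0f} GiB just for attention scores")
```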
Promising directions include:
Sparse attention + state-space models (e.g., Mamba) for linear-time processing (a mask sketch follows this list).
Mixture‑of‑Experts (MoE) routing to activate a subset of experts.
Dynamic position interpolation (YaRN) to extend context windows.
External memory banks for long‑term behavior retrieval.
Lightweight multimodal inference on edge devices.
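As one concrete instance of the sparse-attention direction, a sliding-window (local) attention mask can be sketched as follows; the window size is an illustrative choice:

```python
# Sketch: causal sliding-window mask so each position attends to at most `window` neighbors.
import torch

def sliding_window_mask(seq_len: int, window: int = 512) -> torch.Tensor:
    idx = torch.arange(seq_len)
    dist = idx[:, None] - idx[None, :]       # dist[i, j] = i - j
    # Keep only causal, local pairs; real kernels avoid materializing the full n x n mask,
    # so compute grows as O(n * window) rather than O(n^2).
    return (dist >= 0) & (dist < window)
```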
Conclusion
LLM‑driven dynamic cross‑domain reasoning, together with joint engineering‑algorithm optimizations, can overcome the long‑term and cross‑domain bottlenecks of traditional recommendation pipelines. Remaining hurdles are scalability of attention, memory consumption, and privacy guarantees.