How Agentic RAG, LLM‑Powered Recommendation, and Generative Ranking Are Redefining AI Search
This article reviews three cutting‑edge AI search and recommendation techniques—Alibaba Cloud's Agentic RAG architecture, Huawei Noah's LLM‑enhanced recommendation pipeline, and Baidu's GRAB generative ranking model—detailing their design challenges, multi‑modal retrieval strategies, performance gains, and real‑world deployment results.
Agentic Retrieval‑Augmented Generation (RAG) in Alibaba Cloud AI Search
Alibaba Cloud's AI Search system addresses high-concurrency workloads, multimodal data, and multi-hop query scenarios by evolving from a single-agent pipeline into a coordinated multi-agent architecture. The workflow consists of three logical stages:
Planning Agent parses the user request, determines the required modalities (text, vector, relational, graph) and generates an execution plan.
Retrieval Agent executes a multi‑path retrieval chain. It simultaneously queries:
Dense vector indexes (GPU‑accelerated FAISS/HNSW) for semantic similarity.
Traditional inverted text indexes for exact keyword matching.
Relational databases (SQL) for structured attribute lookup.
Graph databases (e.g., Neo4j) for relationship‑driven hops.
The results are merged using a relevance‑aware ranker that balances coverage and precision.
Generation Agent receives the retrieved context and produces the final answer with a large language model (LLM). Extensions such as NL2SQL and multimodal search are implemented as plug‑in modules that translate natural language into SQL or image‑text queries before the retrieval step.
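The end-to-end flow can be sketched roughly as below. All class names, the routing heuristic, and the stub backends are assumptions for illustration only; the actual Alibaba Cloud components are not public.

```python
from dataclasses import dataclass

# Illustrative sketch of the Planning -> Retrieval -> Generation flow.
# Class names and signatures are assumptions, not Alibaba Cloud's actual APIs.

@dataclass
class ExecutionPlan:
    modalities: list   # e.g. ["vector", "keyword", "graph", "sql"]
    query: str

class PlanningAgent:
    def plan(self, query: str) -> ExecutionPlan:
        # In the real system an LLM chooses the retrieval paths; a heuristic stands in here.
        modalities = ["vector", "keyword"]
        if "related to" in query:
            modalities.append("graph")    # multi-hop questions also touch the graph index
        return ExecutionPlan(modalities=modalities, query=query)

class RetrievalAgent:
    def __init__(self, backends: dict):
        self.backends = backends          # modality -> callable(query) -> list[dict]

    def retrieve(self, plan: ExecutionPlan, top_k: int = 20) -> list:
        candidates = []
        for modality in plan.modalities:  # multi-path retrieval, executed per plan
            candidates.extend(self.backends[modality](plan.query))
        # Relevance-aware merge: a simple score sort stands in for the production ranker.
        return sorted(candidates, key=lambda d: d["score"], reverse=True)[:top_k]

class GenerationAgent:
    def __init__(self, llm):
        self.llm = llm                    # callable(prompt) -> str

    def generate(self, query: str, context: list) -> str:
        prompt = "Context:\n" + "\n".join(d["text"] for d in context) + f"\n\nQuestion: {query}"
        return self.llm(prompt)

# Wiring the pipeline with stub backends and a stub LLM so the example runs end to end.
backends = {
    "vector":  lambda q: [{"text": "dense hit",   "score": 0.92}],
    "keyword": lambda q: [{"text": "keyword hit", "score": 0.75}],
    "graph":   lambda q: [{"text": "graph hop",   "score": 0.80}],
}
plan = PlanningAgent().plan("papers related to agentic RAG")
docs = RetrievalAgent(backends).retrieve(plan)
print(GenerationAgent(lambda p: "stub answer for: " + p[:40]).generate(plan.query, docs))
```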
Key engineering optimizations include:
Custom indexing engine that quantizes vectors on GPU, reducing query latency by up to 40% compared with CPU‑only pipelines.
KV‑Cache reuse across consecutive queries to support thousands of QPS with sub‑millisecond response times.
Dynamic routing logic that selects the most cost‑effective retrieval path based on query complexity.
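A hedged sketch of what such complexity-based routing could look like; the thresholds and keyword cues below are illustrative assumptions, not the production logic.

```python
# Dynamic retrieval-path routing by query complexity (illustrative assumptions only).

def route(query: str) -> list:
    """Pick the cheapest set of retrieval paths that can still answer the query."""
    tokens = query.split()
    paths = ["keyword"]                       # inverted index is always the cheapest path
    if len(tokens) > 4:                       # longer queries usually need semantic recall
        paths.append("vector")
    if any(w in query.lower() for w in ("compare", "between", "related to")):
        paths.append("graph")                 # relationship hops for multi-entity questions
    if any(ch.isdigit() for ch in query):
        paths.append("sql")                   # numeric/attribute filters go to SQL
    return paths

print(route("laptops under 1000 dollars"))                            # ['keyword', 'sql']
print(route("how are transformers related to attention mechanisms"))  # ['keyword', 'vector', 'graph']
```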
LLM‑Driven Recommendation Evolution (Huawei Noah – KAR Project)
The KAR project demonstrates how large language models can be integrated into recommendation pipelines to overcome three classic challenges:
Noise in implicit feedback.
Insufficient semantic understanding of user intents.
Difficulty extracting fine‑grained intent from short interactions.
Technical solution:
Factorized Prompting: A user interaction is decomposed into multiple semantic facets (e.g., intent, context, constraints). Each facet is fed to the LLM with a dedicated prompt template, producing dense embeddings that capture nuanced meaning.
Multi-Expert Knowledge Adapter: Several expert networks (text, item, and context experts) transform the LLM outputs into a unified embedding space. The adapter balances embedding dimensionality (typically 128-256) against real-time latency constraints (<10 ms per request).
Integration with Existing Ranking: The LLM-enhanced embeddings are concatenated with traditional collaborative-filtering features and fed to a downstream ranking model (e.g., DeepFM).
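A rough PyTorch sketch of the adapter-plus-integration idea; the dimensions, gating scheme, and module names are assumptions rather than KAR's published configuration.

```python
import torch
import torch.nn as nn

# Each LLM-produced "facet" embedding (intent, context, constraints, ...) passes through
# its own small expert MLP, the outputs are fused by a gate, and the result is concatenated
# with conventional collaborative-filtering features before the downstream ranker.

class KnowledgeAdapter(nn.Module):
    def __init__(self, llm_dim=4096, out_dim=128, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(llm_dim, 512), nn.ReLU(), nn.Linear(512, out_dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(llm_dim * num_experts, num_experts)

    def forward(self, facet_embs):                 # facet_embs: (batch, num_experts, llm_dim)
        expert_out = torch.stack(
            [exp(facet_embs[:, i]) for i, exp in enumerate(self.experts)], dim=1
        )                                          # (batch, num_experts, out_dim)
        weights = torch.softmax(self.gate(facet_embs.flatten(1)), dim=-1)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)   # (batch, out_dim)

# Usage: fuse LLM facet embeddings, then concatenate with CF features for the ranker.
adapter = KnowledgeAdapter()
facets = torch.randn(8, 3, 4096)                  # e.g. intent / context / constraint facets
cf_features = torch.randn(8, 64)                  # existing collaborative-filtering features
ranker_input = torch.cat([adapter(facets), cf_features], dim=-1)  # feed to e.g. DeepFM
print(ranker_input.shape)                         # torch.Size([8, 192])
```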
Experimental results on production traffic show a 1.5% absolute AUC improvement and a statistically significant lift in click-through rate in online A/B testing.
Additional considerations:
Prompt engineering strategies include few‑shot examples and domain‑specific templates to steer the LLM toward recommendation‑relevant semantics.
Fine-tuning is performed on a curated dataset of user-item interactions using LoRA adapters to keep parameter overhead low (see the sketch after this list).
Future work targets cross‑platform knowledge sharing via a unified agent that can invoke multiple LLMs for different domains.
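For the fine-tuning point above, a minimal LoRA setup with the Hugging Face peft library might look as follows; the backbone model, rank, and target modules are placeholders, since the article does not specify KAR's actual recipe.

```python
# Minimal LoRA fine-tuning sketch; model name and hyperparameters are assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B")   # placeholder backbone
lora_cfg = LoraConfig(
    r=8,                                   # low rank keeps trainable parameters small
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # typically well under 1% of the base model
# ...then train on the curated user-item interaction corpus with a standard training loop.
```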
GRAB: Generative Ranking for Ads at Baidu
GRAB replaces conventional DLRM‑based ranking with an end‑to‑end generative sequence model that jointly encodes user behavior and candidate ads.
Core architecture:
Transformer encoder processes a concatenated token sequence: [USER_BEHAVIOR] || [AD_FEATURES]. Heterogeneous token types (categorical IDs, dense embeddings, textual descriptions) are embedded into a shared space.
Q-Aware RAB Causal Attention: A custom attention mask adds a query-aware relative bias, allowing the model to weight interactions based on temporal distance and query relevance.
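The exact Q-Aware RAB formulation is not spelled out here, but the general idea of adding a learned, query-aware relative bias to a causal attention mask can be sketched as follows; the bias construction is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def q_aware_causal_attention(q, k, v, rel_bias):
    """
    q, k, v:   (batch, heads, seq, dim)
    rel_bias:  (heads, seq, seq) learned bias, e.g. derived from relative temporal
               distance and query-relevance buckets (an assumption for illustration).
    """
    seq = q.size(-2)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5    # scaled dot-product
    scores = scores + rel_bias                              # inject the relative bias
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))      # keep attention causal
    return F.softmax(scores, dim=-1) @ v

# Tiny smoke test with random tensors.
b, h, s, d = 2, 4, 16, 32
q = torch.randn(b, h, s, d); k = torch.randn(b, h, s, d); v = torch.randn(b, h, s, d)
bias = torch.randn(h, s, s)
print(q_aware_causal_attention(q, k, v, bias).shape)        # torch.Size([2, 4, 16, 32])
```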
Training methodology:
STS Two-Stage Training: Stage 1 pre-trains on large-scale synthetic click data with a language-model objective; Stage 2 fine-tunes on real ad-click logs with a ranking loss.
Dual-Loss Stacking: Combines a generative loss (next-token prediction) with a pairwise ranking loss to improve both relevance and calibration.
Heterogeneous Token Representation: Categorical features are projected via learned embeddings, while continuous features are discretized into token buckets.
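A minimal sketch of dual-loss stacking, assuming a logistic pairwise loss on clicked vs. sampled non-clicked ads and a fixed mixing weight; the exact losses and weighting used by GRAB are not given here.

```python
import torch
import torch.nn.functional as F

def dual_loss(token_logits, token_targets, pos_scores, neg_scores, alpha=0.5):
    """
    token_logits:  (batch, seq, vocab)  LM-head outputs over the behavior/ad sequence
    token_targets: (batch, seq)         next-token targets
    pos_scores:    (batch,)             model scores for clicked ads
    neg_scores:    (batch,)             model scores for sampled non-clicked ads
    """
    gen_loss = F.cross_entropy(                       # generative next-token loss
        token_logits.reshape(-1, token_logits.size(-1)), token_targets.reshape(-1)
    )
    rank_loss = F.softplus(-(pos_scores - neg_scores)).mean()   # pairwise logistic loss
    return alpha * gen_loss + (1 - alpha) * rank_loss

# Example call with random tensors.
logits = torch.randn(4, 10, 1000)
targets = torch.randint(0, 1000, (4, 10))
print(dual_loss(logits, targets, torch.randn(4), torch.randn(4)).item())
```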
Inference optimizations:
The KV-cache is persisted across candidate ads within a single user session, enabling high-throughput serving (>100k QPS) at <5 ms latency (a sketch of this reuse pattern follows below).
Model size is kept at ~300 M parameters to fit GPU memory budgets while preserving the scaling benefits of LLM‑style Transformers.
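The prefix-cache reuse pattern can be illustrated with the Hugging Face transformers interface as a stand-in for GRAB's serving stack; the model, scoring head, and per-candidate cache copy are assumptions made so the example stays runnable.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: encode the long user-behavior prefix once, then score every candidate ad by
# extending a copy of the cached prefix instead of re-encoding the prefix per candidate.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

with torch.no_grad():
    prefix = tok("user clicked: shoes, laptop, headphones ->", return_tensors="pt")
    prefix_cache = model(**prefix, use_cache=True).past_key_values

    for ad in ["running shoes", "gaming laptop"]:
        ad_ids = tok(" " + ad, return_tensors="pt").input_ids
        # Copy so each candidate extends only the shared user prefix, not a prior candidate.
        out = model(input_ids=ad_ids, past_key_values=copy.deepcopy(prefix_cache), use_cache=True)
        score = out.logits[:, -1].max().item()     # placeholder relevance score
        print(ad, round(score, 3))
```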
Deployment results indicate a measurable increase in revenue per mille (RPM) and a reduction in feature engineering effort, as the model directly learns from raw interaction sequences.