Designing Next‑Gen Recommendation and Search with Agentic RAG Architecture
The article reviews cutting‑edge AI techniques for high‑concurrency, multimodal recommendation and search, detailing Alibaba Cloud's Agentic RAG evolution, Huawei Noah's LLM‑enhanced recommendation pipeline, and Baidu's generative ranking model GRAB, each with architecture diagrams, performance metrics, and real‑world deployment insights.
The first part is based on a technical talk by Xing Shaomin, head of Alibaba Cloud AI Search, and systematically explains how Alibaba Cloud AI Search handles high concurrency, multimodal data, and complex multi‑hop queries. It outlines the evolution of the Agentic RAG architecture from a single‑agent design to a multi‑agent system, describing how planning, retrieval, and generation modules cooperate to understand and respond to complex intents. The talk details the multi‑path retrieval chain, which mixes vector, text, database, and graph recalls to improve coverage and accuracy, and reports quantitative GPU‑acceleration gains for the indexing and query stages.
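As a rough illustration of how results from several recall paths can be merged into one ranking, the sketch below uses reciprocal rank fusion (RRF). The summary does not say which fusion method Alibaba Cloud AI Search uses, so RRF, the function name, the constant k=60, and the document IDs are all assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    ranked_lists maps a recall-path name to its ranked document IDs.
    RRF is an assumed fusion method here, not one named by the source.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists.values():
        for rank, doc_id in enumerate(ranking, start=1):
            # A document gains more score the higher it ranks in each path.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by ID for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical recall results from the four paths mentioned in the text.
recalls = {
    "vector": ["d3", "d1", "d7"],
    "text": ["d1", "d4", "d3"],
    "database": ["d9"],
    "graph": ["d1", "d9"],
}
fused = reciprocal_rank_fusion(recalls)
print(fused)
```

In this toy input, d1 is returned by three of the four paths and therefore ranks first, which is the coverage benefit that multi‑path recall aims for.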
Next, the article traces the progression of recommendation systems from the deep learning era to the large language model (LLM) and AI‑agent eras, focusing on challenges such as noisy implicit feedback, limited semantic understanding, and the difficulty of mining user intent. It compares list‑wise and conversational recommendation paradigms and presents Huawei Noah’s KAR project as a case study. The solution uses factorized prompting and a multi‑expert knowledge adapter to map semantic knowledge into the recommendation embedding space, balancing text feature dimensionality against real‑time requirements. Reported results include a 1.5% AUC lift and online A/B‑test data.
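The idea of a multi‑expert adapter that projects a high‑dimensional text embedding into a smaller recommendation space can be sketched as a gated mixture of linear experts. This is a minimal sketch under stated assumptions, not KAR's implementation: the class name, the dimensions, the ReLU activation, the softmax gate, and the random weights are all hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(vec, mat):
    """Multiply a length-d vector by a d x r matrix stored as a list of rows."""
    return [sum(vec[i] * mat[i][c] for i in range(len(vec)))
            for c in range(len(mat[0]))]


class MultiExpertAdapter:
    """Hypothetical gated mixture of experts; weights are random, not trained."""

    def __init__(self, text_dim, rec_dim, num_experts, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.experts = [init(text_dim, rec_dim) for _ in range(num_experts)]
        self.gate = init(text_dim, num_experts)

    def __call__(self, text_emb):
        # The gate decides how much each expert contributes for this input.
        gates = softmax(matvec(text_emb, self.gate))
        # Each expert reduces the text embedding to rec_dim, followed by ReLU.
        expert_outs = [[max(0.0, x) for x in matvec(text_emb, w)]
                       for w in self.experts]
        rec_dim = len(expert_outs[0])
        return [sum(g * out[c] for g, out in zip(gates, expert_outs))
                for c in range(rec_dim)]


adapter = MultiExpertAdapter(text_dim=16, rec_dim=4, num_experts=3)
text_embedding = [0.1 * i for i in range(16)]  # stand-in for an LLM text embedding
rec_vector = adapter(text_embedding)
print(len(rec_vector))
```

The reduced vector is small enough to serve as an extra feature in a conventional ranking model. If the text embeddings are computed offline, which this sketch assumes, only the lightweight adapter has to run at serving time.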
The final case study examines Baidu’s GRAB (Generative Ranking for Ads) model, which replaces traditional DLRM pipelines with an end‑to‑end generative sequence model inspired by LLM scaling laws and the Transformer architecture. The core innovation is a Q‑Aware RAB causal attention mechanism that injects a query‑aware relative bias for adaptive modeling of complex interactions and temporal signals. The case study also describes a two‑stage STS training algorithm that improves efficiency and avoids over‑fitting, a heterogeneous token representation with dual‑loss stacking for hot‑start, and KV‑Cache optimizations for high‑concurrency inference. Quantitative business impact after full deployment is provided.
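To make the notion of causal attention with a query‑dependent relative bias more concrete, here is a small pure‑Python sketch. The summary does not give GRAB's exact formulation, so this shows only one plausible reading: the bias for a (query, key) pair is the dot product between the query vector and an embedding of their distance. The function name, the rel_emb table, and the toy inputs are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def causal_attention_with_relative_bias(q, k, v, rel_emb):
    """Single-head causal attention with an additive, query-dependent bias.

    q, k, v: lists of per-token vectors (same sequence length).
    rel_emb[d]: hypothetical embedding for distance d = i - j >= 0.
    """
    n, dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for i in range(n):
        # Causal mask: token i only attends to positions j <= i.
        logits = [dot(q[i], k[j]) * scale + dot(q[i], rel_emb[i - j])
                  for j in range(i + 1)]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        # Future positions receive exactly zero weight.
        row = [e / z for e in exps] + [0.0] * (n - i - 1)
        weights.append(row)
        outputs.append([sum(row[j] * v[j][c] for j in range(n))
                        for c in range(len(v[0]))])
    return outputs, weights


# Toy sequence of three tokens with 2-dimensional vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rel_emb = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]]
outputs, weights = causal_attention_with_relative_bias(tokens, tokens, tokens, rel_emb)
print(weights[0])
```

Because the bias depends on both the query and the distance, different queries can weight recent versus older behavior differently, which is the kind of adaptive temporal modeling the paragraph describes.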
Overall, the article aggregates these three technical deep‑dives, offering architecture diagrams, performance evaluation data, and practical deployment lessons for building next‑generation recommendation and search systems powered by AI agents and large models.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
