Exploring Cutting‑Edge AI Search & Recommendation: Agentic RAG, LLM‑Enhanced Recs, and Baidu’s Generative Ranking
This article reviews three advanced AI-driven solutions—Alibaba Cloud's Agentic RAG for high‑concurrency multimodal search, Huawei Noah's LLM‑augmented recommendation architecture, and Baidu's generative ranking model GRAB—detailing their challenges, designs, performance gains, and practical deployment insights.
Alibaba Cloud AI Search – Agentic RAG Technical Practice
Alibaba Cloud AI Search addresses high‑concurrency, multimodal data, and multi‑hop query scenarios by evolving from a single‑agent Retrieval‑Augmented Generation (RAG) system to a multi‑agent architecture. The system consists of three coordinated modules:
Planner Agent parses user intent and generates a structured plan.
Retriever Agent executes the plan using a multi‑path retrieval chain that combines multiple retrieval strategies in parallel.
Generator Agent consumes the retrieved results and produces a final response.
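The three‑agent loop above can be sketched as a minimal pipeline. Everything here — the toy corpus, the path names, and the agent logic — is an illustrative stand‑in, not Alibaba Cloud's implementation:

```python
from dataclasses import dataclass

# Toy corpus standing in for the real multi-path index (illustrative only).
CORPUS = {
    "keyword": {"gpu indexing": "GPU-accelerated index build notes."},
    "vector": {"gpu indexing": "Nearest-neighbour passage about GPU indexes."},
}

@dataclass
class PlanStep:
    path: str   # which retrieval path to use
    query: str  # sub-query produced by the planner

def planner_agent(user_query: str) -> list[PlanStep]:
    """Parse intent into a structured plan: one step per retrieval path."""
    return [PlanStep(path=p, query=user_query.lower()) for p in CORPUS]

def retriever_agent(plan: list[PlanStep]) -> list[str]:
    """Execute each plan step against its retrieval path; collect hits."""
    hits = []
    for step in plan:
        doc = CORPUS[step.path].get(step.query)
        if doc:
            hits.append(doc)
    return hits

def generator_agent(query: str, evidence: list[str]) -> str:
    """Stand-in for the LLM: fuse retrieved evidence into one answer."""
    return f"Q: {query} | evidence: {' / '.join(evidence)}"

answer = generator_agent("GPU indexing",
                         retriever_agent(planner_agent("GPU indexing")))
```

In the real system each stage is an LLM‑driven agent; the value of the structure is that the plan is explicit, so retrieval paths can be added or swapped without touching the generator.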
Key engineering optimizations include:
Self‑developed search engine with GPU‑accelerated indexing; quantization reduces index size by up to 70 % while preserving recall.
Hybrid retrieval latency under 30 ms for 10 M vectors on a single GPU.
Extension modules for NL2SQL translation and multimodal (image‑text) search, implemented as plug‑in adapters to the Retriever Agent.
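The article does not describe the quantizer itself. As a rough illustration of the size/recall trade‑off, int8 scalar quantization of a float32 index gives a 4× (75 %) raw‑size reduction — in the same spirit as the "up to 70 %" figure once real‑world metadata overhead is considered. All numbers below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 128)).astype(np.float32)  # toy index

# Symmetric scalar quantization: float32 -> int8.
scale = np.abs(vectors).max() / 127.0
quantized = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
restored = quantized.astype(np.float32) * scale

size_ratio = quantized.nbytes / vectors.nbytes   # 4x smaller
mean_abs_err = np.mean(np.abs(restored - vectors))  # bounded by scale / 2
```

Production systems typically layer product quantization or GPU‑resident codebooks on top of this basic idea; the point is that reconstruction error stays small relative to the quantization step, which is why recall is largely preserved.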
Performance evaluation shows a 2.3× increase in query coverage and a 15 % boost in relevance metrics (NDCG@10) compared with a baseline single‑path RAG.
Huawei Noah’s Ark Lab – LLM‑Powered Recommendation (KAR Project)
The KAR project demonstrates how large language models (LLMs) can be integrated into recommendation pipelines to overcome noisy implicit feedback and limited semantic understanding.
Architecture Overview
Factorized Prompting: User interaction logs are transformed into concise prompts that guide the LLM to generate enriched semantic features.
Multi‑Expert Knowledge Adapter: A set of specialist networks (text, categorical, temporal) processes LLM outputs and maps them into the same embedding space used by the downstream ranking model.
Embedding Fusion: Adapter outputs are concatenated with traditional collaborative‑filtering embeddings before being fed to a two‑tower ranking model.
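A minimal sketch of the gated multi‑expert adapter and embedding fusion, assuming linear experts and a softmax gate — the dimensions and weights are synthetic, not from the KAR paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D_LLM, D_EMB = 16, 8  # illustrative dimensions (assumed, not from the paper)

# One linear "expert" per feature family: text, categorical, temporal.
experts = {name: rng.standard_normal((D_LLM, D_EMB))
           for name in ("text", "categorical", "temporal")}
gate_w = rng.standard_normal((D_LLM, len(experts)))

def adapter(llm_feat: np.ndarray) -> np.ndarray:
    """Map an LLM feature vector into the ranking embedding space,
    with a softmax gate weighting each expert's projection."""
    logits = llm_feat @ gate_w
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()
    outs = [llm_feat @ W for W in experts.values()]
    return sum(g * o for g, o in zip(gates, outs))

llm_feat = rng.standard_normal(D_LLM)   # enriched feature from the LLM
cf_emb = rng.standard_normal(D_EMB)     # collaborative-filtering embedding
fused = np.concatenate([adapter(llm_feat), cf_emb])  # fed to the two-tower ranker
```

The gate is what keeps latency manageable: experts whose gate weight collapses toward zero contribute little, and in a trained system can be pruned or skipped.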
Experimental results:
AUC improvement of 1.5 % over the production baseline.
Online A/B test shows a 3.2 % increase in click‑through rate (CTR) and a 2.8 % lift in conversion.
Implementation Details
LLM fine‑tuned on 200 M domain‑specific utterances using LoRA adapters (learning rate 1e‑4, 3 epochs).
Prompt templates include user profile, recent actions, and target item description.
Knowledge adapter employs a gating mechanism to balance contribution from each expert, ensuring latency stays below 50 ms per request.
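The LoRA update itself is just a frozen weight plus a trainable low‑rank product. A numpy sketch of the forward pass — the rank and alpha values are assumed for illustration; only the learning rate and epoch count above come from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16   # rank r and scaling alpha are assumed values

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # zero-init, so training starts at W

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Only A and B (2*d*r params) are trained; W (d*d params) stays frozen.
    return x @ (W + (alpha / r) * B @ A).T

trainable = A.size + B.size   # 1024 trainable vs. 4096 frozen parameters
```

With B zero‑initialized, the adapted layer is exactly the pretrained layer at step 0, which is what makes LoRA fine‑tuning stable on a 200 M‑utterance corpus.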
Future roadmap proposes a multi‑agent collaborative framework where separate agents handle intent detection, feature generation, and ranking.
Baidu – GRAB (Generative Ranking for Ads)
GRAB replaces traditional Deep Learning Ranking Models (DLRM) with an end‑to‑end generative Transformer that models user behavior sequences and ad candidates as a unified generation task.
Core Model Design
Input sequence: [USER_BEHAVIOR_TOKENS] → [TARGET_AD_TOKENS], where tokens are heterogeneous (categorical IDs, continuous features, and textual embeddings).
Q‑Aware RAB Causal Attention: introduces a query‑aware relative bias term, bias(i, j) = f(query, i − j), to capture time‑decay and query‑specific interactions.
Dual‑Loss Stacking: combines a generative language modeling loss with a pairwise ranking loss to improve both relevance and calibration.
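One possible instantiation of the query‑aware relative bias inside causal attention — the source does not give the exact form of f, so the exponential time‑decay and learned scalar below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 8  # toy sequence length and head dimension

Q = rng.standard_normal((T, D))
K = rng.standard_normal((T, D))
V = rng.standard_normal((T, D))
query_ctx = rng.standard_normal(D)  # e.g. an embedding of the search query
w_bias = rng.standard_normal(D)     # hypothetical learned projection

# bias(i, j) = f(query, i - j): here f = (query-dependent scalar) * time-decay.
rel = np.arange(T)[:, None] - np.arange(T)[None, :]      # i - j
bias = (query_ctx @ w_bias) * np.exp(-np.abs(rel) / T)

scores = Q @ K.T / np.sqrt(D) + bias
scores = np.where(rel >= 0, scores, -np.inf)             # causal mask: no future
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ V
```

Because the bias depends on both the query context and the distance i − j, the same behavior sequence can attend differently under different ad queries — the property the "Q‑Aware" name refers to.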
Training Procedure
STS Two‑Stage Training:
Stage 1 – Pre‑train on massive click logs using masked language modeling to learn generic user‑ad dynamics.
Stage 2 – Fine‑tune with the dual loss on a curated high‑quality dataset to mitigate over‑fitting.
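The Stage 2 dual loss can be sketched as a token‑level language‑modeling loss plus a pairwise logistic ranking loss; the 0.5 mixing weight and all scores below are illustrative, not from the paper:

```python
import numpy as np

def log_softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

# Toy generative head: 5-token vocabulary, target is token 0.
token_logits = np.array([2.0, 0.1, -1.0, 0.5, 0.0])
target_token = 0

# Toy ranking head: scores for a clicked vs. non-clicked ad.
s_pos, s_neg = 1.3, 0.4

lm_loss = -log_softmax(token_logits)[target_token]   # generative LM loss
rank_loss = np.log1p(np.exp(-(s_pos - s_neg)))       # pairwise logistic loss
total_loss = lm_loss + 0.5 * rank_loss               # 0.5 is an assumed weight
```

The LM term keeps the model a well‑calibrated sequence generator, while the pairwise term directly pushes clicked ads above non‑clicked ones — the relevance/calibration split the article describes.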
KV‑Cache inference optimization reduces per‑request GPU memory by 40 % and supports >10 k QPS with sub‑10 ms latency.
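KV‑cache decoding saves memory and compute by reusing the keys/values of already‑processed tokens, so each step attends over a growing cache instead of re‑encoding the whole prefix. A toy sketch with stand‑in projections:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
cache_k, cache_v = [], []  # grows by one row per decoded token

def decode_step(x: np.ndarray) -> np.ndarray:
    """One decode step: append this token's K/V, attend over the cache."""
    k, v = x * 0.5, x * 2.0          # stand-ins for the real K/V projections
    cache_k.append(k)
    cache_v.append(v)
    K, V = np.stack(cache_k), np.stack(cache_v)
    s = x @ K.T
    w = np.exp(s - s.max())          # softmax over cached positions
    w /= w.sum()
    return w @ V

for _ in range(4):
    out = decode_step(rng.standard_normal(D))
```

Per‑step work scales with the cache length rather than with recomputing the full prefix, which is what enables the high‑QPS, low‑latency serving figures quoted above.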
Deployment outcomes:
Business metrics show a 4.5 % increase in revenue per mille (RPM) and a 2.1 % reduction in cost‑per‑click (CPC) after full rollout.
Model size 350 M parameters, serving on a cluster of 8 A100 GPUs with mixed‑precision inference.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
