How Agentic RAG and Generative Ranking Are Redefining AI Search and Recommendation

This article summarizes three cutting‑edge AI techniques—Alibaba Cloud's Agentic RAG architecture for multimodal search, Huawei Noah's large‑model‑driven recommendation system evolution, and Baidu's generative ranking (GRAB) model for ads—detailing their challenges, designs, performance gains, and practical deployment insights.


Agentic Retrieval‑Augmented Generation (RAG) in Alibaba Cloud AI Search

Alibaba Cloud's AI Search implements a multi-agent RAG architecture designed for high-concurrency, multimodal, and multi-hop query scenarios. Key technical components include:

Agent hierarchy: a planning agent decides the overall query strategy, a retrieval agent orchestrates a hybrid retrieval pipeline, and a generation agent produces the final answer. The system can scale from a single agent to a coordinated multi-agent network.
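
To make the division of labor concrete, here is a minimal sketch of the planning → retrieval → generation hand-off. All class names and the heuristic planner are hypothetical stand-ins, not Alibaba Cloud's actual APIs:

```python
# Hypothetical three-agent pipeline; a real planner would call an LLM.
from dataclasses import dataclass

@dataclass
class Plan:
    needs_sql: bool   # route structured questions to NL2SQL
    hops: int         # multi-hop queries trigger iterative retrieval

class PlanningAgent:
    def plan(self, query: str) -> Plan:
        # Trivial heuristic standing in for an LLM-based planner.
        return Plan(needs_sql="how many" in query.lower(),
                    hops=2 if " and " in query else 1)

class RetrievalAgent:
    def retrieve(self, query: str, plan: Plan) -> list[str]:
        docs = []
        if plan.needs_sql:
            docs.append("rows from NL2SQL over structured tables")
        for hop in range(plan.hops):        # one retrieval pass per hop
            docs.append(f"hop-{hop} passage for: {query}")
        return docs

class GenerationAgent:
    def generate(self, query: str, docs: list[str]) -> str:
        return f"answer to '{query}' grounded in {len(docs)} documents"

def answer(query: str) -> str:
    plan = PlanningAgent().plan(query)
    docs = RetrievalAgent().retrieve(query, plan)
    return GenerationAgent().generate(query, docs)

print(answer("Which regions grew fastest and why?"))
```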

Hybrid retrieval pipeline: combines four recall sources (vector similarity, full-text search, relational database lookup, and graph traversal); each source contributes a candidate set that is re-ranked to improve coverage and precision.
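
The article does not name the fusion method, so the sketch below assumes reciprocal-rank fusion (RRF), a common way to merge ranked candidate lists from heterogeneous recall sources before re-ranking:

```python
# Four-way recall fusion via reciprocal-rank fusion (an assumed choice).
from collections import defaultdict

def rrf_merge(candidate_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each occurrence adds 1 / (k + rank + 1)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in candidate_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits   = ["d3", "d1", "d7"]   # vector similarity recall
fulltext_hits = ["d1", "d4"]         # full-text (BM25-style) recall
sql_hits      = ["d9", "d1"]         # relational lookup via NL2SQL
graph_hits    = ["d7", "d9"]         # graph traversal recall

merged = rrf_merge([vector_hits, fulltext_hits, sql_hits, graph_hits])
print(merged)  # ['d1', 'd9', 'd7', 'd3', 'd4']; this list feeds the re-ranker
```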

GPU-accelerated indexing and quantization: indexing and query processing are offloaded to GPUs; 8-bit and 4-bit quantization are benchmarked, showing up to 3× latency reduction with <10% recall loss.
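
As an illustration of the quantization trade-off being benchmarked, here is a simple 8-bit symmetric quantizer for embedding vectors; the per-vector scaling scheme is an assumption, not the article's exact method:

```python
# Symmetric int8 quantization of a single embedding vector.
import numpy as np

def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0              # per-vector scale factor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

v = np.random.randn(768).astype(np.float32)
q, s = quantize_int8(v)
err = np.linalg.norm(v - dequantize(q, s)) / np.linalg.norm(v)
print(f"relative reconstruction error: {err:.4f}")  # typically ~1e-2
```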

NL2SQL and multimodal extensions: a dedicated NL2SQL module translates natural-language questions into SQL statements for structured data, while a multimodal encoder integrates image and text embeddings for cross-modal retrieval.
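
A toy NL2SQL flow might look like the following; the schema, the prompt the LLM would see, and the hard-coded translation inside nl2sql() are all hypothetical, shown only to make the routing concrete:

```python
# Stubbed NL2SQL round trip against an in-memory SQLite table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("east", 120.0), ("west", 80.0), ("east", 40.0)])

def nl2sql(question: str) -> str:
    # Stand-in for the LLM call; the real prompt would include the schema.
    # We hard-code the SQL the model would be expected to produce.
    assert question == "total order amount per region"
    return "SELECT region, SUM(amount) FROM orders GROUP BY region"

sql = nl2sql("total order amount per region")
print(conn.execute(sql).fetchall())  # [('east', 160.0), ('west', 80.0)]
```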

Performance metrics: end-to-end latency under 150 ms at 10k QPS, recall@10 > 0.92 on a mixed corpus of 200M documents, and a 1.8× throughput increase over a baseline single-agent RAG.

Recommendation System Evolution with Large Models (Huawei Noah)

The Huawei Noah "KAR" project demonstrates how large language models (LLMs) can be incorporated into recommendation pipelines to overcome traditional challenges such as noisy implicit feedback and limited semantic understanding.

Factorized prompting: the user query is decomposed into multiple sub-prompts (e.g., intent, context, constraints) that are fed to an LLM to generate enriched textual features.
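
A minimal sketch of factorized prompting, assuming hypothetical factor templates and a stubbed llm() call; the intent/context/constraints decomposition follows the description above:

```python
# One query fans out into several focused sub-prompts; the answers are
# concatenated into an enriched text feature for the downstream adapter.
FACTORS = {
    "intent":      "What is the user trying to accomplish? Query: {q}",
    "context":     "What situational context does this query imply? Query: {q}",
    "constraints": "What hard constraints (price, time, brand) appear? Query: {q}",
}

def llm(prompt: str) -> str:
    return f"<LLM answer to: {prompt[:40]}...>"   # placeholder for a real call

def enriched_features(query: str) -> dict[str, str]:
    return {name: llm(tpl.format(q=query)) for name, tpl in FACTORS.items()}

feats = enriched_features("cheap wireless earbuds for running")
text_feature = " ".join(feats.values())  # fed to the knowledge adapter next
print(text_feature)
```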

Multi-expert knowledge adapter: a set of lightweight expert networks (each specialized for a knowledge domain) processes the LLM output and maps it into a dense recommendation embedding space. The adapter balances feature dimensionality (typically 128–256) with real-time latency (<30 ms).
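
The adapter could be realized along the following lines; the MLP expert shape, softmax gating, and the 1024-d LLM feature input are assumptions consistent with the dimensionality range quoted above:

```python
# Toy multi-expert adapter: small MLP experts over the LLM feature vector,
# mixed by a gating network into a 128-d recommendation embedding.
import torch
import torch.nn as nn

class MultiExpertAdapter(nn.Module):
    def __init__(self, in_dim=1024, out_dim=128, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                          nn.Linear(256, out_dim))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(in_dim, n_experts)

    def forward(self, llm_feat):                                   # (B, in_dim)
        w = torch.softmax(self.gate(llm_feat), dim=-1)             # (B, E)
        outs = torch.stack([e(llm_feat) for e in self.experts], 1) # (B, E, out)
        return (w.unsqueeze(-1) * outs).sum(dim=1)                 # (B, out)

adapter = MultiExpertAdapter()
emb = adapter(torch.randn(2, 1024))
print(emb.shape)  # torch.Size([2, 128])
```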

Training strategy: the adapter is fine-tuned on a mixture of supervised click data and self-generated pseudo-labels from the LLM, using the combined loss L = L_{click} + λ·L_{distill} with λ = 0.2.
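
Sketched in PyTorch, the combined objective might look like this; binary cross-entropy for the click term and MSE against LLM pseudo-labels for the distillation term are our assumptions, since the text does not pin down either loss:

```python
# L = L_click + lambda * L_distill, with assumed loss choices for each term.
import torch
import torch.nn.functional as F

def combined_loss(logits, clicks, student_emb, teacher_emb, lam=0.2):
    l_click = F.binary_cross_entropy_with_logits(logits, clicks)
    l_distill = F.mse_loss(student_emb, teacher_emb)  # pseudo-label regression
    return l_click + lam * l_distill

loss = combined_loss(torch.randn(8),                    # click logits
                     torch.randint(0, 2, (8,)).float(), # observed clicks
                     torch.randn(8, 128),               # adapter output
                     torch.randn(8, 128))               # LLM pseudo-labels
print(loss.item())
```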

Results: online A/B testing reports a 1.5% absolute lift in AUC and a 3% increase in click-through rate, while maintaining the same inference budget.

GRAB: Generative Ranking for Ads (Baidu)

GRAB replaces traditional feature‑heavy DLRM pipelines with an end‑to‑end generative sequence model based on Transformer scaling laws.

Unified representation: user behavior sequences and candidate ad tokens are concatenated and encoded jointly, allowing the model to generate a relevance score directly.

Q-Aware RAB causal attention: a custom attention mask adds a query-aware relative bias term bias_{i,j} = α·(pos_i − pos_j) + β·sim(q, k_j), enabling adaptive handling of temporal signals and query-specific interactions.
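
A single-head sketch of this attention variant, with the causal mask plus the two bias terms from the formula above; α, β, and the dot-product choice for sim(·,·) are illustrative, not Baidu's actual parameters:

```python
# Causal attention with an added relative-position and query-similarity bias.
import torch

def q_aware_attention(Q, K, V, q_vec, alpha=0.1, beta=0.5):
    T, d = Q.shape
    scores = Q @ K.T / d**0.5                          # standard scaled scores
    pos = torch.arange(T, dtype=torch.float32)
    rel_bias = alpha * (pos[:, None] - pos[None, :])   # alpha * (pos_i - pos_j)
    sim_bias = beta * (K @ q_vec)                      # beta * sim(q, k_j)
    scores = scores + rel_bias + sim_bias[None, :]     # bias added per column j
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf")) # keep attention causal
    return torch.softmax(scores, dim=-1) @ V

T, d = 6, 16
out = q_aware_attention(torch.randn(T, d), torch.randn(T, d),
                        torch.randn(T, d), torch.randn(d))
print(out.shape)  # torch.Size([6, 16])
```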

STS two-stage training:

Stage 1 – pre‑train on large-scale click logs with a next‑token prediction objective.

Stage 2 – fine‑tune on high‑value conversion data using a contrastive loss to mitigate over‑fitting.
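
The Stage 2 objective might be instantiated as InfoNCE with in-batch negatives, as sketched below; the article only says "contrastive loss", so the pairing scheme and temperature are assumptions:

```python
# InfoNCE over (anchor, positive) pairs; other in-batch rows act as negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positives, temperature=0.07):
    """Each anchor's positive is the same-index row of `positives`."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature          # (B, B); off-diagonal = negatives
    labels = torch.arange(a.size(0))
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(16, 64), torch.randn(16, 64))
print(loss.item())
```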

Heterogeneous token representation: separate token embeddings for user actions, ad attributes, and auxiliary signals (e.g., time-of-day) are summed before feeding into the Transformer.
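
This summation scheme is straightforward to express directly; vocabulary sizes and the embedding dimension below are placeholders:

```python
# Action, ad-attribute, and time-of-day embeddings summed per position.
import torch
import torch.nn as nn

class HeteroTokenEmbedding(nn.Module):
    def __init__(self, n_actions=50, n_ad_attrs=1000, n_hours=24, dim=64):
        super().__init__()
        self.action = nn.Embedding(n_actions, dim)
        self.ad = nn.Embedding(n_ad_attrs, dim)
        self.hour = nn.Embedding(n_hours, dim)

    def forward(self, action_ids, ad_ids, hour_ids):
        # Element-wise sum keeps sequence length fixed while mixing signals.
        return self.action(action_ids) + self.ad(ad_ids) + self.hour(hour_ids)

emb = HeteroTokenEmbedding()
x = emb(torch.randint(0, 50, (2, 10)),
        torch.randint(0, 1000, (2, 10)),
        torch.randint(0, 24, (2, 10)))
print(x.shape)  # torch.Size([2, 10, 64])
```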

Dual-loss stacking: the model optimizes both a regression loss for predicted CTR and a ranking loss (pairwise log-sigmoid) to improve order quality.
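
A direct sketch of the two stacked losses, pairing MSE for CTR calibration with the pairwise log-sigmoid ranking term; the weighting between them is an assumed hyperparameter:

```python
# Pointwise CTR regression plus pairwise log-sigmoid ranking loss.
import torch
import torch.nn.functional as F

def dual_loss(pred_ctr, true_ctr, pos_scores, neg_scores, w_rank=1.0):
    l_reg = F.mse_loss(pred_ctr, true_ctr)                  # CTR calibration
    l_rank = -F.logsigmoid(pos_scores - neg_scores).mean()  # order quality
    return l_reg + w_rank * l_rank

loss = dual_loss(torch.rand(8), torch.rand(8),   # predicted vs. observed CTR
                 torch.randn(8), torch.randn(8)) # positive vs. negative scores
print(loss.item())
```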

KV-Cache inference optimization: caches key/value pairs for static ad embeddings, reducing per-request compute by ~40% and supporting >20k QPS with sub-50 ms latency.
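
The idea can be sketched as a per-ad cache of projected key/value pairs, so only user-sequence tokens are projected per request; keying the cache by ad id is an assumption:

```python
# Static ad tokens are projected once; later requests reuse the cached K/V.
import torch
import torch.nn as nn

d = 64
W_k = nn.Linear(d, d, bias=False)
W_v = nn.Linear(d, d, bias=False)
ad_embs = {42: torch.randn(d)}   # static ad embedding table (placeholder)
kv_cache: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}

def ad_kv(ad_id: int):
    if ad_id not in kv_cache:            # compute once, reuse afterwards
        with torch.no_grad():
            e = ad_embs[ad_id]
            kv_cache[ad_id] = (W_k(e), W_v(e))
    return kv_cache[ad_id]

k1, v1 = ad_kv(42)   # first request pays the projection cost
k2, v2 = ad_kv(42)   # subsequent requests hit the cache
assert torch.equal(k1, k2)   # cached K/V identical across requests
```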

Business impact: after full rollout, the system achieved a 12% increase in revenue per mille (RPM) and a 9% reduction in latency compared with the previous DLRM baseline.

Tags: large language models · RAG · Recommendation Systems · AI search · Multi-agent architecture · Generative Ranking
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
