How Agentic RAG, LLM‑Powered Recommendations, and Generative Ranking Transform AI Search and Ads

This article surveys cutting‑edge AI techniques—including Alibaba Cloud's Agentic RAG for multimodal search, Huawei Noah's LLM‑enhanced recommendation evolution, and Baidu's generative ranking (GRAB) for ads—detailing their architectures, optimization tricks, performance gains, and real‑world deployment results.


Agentic Retrieval‑Augmented Generation (RAG) in Alibaba Cloud AI Search

Alibaba Cloud AI Search addresses three primary challenges: ultra‑high query concurrency, multimodal data (text, image, graph, structured tables), and multi‑hop user intents that require several reasoning steps. The solution evolves from a single‑agent RAG pipeline to a multi‑agent architecture that separates planning, retrieval, and generation into dedicated modules that cooperate via a shared context store.
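The planner/retriever/generator split described above can be sketched as follows. This is a minimal illustration of the separation of concerns with a shared context store; the class and function names are invented for this sketch and are not Alibaba Cloud APIs.

```python
# Minimal sketch of a planner / retriever / generator split sharing one
# context store. All names are illustrative, not Alibaba Cloud APIs.

class SharedContext(dict):
    """Accumulates intermediate state that every agent can read and write."""

def plan(query: str, ctx: SharedContext) -> list:
    # A real planner would call an LLM to decompose a multi-hop intent;
    # here we naively split the query into sub-questions.
    steps = [s.strip() for s in query.split(" and ") if s.strip()]
    ctx["plan"] = steps
    return steps

def retrieve(step: str, ctx: SharedContext) -> list:
    # Stand-in for the multi-path retrieval layer.
    docs = [f"doc for: {step}"]
    ctx.setdefault("evidence", []).extend(docs)
    return docs

def generate(ctx: SharedContext) -> str:
    # A real generator would condition an LLM on the gathered evidence.
    return " | ".join(ctx.get("evidence", []))

def agentic_rag(query: str) -> str:
    ctx = SharedContext()
    for step in plan(query, ctx):
        retrieve(step, ctx)
    return generate(ctx)

answer = agentic_rag("find red sneakers and compare prices")
```

Because every agent reads and writes the same context object, each module can be replaced or scaled independently, which is the main operational benefit of the multi-agent design.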

Multi‑Path Retrieval Layer

Four parallel retrieval back‑ends are combined:

Vector similarity search (FAISS‑style ANN) for dense embeddings.

Full‑text inverted index for keyword matching.

Relational database lookup for structured attributes.

Graph‑based traversal for entity‑relationship queries.

Results are merged using a coverage‑aware ranking function that weights recall diversity against relevance scores.
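One way such a coverage-aware merge could work is sketched below. The exact weighting formula is not given in the article, so the scheme here (best per-backend relevance blended with a source-coverage bonus) is an assumption for illustration.

```python
# Illustrative coverage-aware merge of multi-path retrieval results.
# The weighting scheme is an assumption; the article does not give the formula.

def merge(result_lists, diversity_weight=0.3):
    """result_lists: {backend_name: [(doc_id, relevance), ...]}.
    Boosts documents surfaced by several back-ends (recall diversity)
    while keeping the best per-backend relevance score."""
    best_rel, sources = {}, {}
    for backend, hits in result_lists.items():
        for doc_id, rel in hits:
            best_rel[doc_id] = max(best_rel.get(doc_id, 0.0), rel)
            sources.setdefault(doc_id, set()).add(backend)
    scored = {
        doc_id: (1 - diversity_weight) * rel
                + diversity_weight * len(sources[doc_id]) / len(result_lists)
        for doc_id, rel in best_rel.items()
    }
    return sorted(scored, key=scored.get, reverse=True)

ranking = merge({
    "vector":   [("d1", 0.9), ("d2", 0.8)],
    "fulltext": [("d2", 0.7), ("d3", 0.6)],
})
```

Note how `d2` outranks the higher-relevance `d1` because two back-ends agree on it; that is the trade between recall diversity and raw relevance the text describes.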

Engine Optimizations

A custom indexing engine built on GPU kernels supports product quantization (PQ) and OPQ to trade index size against query latency. Typical configuration: 64‑dim embeddings, 256‑centroid PQ, achieving a 2‑3× speedup over CPU‑only indexing.
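To make the PQ trade-off concrete, here is a toy product quantizer: each vector is split into subvectors, and each subvector is replaced by the index of its nearest codebook centroid. The hand-picked centroids are for illustration only; real systems learn them with k-means, and OPQ additionally learns a rotation first.

```python
# Toy product quantization (PQ): split each vector into m subvectors and
# replace each subvector with the index of its nearest centroid.
# Centroids are hand-picked here; production systems learn them with k-means.

def nearest(sub, centroids):
    return min(range(len(centroids)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(sub, centroids[k])))

def pq_encode(vec, codebooks):
    m = len(codebooks)              # number of subspaces
    d = len(vec) // m               # dimensions per subvector
    return [nearest(vec[i * d:(i + 1) * d], codebooks[i]) for i in range(m)]

def pq_decode(codes, codebooks):
    out = []
    for code, cb in zip(codes, codebooks):
        out.extend(cb[code])
    return out

# Two subspaces with two centroids each: a 4-dim float vector compresses
# to just 2 small integer codes, at the cost of reconstruction error.
codebooks = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 1.0), (1.0, 0.0)]]
codes = pq_encode([0.9, 1.1, 0.1, 0.8], codebooks)
approx = pq_decode(codes, codebooks)
```

With 256 centroids per subspace (as in the configuration above), each subvector compresses to a single byte, which is where the index-size savings come from.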

GPU‑accelerated batch query processing with torch.cuda.Stream parallelism, reducing 99th‑percentile latency from 120 ms to 35 ms under 10 k QPS.

Extension modules:

NL2SQL layer that translates natural‑language queries into SQL using a fine‑tuned T5‑base model (learning rate 3e‑5, 3 epochs).

Multimodal search pipeline that extracts CLIP image embeddings and aligns them with text vectors via a cross‑modal projection head.
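The cross-modal projection head can be sketched as a linear map from the image-embedding space into the text-embedding space, followed by cosine similarity. The tiny weight matrix below is a stand-in for the learned projection; CLIP itself is not loaded here.

```python
# Sketch of a cross-modal projection head: map an image embedding into the
# text-embedding space with a linear layer, then compare by cosine similarity.
# The 2x2 weight matrix is a toy stand-in for the learned projection.

import math

def project(vec, weights):
    # weights: rows of the projection matrix (text_dim x image_dim)
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

image_emb = [1.0, 0.0]
W = [[0.0, 1.0], [1.0, 0.0]]   # toy learned projection (swaps axes)
text_emb = [0.0, 1.0]
sim = cosine(project(image_emb, W), text_emb)
```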

Performance Highlights

End‑to‑end throughput: 15 k QPS on a 4‑GPU node (NVIDIA A100).

Mean reciprocal rank (MRR) improvement of 12 % compared with a baseline single‑vector RAG system.

Latency distribution: 95th‑percentile < 40 ms, 99th‑percentile < 55 ms.

LLM‑Powered Recommendation System (Huawei Noah – KAR Project)

The KAR (Knowledge‑augmented Adapter for Recommendation) project replaces traditional collaborative‑filtering pipelines with a large‑language‑model (LLM) front‑end that enriches item and user representations with semantic knowledge.

Factorized Prompting & Knowledge Adapter

The input prompt is factorized into three components: user profile, item description, and task instruction. Each component is encoded separately, and the encodings are concatenated before being fed to the LLM.
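A minimal sketch of the factorized prompt assembly, assuming a simple tagged-template format (the actual template wording used by KAR is not given in the article):

```python
# Illustrative factorized prompt: the three components are built
# independently and then concatenated. Template wording is an assumption.

def build_prompt(user_profile: str, item_description: str, task: str) -> str:
    parts = [
        f"[USER] {user_profile}",
        f"[ITEM] {item_description}",
        f"[TASK] {task}",
    ]
    return "\n".join(parts)

prompt = build_prompt(
    "frequent buyer of trail-running gear",
    "lightweight waterproof running shoe",
    "predict whether the user will click this item",
)
```

Factorizing the prompt this way lets each component be cached and re-encoded independently, e.g. reusing a user-profile encoding across many candidate items.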

Multi‑expert knowledge adapter consists of three parallel feed‑forward networks, each trained on a distinct knowledge source (e.g., product taxonomy, user reviews, click‑stream patterns). The outputs are summed with a learned gating vector.

Adapter parameters: 2 M per expert, trained with AdamW (lr = 1e‑4) for 5 epochs on a mixed dataset of 200 M interactions.
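The multi-expert adapter with gated fusion can be sketched as follows. The expert networks here are one-layer toys with fixed weights, and the uniform gate is for illustration; in KAR both the experts and the gating vector are learned.

```python
# Sketch of the multi-expert adapter: parallel expert feed-forward networks
# combined by a gating vector. Weights are toys; in KAR they are learned.

import math

def ffn(x, w, b):
    # One-layer feed-forward expert with ReLU, toy-sized.
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def adapter(x, experts, gate_logits):
    gate = softmax(gate_logits)
    outs = [ffn(x, w, b) for w, b in experts]
    # Gated sum across experts, dimension by dimension.
    return [sum(g * o[i] for g, o in zip(gate, outs))
            for i in range(len(outs[0]))]

experts = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),   # e.g. taxonomy expert
    ([[0.5, 0.5], [0.5, 0.5]], [0.0, 0.0]),   # e.g. review expert
    ([[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0]),   # e.g. click-stream expert
]
fused = adapter([1.0, 0.0], experts, gate_logits=[0.0, 0.0, 0.0])
```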

Integration with Real‑Time Scoring

LLM (LLaMA‑7B) runs in inference‑only mode; embeddings are cached for hot items using a Redis‑backed KV‑Cache to meet ≤5 ms latency constraints.

Final recommendation score = dot‑product(user embedding, item embedding) + adapter‑bias.
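The scoring rule above, written out directly. In this toy version the adapter bias is a single scalar per item; names are illustrative.

```python
# Final score = dot(user embedding, item embedding) + adapter bias,
# as stated above. The scalar bias per item is an illustrative choice.

def score(user_emb, item_emb, adapter_bias):
    return sum(u * i for u, i in zip(user_emb, item_emb)) + adapter_bias

s = score([0.2, 0.8], [0.5, 0.5], adapter_bias=0.1)
```

Keeping the final score as a plain dot product plus bias is what makes the ≤5 ms budget feasible: the expensive LLM pass happens offline, and online serving reduces to a cached-embedding lookup and one inner product.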

Results

AUC increased from 0.842 to 0.857, an absolute gain of 0.015 (≈1.8 % relative).

Online A/B test over 2 weeks showed a 3.2 % lift in click‑through rate (CTR) and a 2.8 % increase in conversion rate.

Model size after adapter fusion: 7.3 B parameters, fitting within a single A100 GPU with 80 GB memory.

GRAB – Generative Ranking for Ads (Baidu)

GRAB replaces the classic Deep Learning Recommendation Model (DLRM) with a Transformer‑based encoder‑decoder that directly generates a ranking score sequence for candidate ads.

Core Architecture

Encoder consumes a concatenated token stream of user behavior (click, view, dwell time) and ad metadata (title, category, bid).

Decoder produces a single scalar per candidate via a generation head (linear layer + sigmoid).

All tokens share a unified embedding space (dimension 512) enabling end‑to‑end gradient flow.
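The generation head is a linear layer plus a sigmoid mapping each candidate's decoder state to a scalar in (0, 1). A minimal sketch with toy weights:

```python
# Sketch of the generation head: linear layer + sigmoid, producing one
# scalar ranking score per candidate. Weights here are toys.

import math

def generation_head(hidden, w, b):
    logit = sum(wi * hi for wi, hi in zip(w, hidden)) + b
    return 1.0 / (1.0 + math.exp(-logit))

score = generation_head([0.5, -0.25], w=[2.0, 0.0], b=0.0)
```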

Q‑Aware RAB Causal Attention

The Q‑Aware Relative‑bias Attention (RAB) modifies standard causal attention by adding a query‑dependent bias term b_{i,j}=f(q_i, p_j), where q_i is the query token embedding and p_j is the positional embedding of token j. This bias allows the model to weight recent user actions more heavily while still capturing long‑range dependencies.
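A toy version of causal attention with this query-dependent bias is sketched below. The article does not specify the form of f, so a dot product f(q_i, p_j) = q_i · p_j is assumed here; the causal mask restricts each token to attending over positions j ≤ i.

```python
# Toy causal attention with a query-dependent relative bias
# b[i][j] = f(q_i, p_j). Here f is a dot product; the real form of f
# is not specified in the article.

import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def rab_causal_attention(q, k, pos):
    """q, k: per-token query/key vectors; pos: positional embeddings."""
    n = len(q)
    weights = []
    for i in range(n):
        scores = []
        for j in range(i + 1):   # causal mask: attend only to j <= i
            dot = sum(a * b for a, b in zip(q[i], k[j]))
            bias = sum(a * b for a, b in zip(q[i], pos[j]))  # b_{i,j}
            scores.append(dot + bias)
        weights.append(softmax(scores))
    return weights

w = rab_causal_attention(
    q=[[1.0, 0.0], [0.0, 1.0]],
    k=[[1.0, 0.0], [1.0, 0.0]],
    pos=[[0.0, 0.0], [0.0, 1.0]],
)
```

In this toy run the second token's bias toward the more recent position dominates its attention weights, illustrating how a query-dependent bias can up-weight recent user actions.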

Training Strategy – STS Two‑Stage

Stage 1 (pre‑training) : Large‑scale unsupervised next‑token prediction on 500 M anonymized logs (batch size 4096, lr 5e‑4, 3 epochs).

Stage 2 (fine‑tuning) : Supervised ranking loss (pairwise hinge) on a curated set of 50 M labeled impressions (learning rate 1e‑5, early stopping on validation NDCG).
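The pairwise hinge loss used in Stage 2 penalizes the model whenever a clicked impression is not scored at least `margin` above an unclicked one:

```python
# Pairwise hinge loss for ranking fine-tuning: zero once the positive
# candidate clears the negative by at least `margin`.

def pairwise_hinge(pos_score, neg_score, margin=1.0):
    return max(0.0, margin - (pos_score - neg_score))

loss_ok  = pairwise_hinge(pos_score=2.5, neg_score=1.0)   # well separated
loss_bad = pairwise_hinge(pos_score=1.2, neg_score=1.0)   # inside the margin
```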

Additional Optimizations

Heterogeneous token representation: separate token types for categorical fields (one‑hot) and continuous fields (bucketized embeddings).

Dual‑loss stacking: combine pairwise hinge loss with a KL‑divergence regularizer to preserve distributional consistency.

KV‑Cache at inference time reduces per‑query compute by reusing encoder key/value states across candidates, achieving ≈30 % latency reduction under 100 k QPS.
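The dual-loss stacking mentioned above can be sketched as a weighted sum of the pairwise hinge term and a KL-divergence term toward a reference score distribution. The mixing weight `alpha` and the exact reference distribution are assumptions; the article gives only the loss pairing.

```python
# Sketch of dual-loss stacking: pairwise hinge + KL regularizer toward a
# reference distribution. `alpha` and the reference are assumptions.

import math

def kl_div(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def dual_loss(pos, neg, model_dist, ref_dist, margin=1.0, alpha=0.1):
    hinge = max(0.0, margin - (pos - neg))
    return hinge + alpha * kl_div(model_dist, ref_dist)

loss = dual_loss(pos=2.0, neg=1.5,
                 model_dist=[0.6, 0.4], ref_dist=[0.5, 0.5])
```

The KL term keeps the generatively produced score distribution close to a reference (e.g. the previous production model), which is what "preserve distributional consistency" refers to above.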

Deployment Impact

Full rollout on Baidu’s ad platform (≈200 M daily requests) yielded a 4.5 % increase in revenue per mille (RPM) and a 2.1 % reduction in cost‑per‑click (CPC).

System maintains ≤10 ms 99th‑percentile latency with 8‑GPU inference cluster (NVIDIA H100).

These three case studies demonstrate how multi‑agent orchestration, LLM‑enhanced semantic augmentation, and generative Transformer ranking can be engineered to meet the scalability, multimodality, and relevance requirements of modern search and advertising services.

Tags: GPU acceleration · Recommendation Systems · AI search · multimodal retrieval · Agentic RAG · Generative Ranking
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
