Unlocking Agentic RAG and Generative Ranking: AI Search & Recommendation Breakthroughs

This article summarizes cutting‑edge techniques from Alibaba Cloud AI Search’s Agentic RAG architecture, Huawei Noah’s LLM‑enhanced recommendation evolution, and Baidu’s GRAB generative ranking model, detailing multi‑agent retrieval, multimodal data handling, scaling laws, causal attention, and performance gains demonstrated through benchmarks and real‑world deployments.

DataFunTalk

Alibaba Cloud AI Search – Agentic Retrieval‑Augmented Generation (RAG)

The technical presentation details how Alibaba Cloud AI Search addresses three core challenges: (1) high request concurrency, (2) multimodal data (text, images, structured tables, graphs), and (3) multi‑hop queries that require reasoning across heterogeneous sources.

Solution architecture evolves from a single‑agent pipeline to a multi‑agent system composed of:

Planner – parses the user intent, decides which retrieval modalities are needed, and orchestrates downstream agents.

Retriever – implements a multi‑path retrieval chain that simultaneously queries:

Vector similarity indexes for dense embeddings.

Traditional inverted indexes for keyword matching.

Relational databases for structured fields.

Graph stores for entity‑relationship traversal.

Generator – a large language model (LLM) that consumes the retrieved passages and produces a final answer.
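The Planner → Retriever → Generator flow described above can be sketched in a few lines. This is a minimal illustration of the control flow, not Alibaba Cloud's API; all class names, the routing rules, and the toy back-ends are invented for the example.

```python
class Planner:
    """Parses intent and decides which retrieval paths to invoke."""
    def plan(self, query: str) -> list[str]:
        paths = ["vector", "inverted"]       # always query dense + keyword indexes
        if "how many" in query or "average" in query:
            paths.append("sql")              # structured question → relational path
        if "related to" in query:
            paths.append("graph")            # relationship question → graph path
        return paths

class Retriever:
    """Fans the query out to the selected retrieval back-ends."""
    def __init__(self, backends: dict):
        self.backends = backends
    def retrieve(self, query: str, paths: list[str]) -> list[str]:
        passages = []
        for p in paths:
            passages.extend(self.backends[p](query))
        return passages

def generate(query: str, passages: list[str]) -> str:
    # Stand-in for the LLM call: summarizes the evidence it was given.
    return f"Answer to '{query}' based on {len(passages)} passages."

backends = {
    "vector":   lambda q: ["dense-hit"],
    "inverted": lambda q: ["keyword-hit"],
    "sql":      lambda q: ["row-hit"],
    "graph":    lambda q: ["edge-hit"],
}

planner, retriever = Planner(), Retriever(backends)
query = "how many orders did user 42 place"
paths = planner.plan(query)
answer = generate(query, retriever.retrieve(query, paths))
print(paths)     # ['vector', 'inverted', 'sql']
print(answer)
```

The point of the decomposition is that the Planner's routing decision, not the Generator, determines which back-ends pay the latency cost for a given query.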

Key engineering techniques:

GPU‑accelerated indexing and query processing; indexing time reduced by up to 3× compared with CPU‑only pipelines.

Post‑training quantization (INT8) of the retrieval encoder, cutting latency from 120 ms to 38 ms per query while keeping ≈99 % recall.
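The mechanism behind that latency cut can be illustrated with symmetric per-tensor INT8 post-training quantization of a weight matrix. This is the general idea only, not Alibaba's actual implementation; the matrix size and random weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((768, 768)).astype(np.float32)   # fp32 encoder weights

scale = np.abs(w).max() / 127.0                          # map max magnitude to 127
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_deq = w_int8.astype(np.float32) * scale                # dequantize for comparison

# Rounding error is bounded by scale/2, tiny relative to the weight range,
# which is why retrieval recall barely moves after quantization.
max_err = float(np.abs(w - w_deq).max())
print(f"max abs error: {max_err:.4f}, scale: {scale:.4f}")
```

INT8 weights take a quarter of the memory bandwidth of fp32 and map onto fast integer matmul kernels, which is where the wall-clock savings come from.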

Extension modules: NL2SQL layer that translates natural‑language questions into SQL for relational back‑ends.
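An NL2SQL layer typically works by prompting an LLM with the table schema and the question. The sketch below shows the shape of such a layer; the template, schema, and the `fake_llm` stand-in are all invented for illustration.

```python
# Hypothetical schema for the relational back-end.
SCHEMA = "orders(user_id INT, amount REAL, created_at DATE)"

def nl2sql(question: str, llm) -> str:
    """Ask the model to translate a natural-language question into SQL."""
    prompt = (
        f"Schema: {SCHEMA}\n"
        f"Translate the question into SQL. Answer with SQL only.\n"
        f"Question: {question}\nSQL:"
    )
    return llm(prompt).strip()

# Stand-in for a real model call, so the sketch runs end to end:
fake_llm = lambda p: " SELECT COUNT(*) FROM orders WHERE user_id = 42 "
sql = nl2sql("How many orders did user 42 place?", fake_llm)
print(sql)   # SELECT COUNT(*) FROM orders WHERE user_id = 42
```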

Multimodal search adapters that embed image features into the same vector space as text.
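A multimodal adapter of this kind usually amounts to a learned projection that maps image features into the text embedding space so a single vector index serves both modalities. The dimensions and the random projection below are illustrative, not the production model.

```python
import numpy as np

rng = np.random.default_rng(1)
d_img, d_txt = 512, 768
# In practice W is learned (e.g., via contrastive training); random here.
W = rng.standard_normal((d_img, d_txt)).astype(np.float32) / np.sqrt(d_img)

def embed_image(feat: np.ndarray) -> np.ndarray:
    v = feat @ W                       # project into the shared text space
    return v / np.linalg.norm(v)       # unit-normalize for cosine search

img = rng.standard_normal(d_img).astype(np.float32)
vec = embed_image(img)
print(vec.shape)                       # (768,)
```

Once image vectors live in the same normalized space as text vectors, the existing cosine-similarity index needs no changes to answer mixed-modality queries.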

Performance evaluation on a benchmark of 1 M mixed‑modality documents shows:

Recall@10 improvement from 71 % (single‑agent) to 86 % (multi‑agent).

End‑to‑end latency under 200 ms at 10 k QPS with GPU scaling.
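For reference, a Recall@10 figure like the 71% → 86% comparison above is the fraction of queries whose relevant document appears among the top 10 retrieved results. The toy result lists below are made up for illustration.

```python
def recall_at_k(results: dict[str, list[str]],
                relevant: dict[str, str], k: int = 10) -> float:
    """Fraction of queries whose relevant doc appears in the top-k results."""
    hits = sum(1 for q, docs in results.items() if relevant[q] in docs[:k])
    return hits / len(results)

relevant = {"q1": "d3", "q2": "d7", "q3": "d1"}
results = {
    "q1": ["d3", "d9", "d2"],          # hit at rank 1
    "q2": ["d4", "d5", "d6"],          # miss
    "q3": ["d8", "d1"],                # hit at rank 2
}
print(recall_at_k(results, relevant, k=10))   # 2 of 3 queries hit → 0.666...
```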

Huawei Noah – Large‑Language‑Model‑Driven Recommendation Systems

The article reviews the transition from conventional deep‑learning recommenders to architectures that embed LLMs and AI agents. Core problems include noisy implicit feedback, shallow semantic representations, and difficulty extracting user intent.

In the KAR (Knowledge Augmented Recommendation) project, LLMs are used as feature enhancers and as components of a multi‑agent recommendation workflow:

Factorized Prompting – decomposes a complex recommendation request into atomic sub‑prompts that LLMs can answer efficiently.
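Factorized prompting can be sketched as a function that expands one compound request into several aspect-specific sub-prompts, which can then be answered cheaply and in parallel. The aspects and templates below are invented examples, not KAR's actual prompt set.

```python
def factorize(user_profile: str, item: str) -> list[str]:
    """Split a compound recommendation question into atomic sub-prompts."""
    aspects = ["category preference", "price sensitivity", "recent intent"]
    return [
        f"Given the profile '{user_profile}', describe the user's {a} "
        f"relevant to the item '{item}'."
        for a in aspects
    ]

sub_prompts = factorize("frequent buyer of trail-running gear",
                        "GTX hiking boots")
print(len(sub_prompts))   # 3
print(sub_prompts[0])
```

Because each sub-prompt targets one narrow aspect, the answers are shorter and more reliable than a single monolithic prompt, and they can be cached per aspect.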

Multi‑Expert Knowledge Adapter – a set of specialist adapters (e.g., product taxonomy, user behavior, contextual signals) that map LLM‑generated semantics into the dense embedding space of the downstream ranking model.

Multi‑Expert Network Design – balances the dimensionality of textual features (up to 768) against real‑time latency constraints (≤30 ms per inference) by gating adapters based on request type.
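A request-gated multi-expert adapter can be sketched as follows: each expert maps a 768-dimensional LLM semantic vector into the ranking model's smaller embedding space, and a softmax gate mixes the expert outputs. The dimensions, expert count, and random weights are illustrative, not Huawei's production configuration.

```python
import numpy as np

rng = np.random.default_rng(2)
d_llm, d_rank, n_experts = 768, 64, 3    # e.g., taxonomy / behavior / context

experts = [rng.standard_normal((d_llm, d_rank)) / np.sqrt(d_llm)
           for _ in range(n_experts)]
gate_w = rng.standard_normal((d_llm, n_experts)) / np.sqrt(d_llm)

def adapt(sem: np.ndarray) -> np.ndarray:
    """Map an LLM semantic vector into the dense ranking-embedding space."""
    logits = sem @ gate_w
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()                                   # softmax over experts
    outs = np.stack([sem @ E for E in experts])          # (n_experts, d_rank)
    return gate @ outs                                   # gated mixture

sem = rng.standard_normal(d_llm)
dense = adapt(sem)
print(dense.shape)   # (64,)
```

Gating keeps inference cheap: only the mixture weights depend on the request, so lightly-weighted experts contribute little and can be skipped entirely in an optimized serving path.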

Additional engineering details:

Prompt‑engineering pipeline that includes few‑shot examples and dynamic template selection.

Fine‑tuning strategy: first freeze LLM weights, train adapters on a 10 M interaction dataset, then jointly fine‑tune the ranking backbone.

AI‑Agent coordination framework that routes listwise and conversational recommendation requests to the appropriate sub‑agents.

Experimental results on a production e‑commerce platform:

AUC increased by 1.5 % over the baseline deep‑learning model.

Online A/B test showed a 3.2 % lift in click‑through rate and a 2.7 % increase in conversion.

Baidu – Generative Ranking for Ads (GRAB)

GRAB replaces traditional Deep Learning Recommendation Models (DLRM) with a generative, sequence‑to‑sequence architecture inspired by LLM scaling laws and Transformer design. User behavior logs and candidate ad attributes are encoded as a single token sequence, enabling end‑to‑end generation of a ranking score.

Key components:

Q‑Aware RAB Causal Attention – introduces a query‑aware relative bias term b_{i,j}=f(q_i,q_j) that is added to the causal attention score between positions i and j, allowing the model to capture time‑varying interactions.
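The idea can be sketched as standard scaled dot-product attention plus an additive bias matrix under a causal mask, so token i only attends to positions j ≤ i. The random bias below is a stand-in for whatever parameterization of f(·) GRAB actually learns; single head, toy sizes.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 6, 16                                    # sequence length, head dim
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))
bias = rng.standard_normal((T, T)) * 0.1        # b[i, j], query-aware term

scores = Q @ K.T / np.sqrt(d) + bias            # dot-product scores + bias
mask = np.triu(np.ones((T, T), dtype=bool), 1)  # j > i → future positions
scores[mask] = -np.inf                          # causal masking

attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)        # row-wise softmax
out = attn @ V
print(out.shape)                                # (6, 16)
```

Because the bias enters before the softmax, it reshapes the attention distribution per position pair without adding any cost to the value aggregation.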

STS Two‑Stage Training –

Stage 1: pre‑train on a large synthetic corpus using masked language modeling to learn generic user‑ad dynamics.

Stage 2: fine‑tune on real click‑through data with a ranking loss (pairwise hinge) and a generation loss (cross‑entropy) jointly optimized.
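The Stage 2 joint objective can be written as a weighted sum of a pairwise hinge ranking loss on clicked-vs-skipped score pairs and a cross-entropy generation loss. The scores, probability, and the 0.5 weighting below are toy values for illustration; the source does not specify the actual mixing weight.

```python
import numpy as np

def pairwise_hinge(pos: float, neg: float, margin: float = 1.0) -> float:
    # Penalize unless the clicked item beats the skipped one by the margin.
    return max(0.0, margin - (pos - neg))

def cross_entropy(p_true: float) -> float:
    # Negative log-likelihood of the ground-truth next token.
    return -float(np.log(p_true))

s_clicked, s_skipped = 2.3, 1.8    # model scores for a clicked/skipped ad pair
p_token = 0.7                      # model probability of the correct token

loss = pairwise_hinge(s_clicked, s_skipped) + 0.5 * cross_entropy(p_token)
print(round(loss, 4))              # hinge 0.5 + 0.5 * 0.3567 ≈ 0.6783
```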

Heterogeneous Token Representation – hot‑start tokens for high‑frequency ads are represented with dedicated embeddings to reduce cold‑start latency.

Dual‑Loss Stacking – combines a pointwise regression loss and a listwise ListNet loss to improve ranking robustness.
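Dual-loss stacking can be sketched as a pointwise MSE against click labels plus a listwise ListNet term: the cross-entropy between the softmax of the predicted scores and the softmax of the relevance labels. The toy list and equal weighting below are illustrative only.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def listnet(scores: np.ndarray, labels: np.ndarray) -> float:
    # ListNet top-one cross-entropy between label and score distributions.
    p, q = softmax(labels), softmax(scores)
    return float(-(p * np.log(q)).sum())

scores = np.array([2.0, 0.5, 1.0])   # predicted ranking scores for one list
labels = np.array([1.0, 0.0, 0.0])   # graded relevance (first item clicked)

loss = float(np.mean((scores - labels) ** 2)) + listnet(scores, labels)
print(round(loss, 4))
```

The pointwise term keeps individual score magnitudes calibrated while the listwise term directly optimizes the ordering, which is why stacking them is more robust than either alone.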

KV‑Cache Inference Optimization – caches key/value pairs of the Transformer layers across requests, achieving ≈4× throughput at 100 k QPS.
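The KV-cache mechanism itself is simple to sketch: keys and values for already-processed tokens are stored, so each new token computes one attention row instead of reprocessing the whole sequence. Single attention head, random weights, illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

k_cache, v_cache = [], []

def step(x: np.ndarray) -> np.ndarray:
    """Attend the new token x over all cached positions plus itself."""
    k_cache.append(x @ Wk)             # extend the cache instead of recomputing
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = (x @ Wq) @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()                       # softmax over cached positions
    return w @ V

tokens = rng.standard_normal((5, d))
outs = [step(t) for t in tokens]       # 5 incremental decode steps
print(len(k_cache), outs[-1].shape)    # 5 (16,)
```

Per step this turns the O(T²) recomputation into O(T) work against the cache, which is where throughput multiples of the reported kind come from.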

Business impact after full deployment:

Click‑through rate increased by 4.5 %.

Cost‑per‑click reduced by 12 % due to more efficient ad selection.

System latency remained under 50 ms even at peak traffic.

Tags: large language models · Recommendation Systems · AI search · Agentic RAG · Generative Ranking
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
