Unlocking AI Search: Agentic RAG, LLM‑Powered Recommendations, and Generative Ranking Explained
This article summarizes three cutting‑edge AI search and recommendation techniques—Alibaba Cloud's Agentic RAG architecture, Huawei's LLM‑enhanced recommendation system evolution, and Baidu's generative ranking model GRAB—detailing their challenges, design choices, performance gains, and practical deployment insights.
Agentic Retrieval‑Augmented Generation (RAG) in Alibaba Cloud AI Search
Alibaba Cloud AI Search addresses high‑concurrency, multimodal data, and multi‑hop query scenarios by evolving from a single‑agent RAG to a multi‑agent system. The architecture separates three functional modules:
Planner: interprets user intent, decides which retrieval sources to use, and orchestrates downstream agents.
Retriever: executes a mixed‑recall pipeline that simultaneously queries vector indexes, text inverted lists, relational databases, and graph stores. This improves coverage for heterogeneous corpora.
Generator: produces the final answer with a large language model (LLM) conditioned on the retrieved passages.
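The three-module split can be sketched in a few lines of plain Python. This is a hedged illustration of the control flow only; the class names, routing heuristics, and backend callables are assumptions for the sketch, not Alibaba Cloud APIs.

```python
class Planner:
    """Maps a user query to the retrieval sources it should hit."""
    def plan(self, query: str) -> list[str]:
        sources = ["vector"]                          # dense recall is always on
        if any(tok in query.lower() for tok in ("how many", "average", "sum")):
            sources.append("sql")                     # structured/aggregate intent
        if "relationship" in query.lower():
            sources.append("graph")                   # multi-hop intent
        return sources

class Retriever:
    """Mixed-recall pipeline: fan out to every planned source, merge results."""
    def __init__(self, backends: dict):
        self.backends = backends                      # source name -> callable
    def retrieve(self, query: str, sources: list[str]) -> list[str]:
        passages = []
        for s in sources:
            passages.extend(self.backends[s](query))
        return passages

class Generator:
    """Stub for the LLM answer step, conditioned on retrieved passages."""
    def generate(self, query: str, passages: list[str]) -> str:
        return f"Answer to {query!r} grounded in {len(passages)} passages."

# Wiring the three modules together:
backends = {
    "vector": lambda q: ["dense hit"],
    "sql":    lambda q: ["row from relational store"],
    "graph":  lambda q: ["path from graph store"],
}
planner, retriever, generator = Planner(), Retriever(backends), Generator()
query = "How many orders did user 42 place?"
sources = planner.plan(query)
answer = generator.generate(query, retriever.retrieve(query, sources))
```

The key design point the sketch captures is that the Planner decides *where* to look before any retrieval runs, so expensive sources (SQL, graph) are only queried when the intent warrants it.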
Key engineering techniques include:
GPU‑accelerated indexing and query quantization, which reduce latency by up to 40 % compared with CPU‑only pipelines.
NL2SQL support that translates natural‑language queries into SQL statements for structured data retrieval.
Multimodal extensions that embed images and audio alongside text, enabling cross‑modal search.
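Of these, query and index quantization is the easiest to make concrete. The sketch below shows generic per-vector int8 scalar quantization, which stores embeddings in a quarter of the float32 memory at a small reconstruction cost; it is standard ANN practice and an assumption here, not Alibaba Cloud's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 128)).astype(np.float32)    # corpus embeddings

def quantize(x: np.ndarray):
    # Per-vector scale maps the largest component onto the int8 range.
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

q, scale = quantize(emb)
recon = dequantize(q, scale)

# Memory: 128 bytes/vector (int8) + 4-byte scale, vs. 512 bytes in float32.
err = np.abs(emb - recon).max()          # worst-case reconstruction error
```

Distance computations can then run directly on the int8 codes (e.g. with integer dot products), which is also what makes the index GPU-friendly.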
Performance evaluations on production workloads show latency under 150 ms for 10 k QPS and recall improvements of 12 % over a pure vector‑only baseline.
LLM‑Powered Recommendation System (Huawei Noah’s Ark Lab)
Huawei’s KAR project demonstrates how large language models can be integrated as feature enhancers in recommendation pipelines. The workflow consists of:
Factorized Prompting: decomposes a user request into multiple sub‑prompts that extract semantic cues (e.g., intent, constraints) from the LLM.
Multi‑Expert Knowledge Adapters: a set of lightweight adapter modules, each specialized for a knowledge domain (e.g., product taxonomy, user behavior). The adapters map LLM‑derived semantics into the same embedding space used by the downstream ranking model.
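A minimal sketch of the adapter step, assuming each expert is a small linear map from the frozen LLM's semantic vector into the ranking model's embedding space, mixed by a softmax gate. The dimensions, gating rule, and initialization are illustrative assumptions, not KAR's published design.

```python
import numpy as np

LLM_DIM, REC_DIM, N_EXPERTS = 768, 64, 4
rng = np.random.default_rng(0)

# One lightweight linear expert per knowledge domain; these are the only
# trainable parameters -- the base LLM stays frozen.
experts = [rng.standard_normal((LLM_DIM, REC_DIM)) * 0.02 for _ in range(N_EXPERTS)]
gate_w  = rng.standard_normal((LLM_DIM, N_EXPERTS)) * 0.02

def adapt(llm_vec: np.ndarray) -> np.ndarray:
    """Map an LLM semantic vector into the REC_DIM ranking embedding space."""
    logits = llm_vec @ gate_w
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                          # softmax over experts
    outs = np.stack([llm_vec @ W for W in experts])   # (N_EXPERTS, REC_DIM)
    return (weights[:, None] * outs).sum(axis=0)      # gated mixture

sem = rng.standard_normal(LLM_DIM)     # semantics extracted via prompting
rec_feat = adapt(sem)                  # extra feature for the ranking model
```

Because only the adapters (and gate) are updated during training, the expensive LLM forward pass can even be precomputed offline, which is what makes the tight online latency budget feasible.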
By balancing the dimensionality of LLM‑generated text features with the real‑time latency budget (≤30 ms per request), the system achieves a 1.5 % absolute AUC lift in online A/B tests. Additional capabilities include dialogue‑style recommendation, prompt‑engineering for controllable generation, and fine‑tuning pipelines that keep the base LLM frozen while updating adapters.
GRAB: Generative Ranking for Ads (Baidu)
GRAB replaces the traditional DLRM‑based ranking stack with an end‑to‑end generative sequence model that jointly encodes user behavior histories and candidate ads in a unified Transformer representation. Major technical contributions are:
Q‑Aware RAB Causal Attention: introduces a query‑aware relative bias term that captures complex feature interactions and temporal dynamics across the behavior‑ad sequence.
STS Two‑Stage Training: first pre‑trains on a large noisy dataset, then fine‑tunes on a smaller high‑quality set, mitigating over‑fitting.
Heterogeneous Token Representation: encodes categorical, continuous, and textual features as distinct token types, allowing the model to learn specialized embeddings.
Stacked Dual‑Loss Strategy: combines a generation loss (next‑token prediction) with a ranking loss (pairwise hinge) to improve both relevance and calibration.
KV‑Cache Inference: caches key‑value pairs of the Transformer layers for the user‑behavior prefix, enabling sub‑millisecond latency for real‑time ad ranking.
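The dual‑loss idea above can be made concrete with a small numerical sketch: a next‑token cross‑entropy term plus a pairwise hinge that pushes clicked‑ad scores above non‑clicked ones by a margin. The mixing weight and margin are illustrative assumptions, not GRAB's published values.

```python
import numpy as np

def generation_loss(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean next-token cross-entropy. logits: (T, V), targets: (T,)."""
    z = logits - logits.max(axis=1, keepdims=True)          # stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def pairwise_hinge(pos: np.ndarray, neg: np.ndarray, margin: float = 1.0) -> float:
    """Hinge loss pushing clicked-ad scores above non-clicked by `margin`."""
    return float(np.maximum(0.0, margin - (pos - neg)).mean())

rng = np.random.default_rng(0)
logits  = rng.standard_normal((8, 100))    # 8 positions, 100-token vocab
targets = rng.integers(0, 100, size=8)
pos_scores = np.array([2.0, 1.5])          # model scores for clicked ads
neg_scores = np.array([0.1, 0.8])          # model scores for non-clicked ads

alpha = 0.5                                # illustrative mixing weight
total = generation_loss(logits, targets) + alpha * pairwise_hinge(pos_scores, neg_scores)
```

Stacking the two terms lets the generation objective shape the shared sequence representation while the ranking objective directly optimizes the ordering that the ad auction consumes.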
After full deployment, GRAB delivered measurable business gains (e.g., click‑through‑rate increase of ~3 % and revenue uplift of ~2 %) while supporting >100 k QPS with latency under 50 ms.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
