Agentic RAG, LLM‑Powered Recommendation, and Generative Ranking: Cutting‑Edge AI Search Techniques
This article reviews three advanced AI search solutions—Alibaba Cloud's Agentic RAG architecture for multi‑modal retrieval, Huawei's LLM‑enhanced recommendation system with factorized prompting, and Baidu's generative ranking model GRAB—detailing their technical challenges, design choices, performance gains, and deployment insights.
Alibaba Cloud AI Search – Agentic RAG Architecture
The system addresses high concurrency, multimodal data, and multi-hop queries by evolving from a single-agent design to a multi-agent framework. The architecture consists of three cooperating modules (a minimal code sketch of their interplay follows the list):
Planner: parses user intent, decomposes complex queries into sub-tasks, and orchestrates the downstream agents.
Retriever: implements a hybrid-recall pipeline that queries four data stores in parallel: a dense vector index, a full-text inverted index, relational databases, and graph databases. The hybrid strategy improves both recall (coverage) and precision for heterogeneous content.
Generator: a large language model (LLM) produces the final response, optionally invoking NL2SQL or multimodal adapters to return structured results or images.
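To make the division of labor concrete, here is a minimal Python sketch of the Planner → Retriever → Generator loop. All names here (SubTask, plan, retrieve, generate) are illustrative assumptions rather than Alibaba Cloud identifiers, and every backing store and LLM call is stubbed.

```python
# Hypothetical sketch of the three-module Agentic RAG loop; not production code.
from dataclasses import dataclass

@dataclass
class SubTask:
    query: str
    store: str  # "vector" | "fulltext" | "relational" | "graph"

def plan(user_query: str) -> list[SubTask]:
    """Planner: decompose the query into per-store sub-tasks (stubbed)."""
    return [SubTask(user_query, s) for s in ("vector", "fulltext")]

def retrieve(task: SubTask) -> list[str]:
    """Retriever: fan out to one backing store (stubbed)."""
    return [f"doc from {task.store} for '{task.query}'"]

def generate(user_query: str, evidence: list[str]) -> str:
    """Generator: call the LLM with the retrieved evidence (stubbed)."""
    return f"Answer to '{user_query}' grounded in {len(evidence)} documents."

def answer(user_query: str) -> str:
    evidence = [doc for task in plan(user_query) for doc in retrieve(task)]
    return generate(user_query, evidence)

print(answer("Which regions saw GPU demand grow fastest last quarter?"))
```

In the real system the Planner's decomposition and the Generator are LLM calls and the Retriever fans out to live indexes, but the control flow is the same plan → retrieve → generate pipeline.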
Key engineering optimizations include:
GPU-accelerated indexing and query processing with 8-bit and 4-bit quantization, yielding up to a 3× throughput gain while keeping latency under 50 ms at 10k QPS (a quantization sketch follows this list).
Dynamic routing of retrieval requests based on query type, reducing unnecessary vector searches by 40%.
Extensible plug‑in interfaces for NL2SQL, vision‑language models, and custom ranking functions.
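The article does not name the exact quantization scheme, so the following is a sketch of one plausible variant: symmetric per-matrix 8-bit quantization with integer dot products, trading a small amount of precision for roughly 4× less memory traffic.

```python
# Illustrative int8 vector quantization; the production scheme is unspecified.
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 vectors to int8 with a single symmetric scale."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

def int8_dot(q: np.ndarray, sq: float, d: np.ndarray, sd: float) -> np.ndarray:
    """Approximate float dot products from int8 codes and their scales."""
    return (q.astype(np.int32) @ d.astype(np.int32).T) * (sq * sd)

docs = np.random.randn(10_000, 128).astype(np.float32)
query = np.random.randn(1, 128).astype(np.float32)
qd, sd = quantize_int8(docs)
qq, sq = quantize_int8(query)
scores = int8_dot(qq, sq, qd, sd)   # ~4x smaller memory footprint than float32
print(scores.shape, int(np.argmax(scores)))
```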
Performance evaluations on a production workload show 2.1× latency reduction and 15% increase in query success rate compared with the previous single‑agent pipeline.
Huawei Noah’s LLM‑Driven Recommendation System
The paper outlines the migration from conventional deep-learning recommenders to a pipeline that uses LLMs as semantic feature enhancers. The core challenges addressed are:
High noise in implicit feedback signals.
Limited semantic understanding of user actions.
Difficulties in extracting fine‑grained user intent.
Solution components:
LLM Feature Enhancer: raw interaction logs are fed to an LLM (e.g., ChatGLM-6B) using factorized prompting, which splits the prompt into context, task, and output sections to enable efficient batch inference (sketched after this list).
Multi-Expert Knowledge Adapter (KAR project): a set of lightweight adapters, one per domain expert, is trained with LoRA to map LLM-generated semantics into the recommendation model's embedding space. The adapters are combined via a gating network that balances feature dimensionality against latency constraints (≤ 30 ms per request); see the adapter sketch below.
Prompt Engineering & Fine-Tuning: iterative prompt refinement and supervised fine-tuning on a curated click-through dataset improve intent-extraction accuracy by 12%.
Agent Coordination: a lightweight orchestrator routes each request to either the traditional factorization-machine path or the LLM-enhanced path based on confidence thresholds.
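A minimal sketch of the context/task/output factorization described for the feature enhancer. The section tags and texts are illustrative assumptions, not the paper's actual templates.

```python
# Hypothetical factorized prompt builder; tags and wording are assumptions.
def build_prompt(history: list[str], task: str) -> str:
    context = "User recently interacted with: " + "; ".join(history)
    output_spec = "Answer with a comma-separated list of interest tags."
    # Keeping the three sections in a fixed layout lets the shared parts be
    # reused across a batch, which is what makes batched inference efficient.
    return f"[CONTEXT]\n{context}\n[TASK]\n{task}\n[OUTPUT]\n{output_spec}"

prompts = [
    build_prompt(["wireless earbuds", "phone case"],
                 "Infer the user's current shopping intent."),
    build_prompt(["hiking boots", "tent"],
                 "Infer the user's current shopping intent."),
]
for p in prompts:
    print(p, end="\n\n")   # send these through the LLM in one batched call
```

Because the task and output sections are fixed across users and only the context varies, large-batch LLM inference over interaction logs becomes practical.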
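And a PyTorch sketch of the gated multi-expert adapter: each expert projects LLM semantics into the recommender's embedding space, and a gating network mixes the projections. The class name, dimensions, and expert count are assumptions, and the LoRA training of the experts is omitted.

```python
# Illustrative gated multi-expert adapter in the spirit of KAR; not the
# published implementation.
import torch
import torch.nn as nn

class GatedKnowledgeAdapter(nn.Module):
    def __init__(self, llm_dim=4096, rec_dim=64, n_experts=4):
        super().__init__()
        # One lightweight projection per domain expert.
        self.experts = nn.ModuleList(
            [nn.Linear(llm_dim, rec_dim) for _ in range(n_experts)]
        )
        self.gate = nn.Linear(llm_dim, n_experts)

    def forward(self, llm_feat: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(llm_feat), dim=-1)        # (B, E)
        outs = torch.stack([e(llm_feat) for e in self.experts], 1)  # (B, E, D)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)            # (B, D)

adapter = GatedKnowledgeAdapter()
emb = adapter(torch.randn(8, 4096))   # LLM feature -> 64-d rec embedding
print(emb.shape)  # torch.Size([8, 64])
```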
Online A/B testing on a major e‑commerce platform reported a 1.5% absolute lift in AUC and a 3.2% increase in conversion rate.
Baidu GRAB – Generative Ranking for Ads
GRAB replaces the classic DLRM (Deep Learning Recommendation Model) with an end‑to‑end generative sequence model inspired by LLM scaling laws. The architecture embeds user behavior sequences and candidate ad tokens into a shared Transformer space, enabling the model to generate a ranking score directly from the sequence.
Key innovations:
Q-Aware RAB Causal Attention: introduces a query-aware relative bias term bias_{i,j} = f(query_i, position_j) that adapts attention weights to the current search query, improving the modeling of temporal and interaction patterns (sketched in code after this list).
Two‑Stage STS Training : first stage pre‑trains on massive unlabeled logs with a masked sequence objective; second stage fine‑tunes on labeled click‑through data using a pairwise ranking loss.
Heterogeneous Token Representation: combines dense token embeddings for user actions with sparse categorical embeddings for ad features; a dual loss (generative + contrastive) mitigates overfitting and accelerates warm-start.
KV-Cache Inference: caches the key/value pairs of static ad embeddings, reducing per-request compute by ~60% and supporting over 20k QPS at sub-10 ms latency (see the caching sketch below).
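A sketch of a query-aware additive attention bias of the form bias_{i,j} = f(query_i, position_j) applied under a causal mask. The linear parameterization of f is an assumption; this summary does not disclose Baidu's exact formulation.

```python
# Illustrative query-aware biased causal attention; the parameterization of
# f is assumed, not taken from the GRAB paper.
import torch
import torch.nn.functional as F

def q_aware_causal_attention(q, k, v, pos_emb, bias_proj):
    """q, k, v: (B, T, D); pos_emb: (T, D); bias_proj: a D->D linear map."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # content scores (B, T, T)
    bias = bias_proj(q) @ pos_emb.T                        # bias_{i,j} = f(query_i, position_j)
    T = q.shape[1]
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = (scores + bias).masked_fill(causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

B, T, D = 2, 16, 32
q, k, v = (torch.randn(B, T, D) for _ in range(3))
out = q_aware_causal_attention(q, k, v, torch.randn(T, D), torch.nn.Linear(D, D))
print(out.shape)  # torch.Size([2, 16, 32])
```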
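And a sketch of the static-ad KV-cache idea: keys and values derived from ad embeddings do not change between requests, so they can be computed once and reused, leaving only the user-behavior tokens to encode per request. The AdKVCache class and its layout are illustrative assumptions.

```python
# Hypothetical cache of precomputed K/V tensors for static ad embeddings.
import torch

class AdKVCache:
    def __init__(self):
        self._cache: dict[str, tuple[torch.Tensor, torch.Tensor]] = {}

    def get(self, ad_id: str, ad_emb: torch.Tensor, w_k, w_v):
        # K/V for an ad never change between requests: compute once, reuse,
        # so per-request work is only the user sequence.
        if ad_id not in self._cache:
            self._cache[ad_id] = (w_k(ad_emb), w_v(ad_emb))
        return self._cache[ad_id]

D = 32
w_k, w_v = torch.nn.Linear(D, D), torch.nn.Linear(D, D)
cache = AdKVCache()
k1, v1 = cache.get("ad_42", torch.randn(1, D), w_k, w_v)
k2, v2 = cache.get("ad_42", torch.randn(1, D), w_k, w_v)  # cache hit
print(torch.equal(k1, k2))  # True: the second call reused the cached tensors
```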
Deployment results show a 4.8% increase in revenue per mille (RPM) and a 2.3% reduction in latency compared with the legacy DLRM pipeline.