How Alibaba Cloud’s AI Search Evolves with Agentic RAG and Multi‑Model Innovations
This article details Alibaba Cloud AI Search’s development journey, covering its dual product lines, the evolution of Agentic RAG technology, multi‑agent architectures, vector retrieval breakthroughs, GPU‑accelerated indexing, NL2SQL capabilities, deployment models, and future directions for AI‑driven search solutions.
Introduction
This talk by Xing Shaomin, head of Alibaba Cloud AI Search, reviews the R&D history of Alibaba Cloud AI Search, the key technologies behind Agentic RAG, product deployment models, and future directions, showcasing Alibaba Cloud’s innovation and outlook in AI search.
Table of Contents
1. Introduction to Alibaba Cloud AI Search
2. Agentic RAG Key Technologies
3. Agentic RAG Product Deployment
4. Future Development Directions
Alibaba Cloud AI Search Overview
Alibaba Cloud AI Search offers two main product lines: the open‑source Elasticsearch line and the self‑developed OpenSearch line, which complement each other to provide comprehensive, multi‑layered search solutions for enterprises.
Open-source Elasticsearch Product Line
In 2018 Alibaba Cloud partnered with Elastic to host Elasticsearch on its platform, adding enhancements such as the Indexing Service that separates write and query operations, improving concurrency and query performance. OSS is used as storage to reduce costs, with caching to mitigate latency. The service has evolved to a serverless architecture with high‑performance read‑write separation and intelligent scaling. In the AI search era, Elasticsearch adds vector retrieval, LLM‑plus‑search, RAG Q&A, and AI Assistant capabilities.
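To make the vector-retrieval capability concrete, here is a minimal sketch of an Elasticsearch 8.x hybrid search body that combines classic full-text matching with approximate kNN over a dense-vector field. The index and field names (`docs` is implied, `title`, `embedding`) are illustrative assumptions, not the actual Alibaba Cloud schema.

```python
# Sketch of an Elasticsearch 8.x hybrid query body: BM25 text matching
# plus approximate kNN vector retrieval in a single request. Field names
# ("title", "embedding") are illustrative only.

def build_hybrid_query(text, query_vector, k=10, num_candidates=100):
    """Return an ES search body that blends lexical scoring with kNN."""
    return {
        "query": {"match": {"title": text}},   # lexical recall path
        "knn": {
            "field": "embedding",              # dense_vector field
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,  # candidates gathered per shard
        },
        "size": k,
    }

body = build_hybrid_query("serverless indexing", [0.1] * 4, k=5)
```

Such a body would be passed to the Elasticsearch `_search` endpoint; the cluster fuses both score streams before returning the top `size` hits.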
Self‑developed OpenSearch Product Line
The OpenSearch line has progressed through three stages:
High‑Performance Search Engine (2008‑2020)
Built a C++‑based engine to meet Alibaba Group’s massive traffic (hundreds of billions of PV per day, millions of QPS during Double‑11, millions of TPS updates, sub‑millisecond latency, 99.999% availability). The engine separates indexing and online services, supports parallel processing, and provides millisecond‑level real‑time indexing.
In November 2022 the engine was open‑sourced under Apache 2.0, attracting many enterprises; for example, Zuoyebang reduced compute resources by 50% after switching.
Semantic Search Stage
Introduced NLP‑based semantic search, supporting industry‑level and scenario‑level model customization. Users can upload data to automatically train tokenizers and ranking models without heavy engineering effort.
Large‑Model‑Based Search Stage
Explored vector‑mixed retrieval, multimodal retrieval, Agentic RAG, and Graph RAG. Both open‑source and self‑developed products advance large‑model search applications.
Agentic RAG Technology Evolution
RAG (Retrieval‑Augmented Generation) combines retrieval and generation. Its evolution spans four stages: Naive RAG, Advanced RAG, Modular RAG, and Agentic RAG.
Naive RAG
Launched after ChatGPT’s rise in early 2023. It simply places a large model behind the search system for basic document parsing and retrieval, but performance is limited and it is unsuitable for production.
Advanced RAG
Optimized document parsing for PDFs, PPTs, etc., with multi‑dimensional slicing and added ReAct for reasoning, making it usable in less strict scenarios.
Modular RAG
Split services (parsing, slicing, indexing) into atomic APIs, allowing customers to pick needed modules via API, improving flexibility.
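The "atomic API" idea can be sketched as independently callable stages that customers chain as needed. All function names here are illustrative, not the actual OpenSearch API surface.

```python
# Minimal sketch of Modular RAG's atomic-API idea: each stage (parsing,
# slicing, indexing) is an independent function, and a customer composes
# only the modules they need. All names are invented for illustration.

def parse(raw: str) -> str:
    return raw.strip()

def slice_text(doc: str, size: int = 20) -> list[str]:
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def index(chunks: list[str]) -> dict[int, str]:
    return dict(enumerate(chunks))

def compose(*stages):
    """Chain the selected atomic stages into one pipeline."""
    def pipeline(data):
        for stage in stages:
            data = stage(data)
        return data
    return pipeline

# A customer who already has clean text can skip `parse` entirely:
ingest = compose(slice_text, index)
store = ingest("Agentic RAG splits services into atomic APIs.")
```

The point is the composition, not the toy stages: each module stays replaceable behind a stable interface.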
Agentic RAG
Introduced in H2 2024 to solve multi‑step (multi‑hop) questions. Initially a single‑Agent system handled planning, decomposition, execution, and generation, but struggled with quality. It was refactored into multiple specialized Agents (Planning, Search, DB, Graph, Clarification), forming Agentic RAG 2.0 (DeepSearch).
Agentic RAG 1.0 Architecture and Evaluation
The architecture merges planning and generation into one model. It improves multi‑hop question answering, achieving ~20% higher recall and ~11% higher answer rate on HotpotQA, and 85‑120% recall boost on Musique.
Agentic RAG 2.0 Improvements and Advantages
Key upgrades:
Split the single Agent into specialized Agents (Planning, Search, DB, Graph, Clarification).
Added database, graph, and web search retrieval paths, creating a multi‑route architecture that integrates vector, database, graph, and online data.
Benefits include higher efficiency per task, richer data sources, and more accurate answers, though gains over 1.0 are modest for simple queries.
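The planning-and-routing loop described above can be sketched in a few lines. The agent names, routing rules, and decomposition below are invented for illustration; in the real system a large model does the planning.

```python
# Hedged sketch of Agentic RAG 2.0: a planning agent decomposes a
# multi-hop question and routes each sub-task to a specialized agent
# (search / database / graph). The toy planner below is hard-coded;
# the real system plans with a large model.

def search_agent(task):
    return f"docs for: {task}"

def db_agent(task):
    return f"rows for: {task}"

def graph_agent(task):
    return f"triples for: {task}"

ROUTES = {"search": search_agent, "db": db_agent, "graph": graph_agent}

def planning_agent(question):
    """Emit (route, sub-task) pairs, as a real planner model would."""
    return [("search", question), ("graph", f"entities in '{question}'")]

def answer(question):
    evidence = [ROUTES[route](task) for route, task in planning_agent(question)]
    return " | ".join(evidence)   # a generation agent would synthesize this

result = answer("Who founded the company that created Qwen?")
```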
MCP Protocol
To unify model‑engine calls, the Model Context Protocol (MCP) was adopted, standardizing interactions across different large‑model vendors and enabling seamless engine integration.
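MCP frames requests as JSON-RPC 2.0 messages; a tool invocation can be sketched as below. The tool name and arguments are illustrative, not a real Alibaba Cloud endpoint.

```python
import json

# Sketch of an MCP-style tool invocation. MCP uses JSON-RPC 2.0 framing;
# the tool name "vector_search" and its arguments are invented examples.

def make_tool_call(request_id, tool, arguments):
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }

msg = make_tool_call(1, "vector_search", {"query": "Double-11 QPS", "top_k": 5})
wire = json.dumps(msg)  # what actually travels over the transport
```

Because every vendor speaks the same envelope, swapping the model behind the engine does not change the call shape.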
Graph RAG
Graph RAG is a path parallel to Agentic RAG for multi‑hop problems. It builds a knowledge graph offline, storing entity triples in a vector store for fast online retrieval, and excels when the document corpus is small and static.
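The offline/online split can be illustrated with a toy triple store and a two-hop walk. The triples here are invented examples; a production system would extract them with a model and index them in a vector store.

```python
from collections import defaultdict

# Toy sketch of Graph RAG's offline/online split: triples are extracted
# and indexed offline, then multi-hop questions are answered by walking
# the graph online. The triples below are invented examples.

TRIPLES = [
    ("Alibaba Cloud", "develops", "OpenSearch"),
    ("OpenSearch", "supports", "vector retrieval"),
]

def build_graph(triples):
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph

def two_hop(graph, entity):
    """Follow edges twice -- the kind of multi-hop step Graph RAG answers."""
    hops = []
    for rel1, mid in graph.get(entity, []):
        for rel2, tail in graph.get(mid, []):
            hops.append((entity, rel1, mid, rel2, tail))
    return hops

paths = two_hop(build_graph(TRIPLES), "Alibaba Cloud")
```

A two-hop question like "what does the engine developed by Alibaba Cloud support?" resolves in one graph walk instead of two retrieval rounds.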
Multimodal Search – Text‑to‑Video
Applies multimodal models to video search (e.g., short‑video platforms). Workflow:
Metadata (title, description, tags) indexed in a text engine.
Video split into streams; VL model generates textual descriptions; subject detection extracts key objects; multimodal vectors are stored in a vector engine.
Search combines text and multimodal vectors, re‑ranks by CTR, and returns results. A planning Agent parses user queries.
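The fusion step above can be sketched as a weighted blend of lexical and vector scores with CTR as a re-ranking signal. The weights, candidate records, and tie-breaking rule are illustrative assumptions.

```python
# Sketch of the fusion step in text-to-video search: blend lexical and
# multimodal-vector scores, then use predicted CTR to re-rank. Weights
# and candidate records are invented for illustration.

def fuse(candidates, w_text=0.5, w_vec=0.5):
    """candidates: dicts with text_score, vec_score, ctr in [0, 1]."""
    for c in candidates:
        c["score"] = w_text * c["text_score"] + w_vec * c["vec_score"]
    # final ordering: blended relevance first, predicted CTR as tiebreaker
    return sorted(candidates, key=lambda c: (c["score"], c["ctr"]), reverse=True)

videos = [
    {"id": "v1", "text_score": 0.9, "vec_score": 0.2, "ctr": 0.01},
    {"id": "v2", "text_score": 0.5, "vec_score": 0.7, "ctr": 0.05},
]
ranked = fuse(videos)
```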
Alibaba Cloud AI Search Proprietary Large Model
Multiple specialized agents built on large pre‑trained models, covering document parsing, multimodal vectorization, planning, NL2SQL, reranking, and RAG generation. Models are continuously fine‑tuned for search scenarios.
Model Optimization and Vector Dimensionality Reduction
Dimensionality reduction (e.g., 1024→512) cuts compute cost while preserving performance. Reranker models based on Qwen achieve superior results over competing models.
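One common way to realize such a reduction is Matryoshka-style truncation: keep the leading dimensions and re-normalize. Whether Alibaba Cloud's 1024→512 reduction works exactly this way is an assumption; a learned projection is another common choice.

```python
import math

# Hedged sketch of Matryoshka-style dimensionality reduction: keep the
# leading dimensions of an embedding and re-normalize to unit length so
# cosine similarity remains comparable. The 1024-dim input is a stand-in.

def truncate_embedding(vec, dims):
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]   # unit length preserves cosine geometry

full = [0.03] * 1024                  # stand-in for a 1024-dim embedding
small = truncate_embedding(full, 512) # half the storage and half the compute
```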
Vector Retrieval Breakthroughs
Better Binary Quantization (BBQ) compresses each dimension of a 1024‑dim float vector to a single bit, reducing storage from 37 TB to 9 TB at a scale of 100 billion vectors, with a top‑2,500 float re‑ranking pass to recover recall.
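The core of 1-bit quantization can be shown in a few lines: keep only the sign of each dimension and compare with Hamming distance. This is a toy illustration of the principle, not BBQ's actual codebook or re-ranking pipeline.

```python
# Toy sketch of 1-bit ("binary") quantization: keep only the sign of
# each dimension, then compare vectors with Hamming distance. 1024
# float32 dims (4096 bytes) shrink to 1024 bits (128 bytes), a 32x
# reduction; a float re-rank over top candidates restores recall.

def binarize(vec):
    """Pack the sign bits of a float vector into one Python int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")   # differing bits approximate angular distance

q  = binarize([0.3, -0.1, 0.7, -0.2])
d1 = binarize([0.2, -0.4, 0.5, -0.1])   # same sign pattern as q
d2 = binarize([-0.3, 0.1, -0.7, 0.2])   # opposite sign pattern
close = hamming(q, d1) < hamming(q, d2)
```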
GPU‑Accelerated Retrieval
GPU servers (e.g., A10, A100, H100) dramatically speed up index building (20×‑30×) and query throughput, though high QPS is needed to justify GPU cost in serving.
Self‑Developed Vector Store GPU Acceleration
Implemented heterogeneous GPU layers: T4 for 3‑6× query speedup, A100/A800/H100 for 30‑60×. Optimized IVF‑PQ indexing, storage‑compute integration, and dynamic load balancing.
Product Deployment
Two product forms:
Low‑code: platform configuration enables AI search (RAG) with data‑source connectors (OSS, HDFS, databases); users deploy via the UI.
High‑code: core APIs are exposed for custom integration (Python, Java, LangChain), suited to developers who need flexibility.
Core integrations include deep Elasticsearch/OpenSearch coupling, AI Assistant for index diagnostics, and OpenSearch intelligent Q&A with multimodal capabilities.
Future Development Directions
Deep integration of Agents and Search : Advance Deep Search and multi‑Agent architectures for complex scenarios.
Infrastructure Optimization : Leverage GPU acceleration and vector quantization.
Big Data Fusion : Seamless integration with big‑data platforms.
Open‑source Ecosystem Expansion : Support LangChain, LlamaIndex, DeepSeek, and standardize MCP.
Q&A Highlights
Q1: Should we prioritize performance‑best models (Claude 3.5, Tongyi Qianwen Plus) over cost in Agent planning? A: Use the best‑performing models first to validate effectiveness; cost optimization comes later.
Q2: Is Alibaba Cloud’s PDF parsing based on traditional tools or models like CoPPa? A: Initially traditional parsers with engineering rules; visual models were tried but discarded due to latency.
Q3: Are multi‑Agent and Graph RAG both ways to solve multi‑hop problems? A: Yes; Graph RAG works for small, static corpora, while multi‑Agent is preferred for large, dynamic data.
Q4: Differences between Opensearch NL2SQL and Chat to DB? A: Chat to DB focuses on precise SQL generation for relational databases; Opensearch NL2SQL is a universal NL‑to‑DSL converter supporting ES, OpenSearch, graph queries, and various SQL dialects.
Q5: Key roles in AI search project delivery? A: Data engineers (data ingestion), algorithm engineers (model tuning), technical support, data analysts, and product managers coordinate to ensure end‑to‑end delivery.
Q6: Handling special requirements such as sentiment monitoring? A: Some domains (e.g., medical) demand zero error, which is currently unrealistic for AI models.
Q7: Choosing large models based on data complexity vs. business needs? A: Prioritize business‑driven requirements (tolerance, hallucination risk) over data structure complexity; model fine‑tuning focuses on reducing hallucinations.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.