How Alibaba Cloud’s AI Search Evolves with Agentic RAG and Multi‑Model Innovations
This article details Alibaba Cloud AI Search’s development journey, covering its dual product lines, the evolution of Agentic RAG technology, multi‑agent architectures, vector retrieval breakthroughs, GPU‑accelerated indexing, NL2SQL capabilities, deployment models, and future directions for AI‑driven search solutions.
Introduction
This talk by Xing Shaomin, head of Alibaba Cloud AI Search, reviews the R&D history of Alibaba Cloud AI Search, the key technologies behind Agentic RAG, product deployment models, and future directions, showcasing Alibaba Cloud’s innovation and outlook in AI search.
Table of Contents
1. Introduction to Alibaba Cloud AI Search
2. Agentic RAG Key Technologies
3. Agentic RAG Product Deployment
4. Future Development Directions
Alibaba Cloud AI Search Overview
Alibaba Cloud AI Search offers two main product lines: the open‑source Elasticsearch line and the self‑developed OpenSearch line, which complement each other to provide comprehensive, multi‑layered search solutions for enterprises.
Open-source Elasticsearch Product Line
In 2018 Alibaba Cloud partnered with Elastic to host Elasticsearch on its platform, adding enhancements such as the Indexing Service that separates write and query operations, improving concurrency and query performance. OSS is used as storage to reduce costs, with caching to mitigate latency. The service has evolved to a serverless architecture with high‑performance read‑write separation and intelligent scaling. In the AI search era, Elasticsearch adds vector retrieval, LLM‑plus‑search, RAG Q&A, and AI Assistant capabilities.
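To make the vector-retrieval capability concrete, here is a minimal sketch of an Elasticsearch 8.x hybrid search body that combines classic full-text matching with approximate kNN over a dense-vector field. The index and field names (`docs` is implied, `title`, `embedding`) are illustrative assumptions, not the actual Alibaba Cloud schema.

```python
# Sketch of an Elasticsearch 8.x hybrid query body: BM25 text matching
# plus approximate kNN vector retrieval in a single request. Field names
# ("title", "embedding") are illustrative only.

def build_hybrid_query(text, query_vector, k=10, num_candidates=100):
    """Return an ES search body that blends lexical scoring with kNN."""
    return {
        "query": {"match": {"title": text}},   # lexical recall path
        "knn": {
            "field": "embedding",              # dense_vector field
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,  # candidates gathered per shard
        },
        "size": k,
    }

body = build_hybrid_query("serverless indexing", [0.1] * 4, k=5)
```

Such a body would be passed to the Elasticsearch `_search` endpoint; the cluster fuses both score streams before returning the top `size` hits.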
Self‑developed OpenSearch Product Line
The OpenSearch line has progressed through three stages:
High‑Performance Search Engine (2008‑2020)
Built a C++‑based engine to meet Alibaba Group’s massive traffic (hundreds of billions of PV per day, millions of QPS during Double‑11, millions of TPS updates, sub‑millisecond latency, 99.999% availability). The engine separates indexing and online services, supports parallel processing, and provides millisecond‑level real‑time indexing.
In November 2022 the engine was open‑sourced under Apache 2.0, attracting many enterprises; for example, Zuoyebang reduced compute resources by 50% after switching.
Semantic Search Stage
Introduced NLP‑based semantic search, supporting industry‑level and scenario‑level model customization. Users can upload data to automatically train tokenizers and ranking models without heavy engineering effort.
Large‑Model‑Based Search Stage
Explored vector‑mixed retrieval, multimodal retrieval, Agentic RAG, and Graph RAG. Both open‑source and self‑developed products advance large‑model search applications.
Agentic RAG Technology Evolution
RAG (Retrieval‑Augmented Generation) combines retrieval and generation. Its evolution spans four stages: Naive RAG, Advanced RAG, Modular RAG, and Agentic RAG.
Naive RAG
Launched after ChatGPT’s rise in early 2023. It simply places a large model behind the search system for basic document parsing and retrieval, but performance is limited and it is unsuitable for production.
Advanced RAG
Optimized document parsing for PDFs, PPTs, etc., with multi‑dimensional slicing and added ReAct for reasoning, making it usable in less strict scenarios.
Modular RAG
Split services (parsing, slicing, indexing) into atomic APIs, allowing customers to pick needed modules via API, improving flexibility.
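The "atomic API" idea can be sketched as independently callable stages that customers chain as needed. All function names here are illustrative, not the actual OpenSearch API surface.

```python
# Minimal sketch of Modular RAG's atomic-API idea: each stage (parsing,
# slicing, indexing) is an independent function, and a customer composes
# only the modules they need. All names are invented for illustration.

def parse(raw: str) -> str:
    return raw.strip()

def slice_text(doc: str, size: int = 20) -> list[str]:
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def index(chunks: list[str]) -> dict[int, str]:
    return dict(enumerate(chunks))

def compose(*stages):
    """Chain the selected atomic stages into one pipeline."""
    def pipeline(data):
        for stage in stages:
            data = stage(data)
        return data
    return pipeline

# A customer who already has clean text can skip `parse` entirely:
ingest = compose(slice_text, index)
store = ingest("Agentic RAG splits services into atomic APIs.")
```

The point is the composition, not the toy stages: each module stays replaceable behind a stable interface.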
Agentic RAG
Introduced in H2 2024 to solve multi‑step (multi‑hop) questions. Initially a single‑Agent system handled planning, decomposition, execution, and generation, but struggled with quality. It was refactored into multiple specialized Agents (Planning, Search, DB, Graph, Clarification), forming Agentic RAG 2.0 (DeepSearch).
Agentic RAG 1.0 Architecture and Evaluation
The architecture merges planning and generation into one model. It improves multi‑hop question answering, achieving ~20% higher recall and ~11% higher answer rate on HotpotQA, and 85‑120% recall boost on Musique.
Agentic RAG 2.0 Improvements and Advantages
Key upgrades:
Split the single Agent into specialized Agents (Planning, Search, DB, Graph, Clarification).
Added database, graph, and web search retrieval paths, creating a multi‑route architecture that integrates vector, database, graph, and online data.
Benefits include higher efficiency per task, richer data sources, and more accurate answers, though gains over 1.0 are modest for simple queries.
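The planning-and-routing loop described above can be sketched in a few lines. The agent names, routing rules, and decomposition below are invented for illustration; in the real system a large model does the planning.

```python
# Hedged sketch of Agentic RAG 2.0: a planning agent decomposes a
# multi-hop question and routes each sub-task to a specialized agent
# (search / database / graph). The toy planner below is hard-coded;
# the real system plans with a large model.

def search_agent(task):
    return f"docs for: {task}"

def db_agent(task):
    return f"rows for: {task}"

def graph_agent(task):
    return f"triples for: {task}"

ROUTES = {"search": search_agent, "db": db_agent, "graph": graph_agent}

def planning_agent(question):
    """Emit (route, sub-task) pairs, as a real planner model would."""
    return [("search", question), ("graph", f"entities in '{question}'")]

def answer(question):
    evidence = [ROUTES[route](task) for route, task in planning_agent(question)]
    return " | ".join(evidence)   # a generation agent would synthesize this

result = answer("Who founded the company that created Qwen?")
```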
MCP Protocol
To unify model‑engine calls, the Model Context Protocol (MCP) was adopted, standardizing interactions across different large‑model vendors and enabling seamless engine integration.
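MCP frames requests as JSON-RPC 2.0 messages; a tool invocation can be sketched as below. The tool name and arguments are illustrative, not a real Alibaba Cloud endpoint.

```python
import json

# Sketch of an MCP-style tool invocation. MCP uses JSON-RPC 2.0 framing;
# the tool name "vector_search" and its arguments are invented examples.

def make_tool_call(request_id, tool, arguments):
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }

msg = make_tool_call(1, "vector_search", {"query": "Double-11 QPS", "top_k": 5})
wire = json.dumps(msg)  # what actually travels over the transport
```

Because every vendor speaks the same envelope, swapping the model behind the engine does not change the call shape.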
Graph RAG
Graph RAG is a path parallel to Agentic RAG for multi‑hop problems. It builds a knowledge graph offline, storing entity triples in a vector store for fast online retrieval, and excels when the document corpus is small and static.
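The offline/online split can be illustrated with a toy triple store and a two-hop walk. The triples here are invented examples; a production system would extract them with a model and index them in a vector store.

```python
from collections import defaultdict

# Toy sketch of Graph RAG's offline/online split: triples are extracted
# and indexed offline, then multi-hop questions are answered by walking
# the graph online. The triples below are invented examples.

TRIPLES = [
    ("Alibaba Cloud", "develops", "OpenSearch"),
    ("OpenSearch", "supports", "vector retrieval"),
]

def build_graph(triples):
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph

def two_hop(graph, entity):
    """Follow edges twice -- the kind of multi-hop step Graph RAG answers."""
    hops = []
    for rel1, mid in graph.get(entity, []):
        for rel2, tail in graph.get(mid, []):
            hops.append((entity, rel1, mid, rel2, tail))
    return hops

paths = two_hop(build_graph(TRIPLES), "Alibaba Cloud")
```

A two-hop question like "what does the engine developed by Alibaba Cloud support?" resolves in one graph walk instead of two retrieval rounds.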
Multimodal Search – Text‑to‑Video
Applies multimodal models to video search (e.g., short‑video platforms). Workflow:
Metadata (title, description, tags) indexed in a text engine.
Video split into streams; VL model generates textual descriptions; subject detection extracts key objects; multimodal vectors are stored in a vector engine.
Search combines text and multimodal vectors, re‑ranks by CTR, and returns results. A planning Agent parses user queries.
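The fusion step above can be sketched as a weighted blend of lexical and vector scores with CTR as a re-ranking signal. The weights, candidate records, and tie-breaking rule are illustrative assumptions.

```python
# Sketch of the fusion step in text-to-video search: blend lexical and
# multimodal-vector scores, then use predicted CTR to re-rank. Weights
# and candidate records are invented for illustration.

def fuse(candidates, w_text=0.5, w_vec=0.5):
    """candidates: dicts with text_score, vec_score, ctr in [0, 1]."""
    for c in candidates:
        c["score"] = w_text * c["text_score"] + w_vec * c["vec_score"]
    # final ordering: blended relevance first, predicted CTR as tiebreaker
    return sorted(candidates, key=lambda c: (c["score"], c["ctr"]), reverse=True)

videos = [
    {"id": "v1", "text_score": 0.9, "vec_score": 0.2, "ctr": 0.01},
    {"id": "v2", "text_score": 0.5, "vec_score": 0.7, "ctr": 0.05},
]
ranked = fuse(videos)
```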
Alibaba Cloud AI Search Proprietary Large Model
Multiple specialized agents built on large pre‑trained models, covering document parsing, multimodal vectorization, planning, NL2SQL, reranking, and RAG generation. Models are continuously fine‑tuned for search scenarios.
Model Optimization and Vector Dimensionality Reduction
Dimensionality reduction (e.g., 1024→512) cuts compute cost while preserving performance. Reranker models based on Qwen achieve superior results over competing models.
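One common way to realize such a reduction is Matryoshka-style truncation: keep the leading dimensions and re-normalize. Whether Alibaba Cloud's 1024→512 reduction works exactly this way is an assumption; a learned projection is another common choice.

```python
import math

# Hedged sketch of Matryoshka-style dimensionality reduction: keep the
# leading dimensions of an embedding and re-normalize to unit length so
# cosine similarity remains comparable. The 1024-dim input is a stand-in.

def truncate_embedding(vec, dims):
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]   # unit length preserves cosine geometry

full = [0.03] * 1024                  # stand-in for a 1024-dim embedding
small = truncate_embedding(full, 512) # half the storage and half the compute
```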
Vector Retrieval Breakthroughs
Better Binary Quantization (BBQ) compresses each dimension of a 1024‑dim float vector to a single bit, reducing storage from 37 TB to 9 TB at a scale of 100 billion vectors, with a top‑2,500 float re‑ranking pass to recover recall.
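The core of 1-bit quantization can be shown in a few lines: keep only the sign of each dimension and compare with Hamming distance. This is a toy illustration of the principle, not BBQ's actual codebook or re-ranking pipeline.

```python
# Toy sketch of 1-bit ("binary") quantization: keep only the sign of
# each dimension, then compare vectors with Hamming distance. 1024
# float32 dims (4096 bytes) shrink to 1024 bits (128 bytes), a 32x
# reduction; a float re-rank over top candidates restores recall.

def binarize(vec):
    """Pack the sign bits of a float vector into one Python int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")   # differing bits approximate angular distance

q  = binarize([0.3, -0.1, 0.7, -0.2])
d1 = binarize([0.2, -0.4, 0.5, -0.1])   # same sign pattern as q
d2 = binarize([-0.3, 0.1, -0.7, 0.2])   # opposite sign pattern
close = hamming(q, d1) < hamming(q, d2)
```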
GPU‑Accelerated Retrieval
GPU servers (e.g., A10, A100, H100) dramatically speed up index building (20×‑30×) and query throughput, though high QPS is needed to justify GPU cost in serving.
Self‑Developed Vector Store GPU Acceleration
Implemented heterogeneous GPU layers: T4 for 3‑6× query speedup, A100/A800/H100 for 30‑60×. Optimized IVF‑PQ indexing, storage‑compute integration, and dynamic load balancing.
Product Deployment
Two product forms:
Low‑code: platform configuration enables AI search (RAG) with data‑source connectors (OSS, HDFS, databases); users deploy via the UI.
High‑code: core APIs are exposed for custom integration (Python, Java, LangChain), suited to developers who need flexibility.
Core integrations include deep Elasticsearch/OpenSearch coupling, AI Assistant for index diagnostics, and OpenSearch intelligent Q&A with multimodal capabilities.
Future Development Directions
Deep integration of Agents and Search : Advance Deep Search and multi‑Agent architectures for complex scenarios.
Infrastructure Optimization : Leverage GPU acceleration and vector quantization.
Big Data Fusion : Seamless integration with big‑data platforms.
Open‑source Ecosystem Expansion : Support LangChain, LlamaIndex, DeepSeek, and standardize MCP.
Q&A Highlights
Q1: Should we prioritize performance‑best models (Claude 3.5, Tongyi Qianwen Plus) over cost in Agent planning? A: Use the best‑performing models first to validate effectiveness; cost optimization comes later.
Q2: Is Alibaba Cloud’s PDF parsing based on traditional tools or models like CoPPa? A: Initially traditional parsers with engineering rules; visual models were tried but discarded due to latency.
Q3: Are multi‑Agent and Graph RAG both ways to solve multi‑hop problems? A: Yes; Graph RAG works for small, static corpora, while multi‑Agent is preferred for large, dynamic data.
Q4: Differences between Opensearch NL2SQL and Chat to DB? A: Chat to DB focuses on precise SQL generation for relational databases; Opensearch NL2SQL is a universal NL‑to‑DSL converter supporting ES, OpenSearch, graph queries, and various SQL dialects.
Q5: Key roles in AI search project delivery? A: Data engineers (data ingestion), algorithm engineers (model tuning), technical support, data analysts, and product managers coordinate to ensure end‑to‑end delivery.
Q6: Handling special requirements such as sentiment monitoring? A: Some domains (e.g., medical) demand zero error, which is currently unrealistic for AI models.
Q7: Choosing large models based on data complexity vs. business needs? A: Prioritize business‑driven requirements (tolerance, hallucination risk) over data structure complexity; model fine‑tuning focuses on reducing hallucinations.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.