How Alibaba Cloud OpenSearch Powers RAG: Insights from AICon 2024

In this talk, Alibaba Cloud's OpenSearch RAG team shares their year‑long journey of building retrieval‑augmented generation systems, covering data parsing, slicing, vectorization, hybrid retrieval, model fine‑tuning, performance optimizations, cost reduction, and future directions such as multimodal queries and agents.

Alibaba Cloud Big Data AI Platform

Introduction

On May 18, 2024, Xing Shaomin, head of OpenSearch R&D at Alibaba, presented "OpenSearch RAG Application Practice" at AICon, describing the company's one‑year exploration of Retrieval‑Augmented Generation (RAG) on Alibaba Cloud.

Background of RAG

Human‑machine dialogue has evolved since the Turing test, with milestones like IBM Watson (2011) and ChatGPT. Large models now enable commercial‑grade dialogue, but generic models lack private enterprise knowledge, leading to the rise of RAG for vertical domains.

Technical Solution

RAG at OpenSearch consists of three pipelines:

Offline pipeline: parse various document formats (text, image, table, code), slice them, and build indexes.

Online pipeline: hybrid retrieval (dense + sparse vectors), re‑ranking, and LLM generation.

Model fine‑tuning: use internal and authorized customer data to SFT models (primarily Tongyi Qianwen, also LLaMA), achieving GPT‑3.5‑level performance with 1B models and GPT‑4‑level performance in specific scenarios.
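The offline and online pipelines can be sketched end to end. This is a minimal toy illustration, not OpenSearch's implementation: the parser, retriever, and generator below are stand-in functions, and all names (`Chunk`, `offline_pipeline`, `retrieve`, `generate`) are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str

def offline_pipeline(docs: dict[str, str], chunk_size: int = 40) -> list[Chunk]:
    """Toy offline stage: 'parse' each document and slice it into fixed-size chunks."""
    chunks = []
    for doc_id, text in docs.items():
        for i in range(0, len(text), chunk_size):
            chunks.append(Chunk(doc_id, text[i:i + chunk_size]))
    return chunks

def retrieve(query: str, index: list[Chunk], k: int = 2) -> list[Chunk]:
    """Toy stand-in for hybrid retrieval: rank chunks by keyword overlap with the query."""
    q = set(query.lower().split())
    return sorted(index, key=lambda c: -len(q & set(c.text.lower().split())))[:k]

def generate(query: str, context: list[Chunk]) -> str:
    """Stand-in for the LLM call: stitch the retrieved context into an 'answer'."""
    return f"Q: {query} | context: " + " / ".join(c.text for c in context)

docs = {"faq": "OpenSearch supports hybrid retrieval with dense and sparse vectors.",
        "guide": "Fine-tuning uses internal and customer-authorized data."}
index = offline_pipeline(docs)
answer = generate("what retrieval does OpenSearch support",
                  retrieve("hybrid retrieval vectors", index))
```

In the real system each stand-in is replaced by a full component: a document parser, a dense+sparse retriever with re-ranking, and a fine-tuned LLM.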

Key Challenges

Accuracy requirements (often 100% in core scenarios), latency (target 1‑3 seconds), GPU cost, and privacy/security are the main obstacles. High‑precision answers demand accurate document parsing, reliable retrieval, and low‑hallucination LLMs.

Data Parsing and Slicing

Parsing diverse formats (PDF, Word, PPT, JSON, tables, images) is critical; tables and complex diagrams are especially difficult. After parsing, documents are structured into a tree, then sliced at coarse (paragraph) and fine (token or sentence) granularity, ensuring semantic completeness.
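The two slicing granularities above can be sketched with plain string handling. This is an illustrative simplification (a real pipeline slices a parsed document tree and may use token counts rather than sentences); both function names are invented for the example.

```python
import re

def coarse_slices(doc: str) -> list[str]:
    """Coarse granularity: split on blank lines, treating each paragraph as a slice."""
    return [p.strip() for p in doc.split("\n\n") if p.strip()]

def fine_slices(paragraph: str) -> list[str]:
    """Fine granularity: split a paragraph at sentence boundaries so each
    slice stays semantically complete."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

doc = "RAG needs parsing. Tables are hard.\n\nSlicing preserves meaning."
paras = coarse_slices(doc)
sents = fine_slices(paras[0])
```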

Vectorization

The goal is to achieve high‑quality embeddings with smaller models (e.g., 1B model matching 7B performance) because vectorization traffic far exceeds LLM inference traffic. Techniques include model quantization, caching, and multi‑card parallelism.
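Because the same chunks and queries are embedded repeatedly, caching is one of the cheapest wins. Below is a minimal sketch using Python's `functools.lru_cache`; the hash-based "embedding" is a deterministic toy stand-in for a real (possibly quantized) embedding model.

```python
from functools import lru_cache
import hashlib

@lru_cache(maxsize=10_000)
def embed(text: str) -> tuple[float, ...]:
    """Toy deterministic 'embedding' (stand-in for a small quantized model).
    lru_cache returns the stored vector for repeated inputs instead of recomputing."""
    digest = hashlib.sha256(text.encode()).digest()
    return tuple(b / 255.0 for b in digest[:8])

v1 = embed("hybrid retrieval")
v2 = embed("hybrid retrieval")        # second call is served from the cache
hits = embed.cache_info().hits        # number of cache hits so far
```

In production the cache would typically be an external store keyed by content hash, and quantization plus multi-card batching would handle the cache misses.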

Query Understanding

Short user queries are expanded via intent detection, coreference resolution, and semantic augmentation before retrieval, reducing latency and cost compared to sending raw queries to the LLM.
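A toy version of that expansion step might look like the following. The coreference and augmentation rules here are deliberately naive keyword heuristics (the real system uses models for both), and the synonym table is invented for the example.

```python
def expand_query(query: str, history: list[str],
                 synonyms: dict[str, list[str]]) -> str:
    """Toy query understanding: resolve a dangling pronoun against the last
    dialogue turn, then append synonyms so sparse retrieval matches more keywords."""
    q = query
    if history and any(p in q.lower().split() for p in ("it", "that", "this")):
        q = f"{q} (context: {history[-1]})"
    extra = [alt for word in q.lower().split() for alt in synonyms.get(word, [])]
    return q if not extra else q + " " + " ".join(extra)

expanded = expand_query("how do I scale it",
                        ["OpenSearch cluster setup"],
                        {"scale": ["resize", "expand"]})
```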

NL2SQL / NL2OpenSearch

Intent classification distinguishes analysis queries (handled by NL2SQL or NL2OpenSearch) from pure Q&A, enabling direct database or OpenSearch DSL queries.
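The routing decision can be sketched as below. The keyword heuristic stands in for a trained intent classifier, and the NL2SQL stub returns one fixed query for illustration only; it is not a real translator.

```python
ANALYSIS_HINTS = ("average", "count", "sum", "group by", "top 10", "trend")

def route(query: str) -> str:
    """Toy intent classifier: analysis-style queries go to NL2SQL,
    everything else to retrieval-based Q&A."""
    q = query.lower()
    return "nl2sql" if any(h in q for h in ANALYSIS_HINTS) else "qa"

def to_sql(query: str) -> str:
    """Illustrative NL2SQL stub covering a single hard-coded pattern."""
    return "SELECT COUNT(*) FROM orders WHERE status = 'failed'"
```

Queries routed to "nl2sql" would be translated into SQL or OpenSearch DSL and executed directly, bypassing the RAG retrieval path.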

Hybrid Retrieval and Re‑ranking

Dense vectors handle fuzzy semantic matching, while sparse vectors ensure precise keyword matches. A re‑ranking model (e.g., bge‑reranker) improves recall by ~20% and answer accuracy by ~12.5%.
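One common way to merge the dense and sparse result lists is Reciprocal Rank Fusion; the talk does not specify OpenSearch's exact fusion method, so RRF is an assumption here, and the overlap-based reranker below is a toy stand-in for a cross-encoder such as bge-reranker.

```python
def rrf_fuse(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document,
    so documents ranked well by either retriever rise to the top."""
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def rerank(query: str, docs: list[str], top_n: int = 2) -> list[str]:
    """Toy reranker: score candidates by term overlap with the query."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:top_n]

fused = rrf_fuse(["d1", "d2", "d3"], ["d2", "d4"])   # d2 appears in both lists
```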

Performance Optimizations

The vector engine (VectorStore) was rebuilt on a new framework, reducing engineering overhead. Algorithmic improvements focus on HNSW graph construction and node traversal prediction, cutting the number of nodes traversed by up to 50% without increasing build cost. GPU‑accelerated graph search yields 3‑60× speedups on T4 to A100 cards.
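The idea of bounding traversal can be illustrated with a toy best-first search over a proximity graph under a hard visited-node budget. This is only a conceptual sketch; the actual HNSW optimizations (layered graphs, traversal prediction) are far more sophisticated, and the graph and distances below are invented.

```python
import heapq

def greedy_search(graph: dict[int, list[int]], dist: dict[int, float],
                  entry: int, budget: int) -> tuple[int, int]:
    """Best-first search toward the node with the smallest distance to the
    query, stopping once `budget` nodes have been touched. Returns the best
    node found and the number of nodes visited."""
    visited = {entry}
    frontier = [(dist[entry], entry)]
    best = entry
    while frontier and len(visited) < budget:
        d, node = heapq.heappop(frontier)
        if d < dist[best]:
            best = node
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier, (dist[nb], nb))
    return best, len(visited)

graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
dist = {0: 0.9, 1: 0.5, 2: 0.7, 3: 0.1}   # precomputed query distances
best, seen = greedy_search(graph, dist, entry=0, budget=10)
```

Tightening the budget trades a small recall loss for fewer nodes traversed, which is the trade-off the traversal-prediction work aims to make cheaply.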

Model Fine‑tuning and Evaluation

SFT uses a mix of open‑source, internal, and customer‑authorized data. Evaluation employs the Ragas framework, AB testing, and daily monitoring to detect regressions. Fine‑tuned models achieve hallucination rates as low as 4.3% (vs. 7.1% for GPT‑4) on internal test sets.
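A hallucination rate like the one above can be measured by checking whether each generated answer is supported by its retrieved context. The word-overlap check below is a deliberately crude stand-in for Ragas-style faithfulness scoring (which uses an LLM judge); the threshold and sample data are invented.

```python
def supported(sentence: str, context: str, threshold: float = 0.5) -> bool:
    """An answer counts as supported if enough of its words appear in the context."""
    words = set(sentence.lower().split())
    return len(words & set(context.lower().split())) / max(len(words), 1) >= threshold

def hallucination_rate(samples: list[tuple[str, str]]) -> float:
    """Fraction of (answer, context) pairs whose answer is unsupported."""
    bad = sum(1 for answer, ctx in samples if not supported(answer, ctx))
    return bad / len(samples)

samples = [
    ("opensearch supports hybrid retrieval",
     "opensearch supports hybrid retrieval and reranking"),   # grounded
    ("the model was trained on mars",
     "fine-tuning uses authorized customer data"),            # hallucinated
]
rate = hallucination_rate(samples)
```

Run daily over a fixed eval set, a metric like this gives the regression signal the talk describes alongside AB testing.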

Cost Reduction

Prompt engineering and SFT (full‑parameter or LoRA) are preferred over expensive pre‑training. LoRA enables serving many customer‑specific models on a single GPU, dropping monthly GPU cost from ~4000 CNY to ~100 CNY.
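The economics of serving many LoRA adapters on one GPU follow directly from the parameter counts: LoRA replaces a full weight update with a low-rank product W + B·A. A quick calculation (dimensions here are illustrative, roughly a 7B-class attention projection):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Trainable parameters for a full fine-tuned weight matrix vs. a LoRA
    update W + B @ A, where A is (rank x d_in) and B is (d_out x rank)."""
    full = d_in * d_out
    lora = rank * d_in + d_out * rank
    return full, lora

full, lora = lora_params(4096, 4096, rank=8)
ratio = full // lora   # full fine-tune stores ~256x more parameters per layer
```

Because each customer's adapter is this small, dozens of customer-specific models can share one base model's GPU memory, which is what collapses the per-customer GPU cost.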

Privacy & Security

Input and output filtering integrates with Alibaba's GreenNet service to ensure data privacy.

Future Directions

Expanding multimodal capabilities beyond text and images to include voice, video, and richer visual inputs; developing agent‑based workflows for task‑oriented scenarios (e.g., automated ECS provisioning, fault diagnosis); and exploring long‑context RAG to balance knowledge‑base vs. knowledge‑parameter approaches.

The vision is to evolve OpenSearch RAG into a full development platform with unified data connectors (MaxCompute, Hologres, HDFS), open‑source engines (Havenask, Elasticsearch), and orchestration layers (LangChain, LlamaIndex).

Tags: LLM, RAG, OpenSearch, AI Search, Hybrid Retrieval, Model Fine‑tuning
Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
