
How to Build a Retrieval‑Augmented LLM Knowledge Base on Alibaba Cloud

This guide details a complete end‑to‑end solution for constructing a large‑language‑model knowledge‑base chatbot on Alibaba Cloud, covering background, modular architecture, vector database selection, text preprocessing, embedding models, LLM fine‑tuning, prompt engineering, deployment with PAI‑EAS and BladeLLM, and real‑world results.

Alibaba Cloud Big Data AI Platform

Oct 19, 2023

Background

Large language models such as ChatGPT and Tongyi Qianwen excel at natural language processing, but their answers can be factually wrong or out of date. Without external knowledge, they are therefore unsuitable for precise customer‑service or knowledge‑base Q&A.

Modular Architecture

The solution follows a modular pipeline: text processing, embedding generation, vector‑search database, LLM instruction fine‑tuning, prompt engineering, and inference deployment.

Vector Search Database Selection

Cloud Database Options

Hologres: Alibaba Cloud’s real‑time data warehouse with an integrated Proxima vector engine, supporting high‑throughput, low‑latency queries for large knowledge bases.

Elasticsearch: Fully managed Elasticsearch service with X‑Pack features, suitable for log analysis and multi‑dimensional queries.

AnalyticDB PostgreSQL: Cloud‑native MPP data warehouse compatible with ANSI SQL and PostgreSQL.

Local Database Options

Faiss: Facebook AI Similarity Search, an open‑source library for efficient similarity search on dense vectors.
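
To make concrete what any of these stores does at query time, here is a minimal sketch of exact (brute‑force) top‑k retrieval by cosine similarity. It is a plain‑Python stand‑in, not the API of Hologres, Elasticsearch, AnalyticDB, or Faiss; the function names `cosine` and `top_k` and the toy three‑dimensional vectors are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors (hypothetical); real embeddings have hundreds of dimensions
doc_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
hits = top_k([1.0, 0.1, 0.0], doc_vecs, k=2)
print(hits)
```

A linear scan like this costs one similarity computation per stored vector, which is why large knowledge bases usually rely on a dedicated vector engine with approximate indexes instead.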

Text Processing

Key steps include data cleaning, semantic chunking, and QA extraction. Cleaned documents are split into short chunks using CharacterTextSplitter from LangChain:

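from langchain.text_splitter import CharacterTextSplitter

# Illustrative sizes (assumptions); pass them by keyword, since the first
# positional parameter of CharacterTextSplitter is the separator.
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)

def split_documents(docs):
    """Split a list of LangChain Document objects into smaller chunks."""
    return text_splitter.split_documents(docs)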

After chunking, each chunk should receive a concise title or summary for indexing.
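
The effect of `chunk_size` and `chunk_overlap`, and the title step, can be sketched without LangChain. This is a simplified character‑window splitter, not LangChain's implementation (which splits on a separator first). The names `chunk_text` and `make_title`, and the idea of using the first few words as a title, are assumptions; in practice the title or summary would more likely come from the document's own headings or from an LLM.

```python
def chunk_text(text, chunk_size=40, chunk_overlap=10):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

def make_title(chunk, max_words=5):
    """Hypothetical placeholder title: the first few words of the chunk."""
    return " ".join(chunk.split()[:max_words])

text = "PAI-EAS is a model serving platform. " * 3
records = [{"title": make_title(c), "content": c} for c in chunk_text(text)]
for r in records:
    print(repr(r["title"]), len(r["content"]))
```

Storing the title next to the content lets the index match a query against either field.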

Embedding Models

Several open‑source models are recommended:

text2vec

A text‑vectorization toolkit that implements several representation models, including Word2Vec, BERT, and Sentence‑BERT. Repository: https://github.com/shibing624/text2vec

SGPT

GPT‑based sentence embeddings. Example code:

import torch
from transformers import AutoModel, AutoTokenizer
from scipy.spatial.distance import cosine

tokenizer = AutoTokenizer.from_pretrained("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")
model = AutoModel.from_pretrained("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")
model.eval()

queries = ["I'm searching for a planet not too far from Earth."]
docs = ["Neptune is the eighth ...", "TRAPPIST-1d ...", "A harsh desert world ..."]

# tokenization and weighted‑mean pooling omitted for brevity
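
The omitted pooling step can be sketched as follows. SGPT's weighted‑mean variants weight each token's hidden state by its position, so that later tokens, which have attended to more context in a causal model, count more. The sketch uses plain Python lists instead of tensors so the arithmetic stays visible; `weighted_mean_pool` and the toy inputs are assumptions, and a real implementation would apply the same formula to the model's last hidden state with `torch` operations.

```python
def weighted_mean_pool(hidden_states, attention_mask):
    """Position-weighted mean over token vectors, skipping padded positions.

    hidden_states: list of token vectors (seq_len x dim)
    attention_mask: list of 0/1 flags (seq_len); 0 marks padding
    """
    dim = len(hidden_states[0])
    pooled = [0.0] * dim
    total_weight = 0.0
    pairs = zip(hidden_states, attention_mask)
    for position, (vec, keep) in enumerate(pairs, start=1):
        weight = position * keep  # 1, 2, 3, ... for real tokens; 0 for padding
        total_weight += weight
        for j in range(dim):
            pooled[j] += weight * vec[j]
    if total_weight == 0:
        raise ValueError("attention_mask marks every position as padding")
    return [value / total_weight for value in pooled]

# Toy sequence: two real tokens followed by one padding token
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = weighted_mean_pool(tokens, mask)
print(embedding)
```

With weights 1 and 2 on the two real tokens, the pooled vector is (1·[1, 0] + 2·[0, 1]) / 3, and the padding token contributes nothing.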

BGE

BAAI General Embedding, state‑of‑the‑art Chinese/English semantic vector model. Repository: https://github.com/FlagOpen/FlagEmbedding

from transformers import AutoTokenizer, AutoModel
import torch

# Sample Chinese input ("sample data 1", "sample data 2")
sentences = ["样例数据-1", "样例数据-2"]
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
model.eval()
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
    # CLS pooling: use the last hidden state of the first token as the sentence vector
    sentence_embeddings = model_output[0][:, 0]
    # L2-normalize so that a dot product between embeddings equals cosine similarity
    sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)

LLM Instruction Fine‑Tuning

When domain‑specific QA data is available, perform supervised fine‑tuning (SFT) using the DeepSpeed‑Chat framework. Example training script snippet:

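OUTPUT=/path/to/save
ZERO_STAGE=2
# Fall back to defaults when either variable is left empty
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
    ZERO_STAGE=3
fi
mkdir -p "$OUTPUT"

# --zero_stage and --deepspeed are passed as in DeepSpeed-Chat's bundled run scripts
deepspeed main.py \
   --data_path /path/to/data \
   --model_name_or_path /path/to/chatglm2-6b \
   --per_device_train_batch_size 4 \
   --learning_rate 9.65e-7 \
   --num_train_epochs 10 \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir "$OUTPUT" \
   |& tee "$OUTPUT/training.log"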
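
Before running a script like the one above, the domain QA pairs have to be serialized into a training file. The sketch below writes them as JSON Lines. The field names `prompt` and `response`, the file name, and the sample pair are hypothetical; the schema expected by the training framework's data loader should be checked before use.

```python
import json
import os
import tempfile

def write_sft_jsonl(qa_pairs, path):
    """Write (question, answer) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in qa_pairs:
            record = {"prompt": question.strip(), "response": answer.strip()}
            # ensure_ascii=False keeps Chinese text readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical sample data
qa_pairs = [
    ("What does PAI-EAS provide? ", "Elastic scaling and blue-green deployment."),
]
path = os.path.join(tempfile.gettempdir(), "sft_train.jsonl")
write_sft_jsonl(qa_pairs, path)
records = read_jsonl(path)
print(records)
```

Reading the file back right after writing it is a cheap way to catch encoding or quoting problems before a long training run.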

Prompt Engineering

Prompts are crafted so that the LLM reproduces hyperlinks, commands, or QA pairs from the retrieved knowledge verbatim instead of paraphrasing them. Example for extracting hyperlinks:

prompt = 'You are an intelligent assistant. Answer the question using the provided knowledge. If the knowledge contains a web link, output the link exactly.'

Various scenarios (hyperlink extraction, key‑information restoration, code extraction) are demonstrated with before/after examples.
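
How such an instruction is combined with retrieved knowledge and the user's question can be sketched as follows. The instruction text is the one quoted above; the `build_prompt` helper, the "Knowledge / Question / Answer" layout, and the sample chunk with its example.com link are assumptions, not the template used by the original solution.

```python
INSTRUCTION = (
    "You are an intelligent assistant. Answer the question using the provided "
    "knowledge. If the knowledge contains a web link, output the link exactly."
)

def build_prompt(question, chunks, instruction=INSTRUCTION):
    """Combine the instruction, numbered knowledge chunks, and the question."""
    knowledge = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        f"{instruction}\n\n"
        f"Knowledge:\n{knowledge}\n\n"
        f"Question: {question}\n"
        f"Answer:"
    )

# Hypothetical retrieved chunk and question
chunks = ["Installation guide: https://example.com/install"]
prompt = build_prompt("Where is the installation guide?", chunks)
print(prompt)
```

Numbering the chunks keeps them visually separate in the prompt and makes it easier to trace which chunk an answer drew on.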

Inference Deployment

Two PAI‑EAS services are deployed:

LangChain main pipeline service (vector search + prompt + LLM).

LLM inference service.

PAI‑EAS provides elastic scaling and blue‑green deployment. BladeLLM accelerates inference and supports streaming output. Example streaming client:

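import json
from websockets.sync.client import connect

# Assumes a BladeLLM service listening locally on port 8081
with connect("ws://localhost:8081/generate_stream") as websocket:
    prompt = "What's the capital of Canada?"
    websocket.send(json.dumps({
        "prompt": prompt,
        "sampling_params": {"temperature": 0.9, "top_p": 0.9, "top_k": 50},
        "stopping_criterial": {"max_new_tokens": 100}
    }))
    # Print tokens as they arrive until the server marks the reply as finished
    while True:
        msg = json.loads(websocket.recv())
        if not msg['is_ok']:
            # Stop instead of waiting indefinitely when the server reports an error
            print("\nrequest failed:", msg)
            break
        if msg['is_finished']:
            break
        print(msg['tokens'][0]["text"], end="", flush=True)
    print("-" * 40)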

Web UI Demo

The UI allows users to configure embedding models, select vector databases, upload knowledge files, and choose between pure vector search, pure LLM generation, or retrieval‑augmented generation.
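
The three query modes can be sketched as a small dispatcher. The mode names, the `answer` function, and the stand‑in `fake_retrieve` and `fake_generate` callables are hypothetical; in the deployed system they would correspond to the vector database lookup and the LLM inference service.

```python
def answer(query, mode, retrieve, generate):
    """Dispatch a query to one of the three query modes.

    retrieve(query) -> list of text chunks
    generate(prompt) -> model reply as a string
    """
    if mode == "vector_search":
        # Pure vector search: return the matching chunks unchanged
        return "\n".join(retrieve(query))
    if mode == "llm":
        # Pure LLM generation: no external knowledge is added
        return generate(query)
    if mode == "rag":
        # Retrieval-augmented generation: retrieved chunks go into the prompt
        context = "\n".join(retrieve(query))
        return generate(f"Knowledge:\n{context}\n\nQuestion: {query}")
    raise ValueError(f"unknown mode: {mode}")

# Stand-in components (hypothetical) so the sketch runs without any service
def fake_retrieve(query):
    return ["BladeLLM supports streaming output."]

def fake_generate(prompt):
    return f"LLM reply to: {prompt!r}"

for mode in ("vector_search", "llm", "rag"):
    print(mode, "->", answer("Does BladeLLM stream?", mode, fake_retrieve, fake_generate))
```

Passing the retriever and generator in as callables keeps the modes independent of any particular vector store or model endpoint.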

Case Study: Alibaba Cloud Computing Platform Intelligent Q&A

The previous keyword‑based Elasticsearch retrieval often returned answers with low factual accuracy. In internal trials, the new LLM‑augmented system reduced the cost of manual answering, increased the interception rate by more than 10%, and raised the answer adoption rate from under 10% to over 70%.
