How to Build a Retrieval‑Augmented LLM Knowledge Base on Alibaba Cloud
This guide details a complete end‑to‑end solution for constructing a large‑language‑model knowledge‑base chatbot on Alibaba Cloud, covering background, modular architecture, vector database selection, text preprocessing, embedding models, LLM fine‑tuning, prompt engineering, deployment with PAI‑EAS and BladeLLM, and real‑world results.
Background
Large language models such as ChatGPT and Tongyi Qianwen excel at natural language processing but suffer from factuality and timeliness issues, making them unsuitable for precise customer‑service or knowledge‑base Q&A without external knowledge.
Modular Architecture
The solution follows a modular pipeline: text processing, embedding generation, vector‑search database, LLM instruction fine‑tuning, prompt engineering, and inference deployment.
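The offline half of this pipeline (clean, chunk, embed, store) can be sketched as a small composition of functions. This is a minimal illustration, not the article's implementation: `IndexedChunk`, `build_index`, `toy_split`, and `toy_embed` are hypothetical names, and the toy splitter and two-number "embedding" stand in for a real chunker and embedding model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class IndexedChunk:
    text: str
    vector: List[float]


def build_index(
    documents: Sequence[str],
    split: Callable[[str], List[str]],
    embed: Callable[[str], List[float]],
) -> List[IndexedChunk]:
    """Offline stage: clean -> chunk -> embed -> store."""
    index: List[IndexedChunk] = []
    for doc in documents:
        # Collapsing whitespace is a stand-in for real data cleaning.
        cleaned = " ".join(doc.split())
        for chunk in split(cleaned):
            index.append(IndexedChunk(text=chunk, vector=embed(chunk)))
    return index


# Toy stand-ins so the sketch runs without any model or database.
def toy_split(text: str) -> List[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def toy_embed(text: str) -> List[float]:
    # Hypothetical 2-dimensional "embedding": character count and word count.
    return [float(len(text)), float(len(text.split()))]


index = build_index(
    ["PAI-EAS  hosts models.  Hologres stores vectors."], toy_split, toy_embed
)
for item in index:
    print(item.text, item.vector)
```

Passing the splitter and embedder in as arguments keeps each module swappable, which is the point of the modular design described above.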
Vector Search Database Selection
Cloud Database Options
Hologres: Alibaba Cloud’s real‑time data warehouse with the integrated Proxima vector engine, supporting high‑throughput, low‑latency queries for large knowledge bases.
Elasticsearch: Fully managed Elasticsearch service with X‑Pack features, suitable for log analysis and multi‑dimensional queries.
AnalyticDB PostgreSQL: Cloud‑native MPP data warehouse compatible with ANSI SQL and PostgreSQL.
Local Database Options
Faiss: Facebook AI Similarity Search, an open‑source library for efficient similarity search on dense vectors.
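Whichever database is chosen, the core operation is the same: score the query vector against stored vectors and return the closest ones. The sketch below shows that operation as an exact, brute-force cosine search in pure Python. It is an illustration only, not Faiss or Proxima code, and `l2_normalize` and `top_k` are hypothetical helper names. Production engines add approximate indexes so the search scales beyond a linear scan.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) for the k most similar vectors."""
    q = l2_normalize(query)
    scored = []
    for i, v in enumerate(vectors):
        n = l2_normalize(v)
        scored.append((i, sum(a * b for a, b in zip(q, n))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([2.0, 0.1], corpus, k=2)
print(hits)  # the vector at index 0 ranks first, then index 2
```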
Text Processing
Key steps include data cleaning, semantic chunking, and QA extraction. Cleaned documents are split into short chunks using CharacterTextSplitter from LangChain:
from langchain.text_splitter import CharacterTextSplitter
# Example values only; tune chunk_size and chunk_overlap for your corpus.
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)
def split_documents(docs):
    # docs is a list of LangChain Document objects
    return text_splitter.split_documents(docs)
After chunking, each chunk should receive a concise title or summary for indexing.
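To make the roles of the chunk size and overlap concrete, here is a dependency-free sketch of fixed-width chunking with overlap. It is a simplification: LangChain's splitter works on separators and merges pieces up to the size limit, whereas this hypothetical `chunk_text` helper simply slides a character window.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Slide a window of chunk_size characters, stepping by size minus overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a boundary still appears whole in at least one chunk's neighborhood.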
Embedding Models
Several open‑source models are recommended:
text2vec
Provides Word2Vec, BERT, Sentence‑BERT, etc. Repository: https://github.com/shibing624/text2vec
SGPT
GPT‑based sentence embeddings. Example code:
import torch
from transformers import AutoModel, AutoTokenizer
from scipy.spatial.distance import cosine
tokenizer = AutoTokenizer.from_pretrained("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")
model = AutoModel.from_pretrained("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")
model.eval()
queries = ["I'm searching for a planet not too far from Earth."]
docs = ["Neptune is the eighth ...", "TRAPPIST-1d ...", "A harsh desert world ..."]
# tokenization and weighted‑mean pooling omitted for brevity
BGE
BAAI General Embedding, state‑of‑the‑art Chinese/English semantic vector model. Repository: https://github.com/FlagOpen/FlagEmbedding
from transformers import AutoTokenizer, AutoModel
import torch
sentences = ["样例数据-1", "样例数据-2"]
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
model.eval()
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
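with torch.no_grad():
    model_output = model(**encoded_input)
    # CLS pooling: take the first token's hidden state as the sentence embedding
    sentence_embeddings = model_output[0][:, 0]
# L2-normalize so that dot products between embeddings equal cosine similarity
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
LLM Instruction Fine‑Tuning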
When domain‑specific QA data is available, perform supervised fine‑tuning (SFT) using the DeepSpeed‑Chat framework. Example training script snippet:
OUTPUT=/path/to/save
ZERO_STAGE=2
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
    ZERO_STAGE=3
fi
mkdir -p $OUTPUT
deepspeed main.py \
    --data_path /path/to/data \
    --model_name_or_path /path/to/chatglm2-6b \
    --per_device_train_batch_size 4 \
    --learning_rate 9.65e-7 \
    --num_train_epochs 10 \
    --output_dir $OUTPUT \
    |& tee $OUTPUT/training.log
Prompt Engineering
Prompts are crafted to force the LLM to include hyperlinks, commands, or exact QA pairs. Example for extracting hyperlinks:
prompt = 'You are an intelligent assistant. Answer the question using the provided knowledge. If the knowledge contains a web link, output the link exactly.'
Various scenarios (hyperlink extraction, key‑information restoration, code extraction) are demonstrated with before/after examples.
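An instruction like the one above only takes effect once it is combined with the retrieved chunks and the user's question. The sketch below shows one way to assemble that final prompt. The `build_prompt` helper, the `Knowledge:` / `Question:` layout, and the example URL are assumptions made for illustration; the source does not specify the exact template.

```python
from typing import Sequence

SYSTEM_INSTRUCTION = (
    "You are an intelligent assistant. Answer the question using the provided "
    "knowledge. If the knowledge contains a web link, output the link exactly."
)


def build_prompt(question: str, chunks: Sequence[str]) -> str:
    """Join the instruction, numbered knowledge chunks, and the question."""
    knowledge = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        f"{SYSTEM_INSTRUCTION}\n\n"
        f"Knowledge:\n{knowledge}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "Where is the deployment guide?",
    ["The deployment guide is at https://example.com/deploy."],  # placeholder URL
)
print(prompt)
```

Numbering the chunks makes it easier to check which retrieved passage the model drew on when reviewing before/after outputs.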
Inference Deployment
Two PAI‑EAS services are deployed:
LangChain main pipeline service (vector search + prompt + LLM).
LLM inference service.
PAI‑EAS provides elastic scaling and blue‑green deployment. BladeLLM accelerates inference and supports streaming output. Example streaming client:
import json
from websockets.sync.client import connect
with connect("ws://localhost:8081/generate_stream") as websocket:
    prompt = "What's the capital of Canada?"
    websocket.send(json.dumps({
        "prompt": prompt,
        "sampling_params": {"temperature": 0.9, "top_p": 0.9, "top_k": 50},
        "stopping_criterial": {"max_new_tokens": 100}
    }))
    while True:
        msg = json.loads(websocket.recv())
        if not msg['is_ok']:
            break  # the server reported an error; stop reading
        if msg['is_finished']:
            break
        # print each streamed token as soon as it arrives
        print(msg['tokens'][0]["text"], end="", flush=True)
    print("-" * 40)
Web UI Demo
The UI allows users to configure embedding models, select vector databases, upload knowledge files, and choose between pure vector search, pure LLM generation, or retrieval‑augmented generation.
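The three answer modes map onto a simple dispatch over a retriever and a generator. The following is a minimal sketch under stated assumptions: the mode names, the `answer` function, and the fake retriever and generator are all hypothetical stand-ins for the vector database query and the LLM inference service.

```python
from typing import Callable, List

MODES = ("vector_search", "llm", "rag")


def answer(
    question: str,
    mode: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Route a question through retrieval only, generation only, or both."""
    if mode == "vector_search":
        return "\n".join(retrieve(question))
    if mode == "llm":
        return generate(question)
    if mode == "rag":
        context = "\n".join(retrieve(question))
        return generate(f"Knowledge:\n{context}\n\nQuestion: {question}")
    raise ValueError(f"unknown mode: {mode!r}; expected one of {MODES}")


# Fake components so the sketch runs without a database or a model.
def fake_retrieve(question: str) -> List[str]:
    return ["PAI-EAS supports elastic scaling."]


def fake_generate(prompt: str) -> str:
    return f"LLM saw {len(prompt)} characters"


question = "What does PAI-EAS support?"
results = {m: answer(question, m, fake_retrieve, fake_generate) for m in MODES}
for mode, text in results.items():
    print(mode, "->", text)
```

Comparing the three outputs side by side for the same question is a quick way to see how much the retrieved knowledge changes the model's answer.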
Case Study: Alibaba Cloud Computing Platform Intelligent Q&A
Traditional keyword‑based Elasticsearch retrieval suffered from low factuality. The new LLM‑augmented system reduced manual answer cost, increased interception rate by over 10 %, and raised answer adoption from <10 % to >70 % in internal trials.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
