How to Optimize RAG for Alibaba Interviews? 7 Golden Rules Explained
This article is a step‑by‑step technical guide to optimizing Retrieval‑Augmented Generation (RAG) for interview scenarios. It covers query rewriting, HyDE, fallback and decomposition strategies, retriever and prompt routing, multi‑representation indexing, hybrid retrieval, re‑ranking, Self‑RAG, generation control, and performance benchmarking, and closes with a practical checklist, with code examples and metrics throughout.
1. Query Rewriting
When user queries are vague or use terminology different from the knowledge base, a query‑rewriting layer translates them into forms the retriever can understand.
1.1 Multi‑Query Rewriting
Example: the user asks “苹果新品” (“Apple new products”). Using LangChain’s MultiQueryRetriever, the system generates three alternative queries such as <query>Apple 2023发布会</query> (Apple 2023 launch event), <query>iPhone15配置参数</query> (iPhone 15 specs), and <query>苹果秋季新品发布会</query> (Apple fall product launch). Results from the multiple queries are fused with Reciprocal Rank Fusion (RRF), where score = 1 / (rank + 60); a sketch of the fusion follows the retriever example below.
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

base_retriever = ...  # your vector store retriever

# The default output parser expects one generated query per line
query_prompt = PromptTemplate.from_template(
    "请根据以下问题生成3个不同的查询语句,每行一个:{question}"  # "generate 3 alternative queries for this question, one per line"
)
retriever = MultiQueryRetriever.from_llm(
    llm=ChatOpenAI(temperature=0),
    retriever=base_retriever,
    prompt=query_prompt
)
# Returns the deduplicated documents retrieved across all generated queries
docs = retriever.get_relevant_documents("苹果新品")
for doc in docs:
    print(doc.page_content)
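The RRF fusion over the per‑query result lists reduces to a few lines. A minimal sketch, assuming each ranked list holds document IDs ordered best‑first:
# Reciprocal Rank Fusion: each list votes 1 / (rank + c) for its documents
def rrf_fuse(ranked_lists, c=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)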
1.2 HyDE (Hypothetical Document Embeddings)
HyDE first asks the LLM to generate a plausible answer, then uses that answer as the retrieval query, bridging the gap between user phrasing and knowledge‑base language. If the similarity between the hypothetical answer and the retrieved documents falls below 0.7, a second‑pass retrieval is triggered (sketched after the code below).
from langchain.chains import HypotheticalDocumentEmbedder
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
# Embed a hypothetical LLM answer, then search an existing vectorstore with that vector
hyde = HypotheticalDocumentEmbedder.from_llm(
    ChatOpenAI(temperature=0), OpenAIEmbeddings(), prompt_key="web_search")
docs = vectorstore.similarity_search_by_vector(hyde.embed_query("苹果新品"))
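A minimal sketch of the 0.7 fallback described above; retrieve_by_vector, embed, and generate_hypothetical_answer are hypothetical helpers standing in for the pieces built earlier:
from sklearn.metrics.pairwise import cosine_similarity

# If the hypothetical answer is a poor match for what it retrieved, run a
# second pass with the raw user query instead
def hyde_with_fallback(query, retrieve_by_vector, embed, threshold=0.7):
    hypo_vec = embed(generate_hypothetical_answer(query))
    docs = retrieve_by_vector(hypo_vec)
    doc_vecs = [embed(d.page_content) for d in docs]
    best = max(cosine_similarity([hypo_vec], doc_vecs)[0]) if doc_vecs else 0.0
    if best < threshold:
        docs = retrieve_by_vector(embed(query))  # second-pass retrieval
    return docs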
1.3 Question Fallback & Decomposition
For complex or unanswerable questions, the system either breaks the problem into simpler sub‑questions (serial decomposition) or abstracts it into a broader query (abstract fallback). A decision function (below) selects the strategy, and a serial‑execution sketch follows it.
# Decide how to transform a complex query; the is_* and split_* helpers are
# placeholders for rule- or LLM-based checks
def decompose_query(query):
    if is_multi_step(query):          # serial decomposition
        return split_into_steps(query)
    elif is_cross_domain(query):      # split by knowledge domain
        return split_into_domains(query)
    elif is_too_specific(query):      # abstract fallback: broaden the query
        return [generalize_query(query)]
    else:
        return [query]
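For serial decomposition, sub‑questions are answered in order, with each answer feeding the next step’s context. A sketch, where retrieve and answer_with_context are hypothetical helpers:
# Answer sub-questions serially, carrying earlier answers forward as context
def answer_serially(query, retrieve, answer_with_context):
    history = []
    for sub_q in decompose_query(query):
        docs = retrieve(sub_q)
        answer = answer_with_context(sub_q, docs, prior=history)
        history.append((sub_q, answer))
    return history[-1][1]  # the last sub-answer resolves the original question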
2. Routing Optimization
A metadata routing engine classifies the question (e.g., medical, code, default) and forwards it to the most suitable retriever.
# Simulated lightweight classifier; a production system would use an LLM or a
# trained intent model
def llm_classify(query):
    if "病" in query or "医生" in query:      # "illness" / "doctor" → medical
        return "medical"
    elif "代码" in query or "函数" in query:  # "code" / "function" → code
        return "code"
    else:
        return "default"

def route_query(query):
    topic = llm_classify(query)
    if topic == "medical":
        return medical_retriever(query)   # domain retrievers defined elsewhere
    elif topic == "code":
        return code_retriever(query)
    else:
        return default_retriever(query)

print(route_query("我最近总是头痛,可能是什么原因?"))  # "I keep getting headaches; what could be the cause?"
Metrics reported: routing accuracy > 85 % and response latency < 200 ms.
2.1 Dynamic Prompt Routing
Prompt templates (technical, educational, business) are selected automatically based on semantic similarity between the user query and each template. The best template is used when similarity > 0.8.
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
prompt_templates = {
    "technical": "请用数学公式和架构图解释{query}",   # explain with formulas and architecture diagrams
    "educational": "请用通俗易懂的语言解释{query}",   # explain in plain language
    "business": "请用简洁明了的方式总结{query}"       # summarize concisely
}
user_query = "解释Transformer注意力机制"  # "Explain the Transformer attention mechanism"
query_emb = model.encode([user_query])
template_embs = model.encode(list(prompt_templates.values()))
sim = cosine_similarity(query_emb, template_embs)
best_idx = sim.argmax()
best_template = list(prompt_templates.keys())[best_idx]
best_score = sim[0][best_idx]
if best_score > 0.8:
    final_prompt = prompt_templates[best_template].format(query=user_query)
    print(f"匹配成功:使用 {best_template} 模板,相似度 {best_score:.2f}")  # matched template and similarity
    print("生成的Prompt:", final_prompt)  # the generated prompt
else:
    print("没有找到合适的模板")  # no suitable template found
3. Index Optimization
3.1 Multi‑Representation Index
Each document segment is stored with three embeddings:
Original chunk: the raw text, split at fixed length.
Summary embedding: an AI‑generated abstract of the chunk.
Question embedding: simulated user questions the chunk can answer.
Example for an “Artificial Intelligence” paragraph:
Original: raw paragraph.
Summary: “AI is technology that simulates human intelligence.”
Questions: “What is AI?” “What are AI applications?”
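One way to wire this up is LangChain’s MultiVectorRetriever, which searches over the alternative representations but returns the original chunk. A sketch, assuming chunks (Document objects), summaries, questions, and a vectorstore were prepared earlier:
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
import uuid

# Search over chunk/summary/question embeddings, return the full parent chunk
store = InMemoryStore()
retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key="doc_id")
doc_ids = [str(uuid.uuid4()) for _ in chunks]
for i, doc in enumerate(chunks):
    reps = [doc.page_content, summaries[i], questions[i]]  # three representations
    vectorstore.add_texts(reps, metadatas=[{"doc_id": doc_ids[i]}] * len(reps))
store.mset(list(zip(doc_ids, chunks)))  # map IDs back to the original chunks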
3.2 RAPTOR Hierarchical Index
Long documents are indexed as a four‑level tree (root summary → intermediate summary nodes → leaf sentences), enabling fast traversal. Two retrieval strategies are compared (both sketched after the list):
Tree traversal – 92 % recall, ~300 ms, suited for high‑precision needs.
Folded tree – 85 % recall, ~80 ms, suited for real‑time systems.
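A sketch of the two query modes, assuming each node object carries an embedding and a children list:
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Folded/collapsed tree: ignore the hierarchy and rank every node flat (fast)
def collapsed_tree_search(query_vec, nodes, top_k=5):
    return sorted(nodes, key=lambda n: cos(query_vec, n.embedding), reverse=True)[:top_k]

# Tree traversal: descend level by level, keeping the `beam` best children (precise)
def tree_traversal_search(query_vec, root, beam=3):
    frontier = [root]
    while any(n.children for n in frontier):
        children = [c for n in frontier for c in n.children]
        frontier = sorted(children, key=lambda n: cos(query_vec, n.embedding),
                          reverse=True)[:beam]
    return frontier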
3.3 Semantic Chunker
Splits text at semantic boundaries instead of fixed lengths, using a breakpoint threshold on the embedding distance between adjacent sentences so that sentences are never broken mid‑thought.
from langchain_experimental.text_splitter import SemanticChunker
# Breakpoints are configured as a threshold type plus amount; a 70th-percentile
# cutoff on adjacent-sentence distance corresponds to the 0.7 breakpoint
splitter = SemanticChunker(embeddings, breakpoint_threshold_type="percentile",
                           breakpoint_threshold_amount=70)
Chunk‑size recommendations:
text‑embedding‑ada‑002: 256 tokens with 50‑token overlap.
bge‑large‑zh: 512 tokens with 100‑token overlap.
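For plain token‑count chunking at these sizes, LangChain’s tiktoken‑aware splitter is a reasonable stand‑in (tiktoken only approximates bge’s own tokenizer; document_text is assumed loaded earlier):
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 512-token chunks with 100-token overlap, as recommended for bge-large-zh
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512, chunk_overlap=100)
chunks = splitter.split_text(document_text)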
4. High‑Level RAG Strategies (Hybrid Retrieval + Re‑Ranking + Self‑RAG)
4.1 Hybrid Retrieval
Combining sparse (BM25) and dense (vector) retrieval yields higher recall. The ensemble is fused with RRF (c = 60).
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import FAISS

bm25 = BM25Retriever.from_documents(docs)
# Dense side: a standard vector-store retriever (LangChain has no DenseRetriever class)
dense = FAISS.from_documents(docs, embeddings).as_retriever()
ensemble = EnsembleRetriever(
    retrievers=[bm25, dense],
    weights=[0.4, 0.6],
    c=60  # RRF constant
)
Benchmark: pure vector recall = 72 %; hybrid recall = 89 %.
4.2 Re‑Ranking
Two‑stage ranking: coarse retrieval followed by an LLM‑based re‑ranker. Model trade‑offs:
Cohere rerank – high cost, ★★★★ quality, commercial API.
bge‑reranker – medium cost, ★★★☆ quality, open‑source.
T5‑Encoder – low cost, ★★☆☆ quality, suited to budget deployments.
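A minimal second stage with the open‑source bge‑reranker via sentence‑transformers’ CrossEncoder (candidates is assumed to be the coarse‑retrieval output):
from sentence_transformers import CrossEncoder

# Score (query, passage) pairs with a cross-encoder, keep the top 5
reranker = CrossEncoder("BAAI/bge-reranker-large")
pairs = [(query, doc.page_content) for doc in candidates]
scores = reranker.predict(pairs)
reranked = [doc for _, doc in sorted(zip(scores, candidates),
                                     key=lambda x: x[0], reverse=True)][:5]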
4.3 Self‑RAG (Self‑Check)
After generation the system evaluates:
Whether retrieval was needed.
Document relevance (1‑5).
Support evidence (yes/partial/no).
If confidence is low, the system re‑retrieves and re‑generates; the check schema and a sketch of the loop follow.
[Retrieval] Is retrieval needed? → yes/no
[Relevance] Document relevance → score 1-5
[Support] Is the answer grounded in the documents? → yes/partial/no
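A minimal sketch of the loop, where grade_relevance, generate, grade_support, and broaden_query are hypothetical LLM‑backed helpers:
# Retry retrieval and generation until the self-checks pass or retries run out
def self_rag_answer(query, retriever, llm, max_retries=2):
    for _ in range(max_retries + 1):
        docs = retriever.get_relevant_documents(query)
        relevance = grade_relevance(llm, query, docs)  # 1-5 score
        answer = generate(llm, query, docs)
        support = grade_support(llm, answer, docs)     # "yes" / "partial" / "no"
        if relevance >= 4 and support == "yes":
            return answer
        query = broaden_query(llm, query)              # rewrite before retrying
    return "根据现有资料无法确认"  # "cannot be confirmed from the available material"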
5. Generation Control for Industrial Deployment
Prompt engineering adds safety guardrails and source attribution.
from langchain.prompts import PromptTemplate
# Template (translated): "You are a professional assistant. (1) Answer only from
# the context below. (2) If unsure, reply '根据现有资料无法确认' ('cannot be
# confirmed from the available material'). (3) Cite sources for key information."
safety_prompt = PromptTemplate.from_template("""
你是一名专业助手,回答需满足:
(1) 仅基于以下内容作答: {context}
(2) 不确定时回复'根据现有资料无法确认'
(3) 重要信息标注来源
---
用户问题:{question}
""")
Few‑shot examples guide the answer style, and a confidence threshold (< 0.65) triggers the "cannot confirm" response:
if max(relevance_scores) < 0.65:
    return "根据现有资料无法确认"  # fall back when no document scores high enough
5.1 CRAG (Four‑Step Correction Framework)
Stages and tools (a pipeline sketch follows the list):
Retrieval assessment – check relevance using a MiniLM classifier.
Knowledge refinement – extract key facts with LLM + regex.
Dynamic supplement – fetch missing information via Google Search API.
Final generation – produce answer with source tags using a fixed template and LLM.
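A sketch of the four steps end to end; assess_relevance, refine, web_search, and generate_with_sources are hypothetical stand‑ins for the tools named above:
# CRAG-style pipeline: assess → refine → supplement → generate with sources
def crag_answer(query, retriever):
    docs = retriever.get_relevant_documents(query)
    verdict = assess_relevance(query, docs)        # MiniLM relevance classifier
    facts = refine(docs)                           # LLM + regex key-fact extraction
    if verdict != "correct":
        facts += web_search(query)                 # dynamic supplement via search API
    return generate_with_sources(query, facts)     # fixed template with source tags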
6. Performance Benchmarking
Each configuration is reported with three metrics (accuracy, recall, latency) plus a brief note:
Base RAG – 38.2 % accuracy, 64.7 % recall, 120 ms latency (fast but low quality).
+ Query rewrite + Hybrid – 52.1 % accuracy, 81.3 % recall, 180 ms latency (improved via rewriting and mixed retrieval).
+ Self‑RAG + Re‑Rank – 63.7 % accuracy, 89.2 % recall, 240 ms latency (self‑check and re‑ranking boost results).
All strategies – 71.3 % accuracy, 93.5 % recall, 320 ms latency (best quality, highest latency).
Code example for query rewrite + hybrid search:
from query_rewriter import rewrite_query           # assumed project-local modules
from retriever import vector_search, keyword_search

def hybrid_search(query):
    rewritten = rewrite_query(query)  # list of rewritten queries
    vector_results, keyword_results = [], []
    for q in rewritten:
        vector_results.extend(vector_search(q))
        keyword_results.extend(keyword_search(q))
    # merge_and_deduplicate is assumed to fuse the two lists (e.g., via RRF)
    return merge_and_deduplicate(vector_results, keyword_results)
7. RAG Optimization Checklist
7.1 Index Construction
Clean raw documents (strip HTML/PDF artifacts).
Test chunkers for different content types (narrative, tables, code).
Validate embeddings (e.g., MTEB Chinese benchmark, prefer bge‑large‑zh).
7.2 Query Phase
Install sentence‑transformers for re‑ranking.
Configure the HyDE fallback (second‑pass retrieval) to trigger when similarity < 0.7.
Monitor top‑50 recall weekly.
7.3 Generation Phase
Inject source IDs like [1] for traceability (see the sketch after this list).
Flag answers whose hallucination‑detector score falls below 0.9.
Timeout > 5 s triggers simplified output.
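A small sketch of source‑ID injection: number the retrieved chunks in the prompt context so the model can cite them as [1], [2], and so on (the metadata field name is an assumption):
# Build a numbered context block so generated answers can cite [1], [2], ...
def build_context(docs):
    return "\n".join(
        f"[{i}] {doc.page_content} (source: {doc.metadata.get('source', 'unknown')})"
        for i, doc in enumerate(docs, start=1)
    )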
7.4 Pitfalls
Watch Chinese embedding bias (test “芯片” vs “chip”).
Use RAPTOR for long documents instead of simple chunking.
Provide fallback for commercial APIs (e.g., Cohere → local BGE).
7.5 Continuous Improvement
Run regular recall tests, adjust similarity thresholds, and incorporate newer retrieval models.