Can Your RAG Pass the Demo? Scaling to 5,000 Docs for Reliable Answers
The article walks through the practical challenges of turning a RAG demo into a production system for 5,000 insurance documents, covering knowledge‑base chunking, embedding model selection, recall‑threshold tuning, hybrid vector‑BM25 retrieval, intent‑aware query routing, prompt constraints, confidence scoring, and operational scaling, with concrete metrics and code examples.
1. Data Preparation: Knowledge-Base Quality Sets RAG's Upper Bound
Many start by deploying Milvus or FAISS without first clarifying what they need to retrieve. The core of RAG is the knowledge base, whose quality depends on fine‑grained data processing.
In a financial‑insurance project we received 5,000 PDF files (product manuals, claim terms, training material, case records). The first version extracted text with PyPDF, cut it into 500‑character chunks, and indexed them.
When asked “What is the waiting period for critical‑illness insurance?”, the top‑3 retrieved chunks all mentioned “waiting period” but none answered the question because the chunking split the semantics.
chunk_1: "...保险责任包括但不限于以下情况。等待期内发生的疾病..."(后半句被切掉了)
chunk_2: "...30天、90天、180天三种,具体以合同约定为准..."(前面说的是什么产品,不知道)
chunk_3: "...客户张某在等待期后第5天确诊..."(这是个案例,不是条款)We switched to a logical‑structure chunking strategy:
PDF structural parsing – detect titles, paragraphs, tables, lists; each title and its content become a chunk; tables become separate chunks; independent list items become separate chunks.
Contextual prefix – prepend a path like “Product: XX Critical Illness > Chapter 3 Insurance Responsibility > 3.2 Waiting Period” to each chunk so the LLM knows the provenance.
Dynamic window – during retrieval also return the preceding and following chunk to restore broken semantics.
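A minimal sketch of these three moves, assuming the structural parser already yields (title_path, text) sections; Section, build_chunks, and with_dynamic_window are illustrative names, not the project's actual code:
from dataclasses import dataclass

@dataclass
class Section:
    title_path: str  # e.g. "Product: XX Critical Illness > Chapter 3 Insurance Liability > 3.2 Waiting Period"
    text: str

def build_chunks(sections):
    """One chunk per parsed section, with a contextual-prefix header."""
    return [
        {"id": i, "text": f"{sec.title_path}\n{sec.text}"}  # the prefix tells the LLM the provenance
        for i, sec in enumerate(sections)
    ]

def with_dynamic_window(chunks, hit_id):
    """At retrieval time, return the hit plus its neighbors to restore split semantics."""
    lo, hi = max(0, hit_id - 1), min(len(chunks), hit_id + 2)
    return "\n\n".join(c["text"] for c in chunks[lo:hi])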
After these changes the same query returns a single, complete chunk:
chunk: "产品名称:XX重疾险 > 第三章保险责任 > 3.2等待期规定
本产品等待期为90天。等待期内发生的疾病,保险公司不承担给付责任..."Recall accuracy improves from 0.62 to 0.84.
2. Retrieval Recall: Prioritize Usefulness Over Similarity
Embedding model choice matters. OpenAI text‑embedding‑ada‑002 gave mediocre results on Chinese insurance clauses, confusing "accidental injury" with "accidental medical" coverage. Switching to BGE‑large‑zh (from ZhiYuan/BAAI), which was trained on large Chinese corpora, raised recall precision.
Recall threshold tuning is critical. In our tests:
Threshold 0.65 – recall 0.89, but top‑5 contains two noisy chunks.
Threshold 0.72 – recall 0.84, top‑5 mostly useful.
Threshold 0.78 – recall 0.71, missing edge cases.
We settled on 0.72 to favor precision.
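Tuning like this only works against a labeled evaluation set; a minimal sketch of the sweep, assuming eval_set holds (query, relevant_chunk_ids) pairs and search_fn wraps the vector store — both placeholder names:
def recall_at_k(search_fn, eval_set, threshold, k=5):
    """Fraction of labeled queries whose relevant chunk shows up in the top-k above the threshold."""
    hits = 0
    for query, relevant_ids in eval_set:
        results = search_fn(query, top_k=k, threshold=threshold)
        if any(r["id"] in relevant_ids for r in results):
            hits += 1
    return hits / len(eval_set)

# Example sweep (search_fn and eval_set come from your own stack):
# for t in (0.65, 0.72, 0.78):
#     print(t, recall_at_k(search_fn, eval_set, threshold=t))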
Pure vector search ignores exact keyword matches; for example, a query for claim number A12345 may retrieve generic claim‑process documents but miss the specific record because IDs are poorly embedded.
Hybrid Retrieval: Vector + BM25
We combine vector search with BM25, then rerank with a dedicated model:
# Vector search
vector_results = vector_db.search(query_embedding, top_k=10, threshold=0.72)
# BM25 keyword search
bm25_results = bm25_index.search(query, top_k=10)
# Merge and deduplicate
all_results = merge_and_deduplicate(vector_results, bm25_results)
# Rerank
final_results = reranker.rerank(query, all_results, top_k=5)
This raises recall accuracy from 0.84 to 0.91.
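The merge step above is not spelled out; one plausible sketch, assuming each result is a dict with a stable chunk id and a score already normalized to a comparable scale (raw BM25 scores are not in [0, 1]):
def merge_and_deduplicate(vector_results, bm25_results):
    """Union both result lists, keeping the best score per chunk id."""
    best = {}
    for r in vector_results + bm25_results:
        prev = best.get(r["id"])
        if prev is None or r["score"] > prev["score"]:
            best[r["id"]] = r
    # Highest score first; the reranker makes the final ordering decision
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)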
3. Query Understanding: Not Every Question Should Hit Retrieval
Queries fall into several categories: factual Q&A, calculation, database query, time‑constrained request. Sending all queries to vector search leads to two problems: calculation queries get irrelevant policy text, and time‑sensitive queries retrieve outdated documents.
Intent Recognition: Three‑Layer Solution
We use rule‑based matching, a BERT classifier, and fallback to LLM for low‑confidence cases. Rules handle the majority with zero latency; the classifier resolves most remaining cases in tens of milliseconds; only ambiguous queries invoke the LLM.
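A sketch of that cascade; the rule patterns, bert_classifier, and llm_classify are stand-ins, not the production components:
import re

# Illustrative rules; the real rule table and thresholds are project-specific
RULES = [
    (re.compile(r"premium|how much|calculate", re.IGNORECASE), "calculation"),
    (re.compile(r"claim (number|no\.)|[A-Z]\d{5}"), "data_query"),
]

def classify_intent(query, bert_classifier, llm_classify, conf_threshold=0.85):
    # Layer 1: rules catch the majority at zero latency
    for pattern, intent in RULES:
        if pattern.search(query):
            return intent
    # Layer 2: the BERT classifier resolves most remaining cases in tens of ms
    intent, confidence = bert_classifier(query)
    if confidence >= conf_threshold:
        return intent
    # Layer 3: only low-confidence queries pay the LLM's cost and latency
    return llm_classify(query)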
Routing logic (simplified):
Intent “knowledge Q&A” → vector + BM25 retrieval.
Intent “calculation” → route to calculation module.
Intent “data query” → route to NL2SQL.
Intent “chit‑chat” → direct LLM conversation.
Multi‑index routing further improves precision by selecting a topic‑specific index (e.g., “claims policy”, “sales strategy”, “product info”) after intent detection.
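Putting the routing table and multi-index selection together, a hedged sketch; the handlers dict and index names are assumptions for illustration:
# Assumed topic-to-index mapping; names mirror the examples above
TOPIC_INDEXES = {
    "claims": "claims_policy",
    "sales": "sales_strategy",
    "product": "product_info",
}

def route(query, intent, topic, handlers):
    """Dispatch by intent; handlers maps intent names to callables (illustrative wiring)."""
    if intent in ("calculation", "data_query", "chitchat"):
        return handlers[intent](query)
    # Knowledge Q&A: pick the topic-specific index, then run hybrid retrieval
    index = TOPIC_INDEXES.get(topic, "product_info")
    return handlers["knowledge_qa"](query, index=index)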
4. Generation Stage: Controlling the LLM
Simply appending retrieved chunks to a prompt leads to “hallucinations”. Initial prompt:
Answer the user's question based on the documents below:
{retrieved results}
User question: {query}
The LLM often added vague advice of its own. We switched to a constrained prompt with explicit rules:
You are an insurance knowledge Q&A assistant. Answer the user's question strictly based on the documents below.
[Important rules]
1. Use only information from the documents; do not add extra inference
2. If the documents contain no clear answer, reply "No relevant information in the knowledge base"
3. Do not use vague expressions such as "maybe", "probably", or "we suggest consulting..."
4. When citing a document, give the source (document name + section)
Document content:
{retrieved results}
User question: {query}
To handle uncertain answers we added confidence scoring (1‑5). If confidence < 3, the system triggers a second retrieval with a rewritten query or a lower threshold.
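A sketch of that fallback loop, assuming the generation call returns the answer together with its self-reported 1-5 confidence; retrieve, generate, and rewrite_query are placeholder names, and the lowered threshold is an illustrative value:
def answer(query, retrieve, generate, rewrite_query, base_threshold=0.72):
    """One retry pass when the model reports low confidence (illustrative flow)."""
    chunks = retrieve(query, threshold=base_threshold)
    reply, confidence = generate(query, chunks)  # generate returns (answer, 1-5 score)
    if confidence < 3:
        # Second pass: a rewritten query and a looser threshold widen the recall net
        chunks = retrieve(rewrite_query(query), threshold=base_threshold - 0.05)
        reply, confidence = generate(query, chunks)
    return reply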
This reduced “off‑topic” complaints from 18 % to 7 %.
5. Operational Considerations
Knowledge‑base updates – run incremental embedding scripts nightly to refresh changed documents.
Vector‑search performance – partition the Milvus index by business line; searching only the relevant partitions triples speed as the corpus grows (see the sketch after this list).
Prompt formatting – number and label each chunk so the LLM treats them as separate pieces.
Response latency – cache frequent queries and stream LLM output to stay under 10 seconds.
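For the partitioning point above, a minimal pymilvus sketch; the collection, field, and partition names are assumptions, not the project's actual schema:
from pymilvus import Collection

def search_partitions(query_embedding, partitions):
    """Restrict the ANN search to the business-line partitions the query needs."""
    collection = Collection("insurance_docs")  # assumed collection name
    return collection.search(
        data=[query_embedding],
        anns_field="embedding",                # assumed vector field name
        param={"metric_type": "IP", "params": {"nprobe": 10}},
        limit=5,
        partition_names=partitions,            # e.g. ["claims_policy"]
    )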
Putting all these pieces together turns a runnable demo into a production‑grade RAG system.
Interview Tips: Discussing RAG Challenges
When asked about RAG difficulties, outline the systemic view, cite concrete pain points (chunking, recall threshold, intent recognition, prompt constraints), and quantify improvements (recall accuracy 0.62 → 0.91, complaint rate 18 % → 7 %). Emphasize both algorithmic and engineering aspects such as document freshness, scaling, and monitoring.
Conclusion
RAG is the most telling component of any large‑model deployment, demanding both NLP expertise and systems engineering. Building a demo is only ~10 % of the work; delivering a stable, accurate service is an order of magnitude harder but far more valuable.
Wu Shixiong's Large Model Academy
We continuously share large‑model know‑how, helping you master the core skills (LLM, RAG, fine‑tuning, deployment) from zero to job offer, tailored for career‑switchers, autumn campus‑recruitment candidates, and anyone seeking a stable large‑model position.