Can Your RAG System Pass the Demo and Remain Accurate Across 5,000 Documents?
This article dissects a tough interview question: how to build a production‑grade Retrieval‑Augmented Generation (RAG) system that not only works in a demo but also delivers stable, correct answers over a knowledge base of 5,000 documents. It covers chunking, hybrid retrieval, intent routing, constrained generation, evaluation metrics, and operational safeguards.
Why a Demo Is Not Enough
The interviewer's follow‑up, "Can your RAG reliably answer over 5,000 documents?", tests whether you understand the gap between a toy demo and a production‑ready pipeline. RAG cannot guarantee 100% correctness; stability means suppressing missed recalls (relevant evidence that is never retrieved), hallucinations, outdated knowledge, and other failure modes to within acceptable business thresholds.
Common Misconceptions
Many assume that scaling from 50 to 5,000 documents is just a matter of increasing top_k from 3 to 10. In reality, larger corpora introduce conflicts, duplicate policies, and timestamp mismatches that cause the model to synthesize incorrect answers.
Another myth is that vector similarity equals business relevance. Dense embeddings often misinterpret identifiers, SKU codes, or clause numbers, leading to semantically close but factually wrong results.
Four Pillars of a Stable RAG
Accurate Retrieval at Scale : Use hybrid retrieval (dense vectors + BM25) with carefully tuned similarity thresholds based on a labeled offline set.
Evidence‑Grounded Generation : Enforce strict prompting rules—include source IDs, timestamps, and require the model to refuse when evidence is missing.
System Robustness : Design for incremental updates, partitioned indexes, caching, streaming responses, and comprehensive monitoring.
Quantifiable Metrics : Track error rate, missed‑recall rate, unreferenced‑answer rate, latency, and cost against a fully labeled (“gold”) or partially labeled (“half‑gold”) benchmark set.
Technical Blueprint
Smart Chunking & Metadata : Parse PDFs structurally (titles, tables, lists), preserve hierarchy, and prepend each chunk with a breadcrumb like "Document > Section > Subsection" to lock context.
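A minimal sketch of the breadcrumb prefix, assuming each chunk carries a title_path field listing its ancestor headings (the field name and chunk shape are illustrative, not a fixed schema):
# Sketch: prepend a "Document > Section > Subsection" breadcrumb so both the
# embedding and the LLM see where the chunk sits in the hierarchy.
def with_breadcrumb(chunk):
    # e.g. chunk["title_path"] == ["Refund Policy", "Exceptions", "Digital Goods"]
    breadcrumb = " > ".join(chunk["title_path"])
    return f"[{breadcrumb}]\n{chunk['text']}"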
Neighbour Expansion : After a leaf chunk matches, retrieve surrounding chunks (k before and after) to provide sufficient context, avoiding isolated sentences that can flip meaning.
# Sketch: after a leaf chunk is matched, take the k sibling chunks before and
# after it in document order and join them into the generation context.
def expand_context(matched_id, id_to_node, all_nodes, k=1):
    doc_id = id_to_node[matched_id]["doc"]
    # All chunks of the same document, in document order.
    seq = sorted((n for n in all_nodes if n["doc"] == doc_id),
                 key=lambda n: n["pos"])
    i = next(j for j, n in enumerate(seq) if n["id"] == matched_id)
    lo, hi = max(0, i - k), min(len(seq), i + k + 1)
    return "\n".join(n["text"] for n in seq[lo:hi])
Embedding Selection & Thresholding : Choose language‑specific or domain‑adapted models; evaluate on an internal labeled set to set a similarity threshold that balances recall and precision.
Hybrid Retrieval (Vector + BM25) : Let vectors handle semantic similarity while BM25 captures exact matches for identifiers. Fuse results with Reciprocal Rank Fusion (RRF) as a baseline, then optionally apply learned ranking.
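RRF itself is only a few lines; this sketch fuses two ranked lists of chunk IDs, with k=60 as the conventional smoothing constant:
# Sketch: Reciprocal Rank Fusion over the dense and BM25 result lists.
def rrf_fuse(dense_ids, bm25_ids, k=60):
    scores = {}
    for ranking in (dense_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)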
Intent Routing & Multi‑Index Partitioning : Classify queries into intents and route them to the relevant index partitions (by topic, business line, or time slice), so that only the one or two relevant indexes are searched per query, reducing noise and latency.
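A sketch of the routing table; the intent labels and partition names are placeholders for whatever taxonomy the business defines:
# Sketch: map a classified intent to the one or two partitions worth searching.
PARTITIONS = {
    "billing": ["invoices_idx", "pricing_idx"],
    "hr_policy": ["policies_idx"],
    "tech_support": ["manuals_idx", "faq_idx"],
}

def route(intent):
    return PARTITIONS.get(intent, ["general_idx"])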
Reranking : Use a cross‑encoder reranker on the top 30‑200 candidates (depending on latency budget) to refine relevance.
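A sketch using the sentence-transformers CrossEncoder API; the model name is one common public checkpoint rather than a recommendation, and the candidate shape is assumed:
# Sketch: score (query, passage) pairs with a cross-encoder and keep the best.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative choice

def rerank(query, candidates, top_n=10):
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]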
Constrained Generation : Prompt the LLM to cite sources with IDs, timestamps, and block boundaries; require a fallback response like "No relevant evidence found" when citations are missing. Add confidence signals (retrieval score, rerank score, evidence overlap) to decide whether to answer, expand the search, or defer to a human.
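A sketch of the confidence gate; the three signals come from the pipeline above, and every threshold here is a placeholder to be tuned on the offline labeled set:
# Sketch: decide whether to answer, widen the search, or hand off to a human.
def decide(retrieval_score, rerank_score, evidence_overlap):
    if rerank_score >= 0.7 and evidence_overlap >= 0.5:  # placeholder thresholds
        return "answer"
    if retrieval_score >= 0.4:
        return "expand_search"   # e.g. larger top_k or more partitions
    return "defer"               # reply "No relevant evidence found" or escalate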
Operational Concerns
Incremental Updates : Process only new or changed documents (OCR → parse → chunk → embed) using version hashes to avoid full re‑embedding.
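A sketch of the version‑hash check, assuming a simple key‑value store mapping document IDs to the hash recorded at last indexing:
# Sketch: reprocess a document only when its content hash has changed.
import hashlib

def needs_reindex(doc_id, doc_bytes, hash_store):
    digest = hashlib.sha256(doc_bytes).hexdigest()
    if hash_store.get(doc_id) == digest:
        return False                 # unchanged: skip OCR/parse/chunk/embed
    hash_store[doc_id] = digest      # new or changed: record and reprocess
    return True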
Vector Sharding & Caching : Partition vectors and apply filters to shrink candidate space; cache frequent queries after normalizing them.
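A sketch of the normalized query cache; the normalization rules (casing, whitespace) are deliberately minimal and would grow with real traffic:
# Sketch: cache answers under a normalized key so near-identical queries hit.
import re

_cache = {}

def normalize(query):
    return re.sub(r"\s+", " ", query.strip().lower())

def cached_answer(query, answer_fn):
    key = normalize(query)
    if key not in _cache:
        _cache[key] = answer_fn(query)
    return _cache[key]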
Streaming Output : Stream tokens only after the reference span is locked to avoid changing citations mid‑response.
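One way to sketch the citation lock, assuming the generator emits a marker once its reference span is fixed; the marker and the protocol are assumptions, not a standard API:
# Sketch: buffer tokens until the citation span is locked, then stream freely.
def stream_after_lock(token_stream, marker="[CITATIONS_LOCKED]"):
    buffer, locked = [], False
    for tok in token_stream:
        if locked:
            yield tok
            continue
        buffer.append(tok)
        joined = "".join(buffer)
        if marker in joined:
            locked = True
            yield joined.replace(marker, "")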
Monitoring : Beyond TTFB and 500 errors, watch retrieval hit rate, empty‑result rate, score distributions before and after reranking, citation‑missing rate, sampled unreferenced answers, generation length, the rate of immediate follow‑up questions, and the human‑reviewed factual error rate.
Evaluation Framework
Build a gold‑label set (hundreds of queries covering high‑frequency intents, long documents, multi‑hop reasoning, identifiers, negations, and timelines). Measure Context Recall, Context Precision, Faithfulness, citation coverage, error rate, latency, and cost. Ensure offline metrics correlate with business‑level signals like complaint rate and human fallback frequency.
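A sketch of per‑query Context Recall and Context Precision against gold evidence IDs; in practice these are averaged over the whole gold set:
# Sketch: recall = share of gold evidence retrieved; precision = share of
# retrieved chunks that are gold evidence.
def context_metrics(retrieved_ids, gold_ids):
    retrieved, gold = set(retrieved_ids), set(gold_ids)
    hit = len(retrieved & gold)
    recall = hit / len(gold) if gold else 0.0
    precision = hit / len(retrieved) if retrieved else 0.0
    return recall, precision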
In summary, a production‑grade RAG combines smart chunking with metadata, hybrid retrieval with RRF fusion, intent‑based routing, constrained generation with evidence checks, and a robust ops stack (incremental updates, sharding, caching, streaming, monitoring). Only by closing the loop between metrics and live traffic can you claim "stable correct answers" over thousands of documents.