How to Build a Production‑Ready RAG System for Enterprise Knowledge Workflows
This article explains the challenges of applying large language models in real‑world office scenarios and presents a detailed, step‑by‑step RAG (Retrieval‑Augmented Generation) solution—including architecture, offline document processing, query rewriting, hybrid retrieval, multi‑stage ranking, knowledge filtering, and prompt‑driven generation—backed by practical lessons from a Chinese mobile operator.
Background
Large language models (LLMs) suffer from hallucinations, stale knowledge, and data‑privacy concerns when deployed in enterprise settings. Retrieval‑Augmented Generation (RAG) mitigates these issues by grounding LLM outputs in an external, up‑to‑date knowledge base.
RAG Overview
RAG combines a retriever that fetches relevant passages from a document store with a generator (LLM) that produces answers conditioned on the retrieved context. Advantages: fresh knowledge, easy updates, observable retrieval, reduced hallucinations.
System Architecture
The pipeline is divided into offline indexing and online query serving.
Offline Processing
Document ingestion (PDF, Word, etc.) → OCR, layout analysis, table extraction.
Hierarchical chunking: first split by structural elements (title, subtitle, body), then by token length (e.g., 256-512 tokens); a minimal chunking sketch appears after this list.
Tokenization and embedding using two dense models (BGE‑M3 and BCE) to obtain complementary vectors.
Store raw text in Elasticsearch and vectors in Infinity (or another vector DB).
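To make the hierarchical chunking step concrete, here is a minimal sketch. It assumes "#"-style headings mark the structural boundaries and counts whitespace tokens as a stand-in for the production tokenizer; both are illustrative simplifications, not the DeepDoc parser described later.

# Hierarchical chunking (illustrative sketch)
import re

def structural_split(text):
    # First pass: split at heading boundaries (title / subtitle assumed as '#'-style lines).
    parts = re.split(r"\n(?=#{1,3} )", text)
    return [p.strip() for p in parts if p.strip()]

def token_split(section, max_tokens=512):
    # Second pass: cap each chunk at max_tokens (whitespace tokens as a rough proxy).
    tokens = section.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

def chunk_document(text, max_tokens=512):
    return [chunk for section in structural_split(text)
            for chunk in token_split(section, max_tokens)]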
Online Query Flow
Multi‑turn query rewriting using a TPLinker‑based relation‑extraction model to resolve coreferences and add missing entities.
Hybrid retrieval: dense vector search (semantic similarity) + BM25 full‑text search.
Merge the result lists with Reciprocal Rank Fusion (RRF) to produce a unified candidate set; a minimal RRF sketch appears after this list.
Two‑stage ranking:
Coarse ranking: RRF (non‑model) and ColBERT (late‑interaction dual‑tower) to select top‑20 passages.
Fine ranking: cross‑encoder re‑ranker evaluates full interaction and returns top‑5 passages.
Knowledge filtering: binary NLI classifier removes passages unrelated to the query.
Prompt construction: format selected passages into a “knowledge” section, append the user question, and feed to the LLM.
FoRAG two-stage generation: first generate an outline, then expand the outline into the final answer; a two-call sketch also appears after this list.
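To illustrate the fusion step, here is a minimal Reciprocal Rank Fusion sketch. It assumes each retriever returns document IDs ordered best-first; that hit format is a simplification for illustration.

# Reciprocal Rank Fusion (illustrative sketch)
def rrf_merge(*ranked_lists, k=60):
    # score(d) = sum over result lists of 1 / (k + rank of d in that list)
    scores = {}
    for hits in ranked_lists:                        # each hits: doc IDs, best first
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: candidates = rrf_merge(dense_ids, bm25_ids, k=60)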
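The FoRAG-style two-stage generation can likewise be sketched as two LLM calls; the prompt wording and the llm.generate interface below are assumptions for illustration, not the production prompts.

# FoRAG-style two-stage generation (illustrative sketch)
def forag_generate(llm, knowledge, question):
    # Stage 1: draft an outline grounded in the retrieved knowledge.
    outline = llm.generate(
        f"Knowledge:\n{knowledge}\n\nQuestion: {question}\n"
        "Write a short outline of the answer as numbered points."
    )
    # Stage 2: expand the outline into the final answer, still grounded in the knowledge.
    return llm.generate(
        f"Knowledge:\n{knowledge}\n\nQuestion: {question}\n"
        f"Outline:\n{outline}\n\nExpand the outline into a complete, factual answer."
    )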
Key Components and Choices
Document parser: built on RAGFlow's DeepDoc module; optimized for PDF (OCR + layout analysis + table recovery) and Word (structure-preserving splitting).
Tokenizer: the cutword model, chosen for balanced granularity (jieba segments too finely, texsmart too coarsely).
Vector models: BGE-M3 and BCE, selected after relevance testing; the dual-model setup provides complementary recall.
Retriever: hybrid of dense vectors (Infinity) and BM25 full-text search (Elasticsearch).
Ranking models: RRF for score-free fusion, ColBERT for efficient token-level similarity, and a cross-encoder for final relevance; a late-interaction scoring sketch follows this list.
Knowledge filter: binary NLI classifier trained on domain data.
LLM: any instruction-tuned model; the prompt template includes {knowledge} and {question} sections, and a prompt-assembly sketch follows this list.
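As a sketch of the ColBERT-style late-interaction scoring mentioned above: given per-token embedding matrices for the query and a candidate passage (how they are produced is out of scope here, and numpy is assumed), the score sums, for each query token, its best cosine match among the passage tokens.

# ColBERT-style MaxSim scoring (illustrative sketch)
import numpy as np

def maxsim_score(query_token_embs, doc_token_embs):
    # Normalize rows so the dot product below is cosine similarity.
    q = query_token_embs / np.linalg.norm(query_token_embs, axis=1, keepdims=True)
    d = doc_token_embs / np.linalg.norm(doc_token_embs, axis=1, keepdims=True)
    sim = q @ d.T                          # (query_tokens, doc_tokens) similarity matrix
    return float(sim.max(axis=1).sum())    # best passage token per query token, summed

Candidates are ordered by this score, and the top 20 are handed to the cross-encoder, which scores each (query, passage) pair jointly.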
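Finally, a minimal sketch of the prompt assembly; the template wording below is illustrative, and only the {knowledge} and {question} structure follows the design above.

# Prompt construction (illustrative sketch)
def build_prompt(passages, question):
    # Retrieved passages become the "knowledge" section; the user question follows it.
    knowledge = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the knowledge below. "
        "If the knowledge is insufficient, say you cannot answer.\n\n"
        f"Knowledge:\n{knowledge}\n\n"
        f"Question: {question}\nAnswer:"
    )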
Implementation Details
Example code snippets (illustrative):
# Offline indexing (Python pseudocode)
for doc in documents:
    raw = parse(doc)                                    # OCR, layout analysis, table extraction
    sections = structural_split(raw)                    # split by title / subtitle / body
    chunks = [c for s in sections
              for c in token_split(s, max_tokens=512)]  # then cap chunk length
    for chunk in chunks:
        text = chunk.text
        vec1 = bge_m3.encode(text)                      # dense vector from BGE-M3
        vec2 = bce.encode(text)                         # dense vector from BCE
        es.index(id=chunk.id, body={"text": text})      # full-text (BM25) index
        infinity.upsert(id=chunk.id, vectors=[vec1, vec2])  # vector index

# Online query handling
query = rewrite(query_raw)                              # TPLinker-based multi-turn rewriting
dense_hits = infinity.search(query, top_k=100)          # semantic recall
bm25_hits = es.search(query, top_k=100)                 # keyword recall
candidates = rrf_merge(dense_hits, bm25_hits, k=60)     # reciprocal rank fusion
top20 = colbert_rank(candidates, query, top_k=20)       # coarse ranking
top5 = cross_encoder_rank(top20, query, top_k=5)        # fine ranking
filtered = nli_filter(top5, query)                      # NLI-based knowledge filtering
prompt = build_prompt(filtered, query)                  # {knowledge} + {question}
answer = llm.generate(prompt)

Practical Insights
Hybrid retrieval balances semantic coverage (vectors) and exact keyword matching (BM25).
Two‑stage ranking (coarse → fine) dramatically improves relevance while keeping latency acceptable.
Chunk size is a trade‑off: smaller chunks improve retrieval precision; larger chunks preserve context for generation.
Knowledge filtering removes unrelated or noisy passages that even strong rankers may miss.
Iterative A/B testing, metric tracking (bad‑case resolution rate, overall accuracy), and model selection are essential for production‑grade quality.
Conclusion
The described RAG solution demonstrates a complete workflow: document ingestion → hierarchical chunking → dual‑vector indexing → hybrid retrieval → multi‑stage ranking → NLI‑based filtering → structured prompting → two‑stage generation. This blueprint enables enterprise‑level AI assistants that provide up‑to‑date, verifiable answers while mitigating hallucinations and privacy risks.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.