Building a Production‑Ready RAG System for Enterprise Knowledge Work
This article examines the practical challenges of deploying Retrieval‑Augmented Generation (RAG) in an enterprise office setting and the solutions adopted at each stage: the motivating problems, a modular architecture, offline and online pipelines, hybrid retrieval, multi‑stage ranking, knowledge filtering, prompt engineering, and model selection — all in service of accurate, reliable answers.
Background
Large language models (LLMs) face three major obstacles in real‑world deployment: hallucinations, outdated knowledge, and data‑privacy risks. Retrieval‑Augmented Generation (RAG) mitigates these issues by coupling a retriever with a generator, allowing external knowledge to be injected at inference time.
Core RAG Architecture
The modular RAG design consists of multiple components: data sources, preprocessing modules, a retriever, a ranker, and a generator. The architecture evolves from a basic pipeline (index → retrieve → generate) to an advanced version that adds query rewriting, HyDE‑style synthetic document generation, post‑retrieval re‑ranking, and knowledge filtering.
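The basic index → retrieve → generate pipeline can be sketched as three composable stages. The example below is a deliberately toy, in-memory version: the corpus, the bag-of-words "embedding," and the stubbed `generate` function are illustrative assumptions, not the production components described later.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def index(docs: list[str]) -> list[tuple[str, Counter]]:
    """Offline stage: embed every chunk once and store it."""
    return [(doc, embed(doc)) for doc in docs]

def retrieve(store: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Online stage: rank stored chunks by similarity to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _vec in ranked[:k]]

def generate(query: str, chunks: list[str]) -> str:
    """Stub generator: in a real system this is the LLM call."""
    return f"Q: {query}\nContext: {' | '.join(chunks)}"

store = index([
    "RAG couples a retriever with a generator.",
    "BM25 ranks documents by keyword overlap.",
    "Vector search captures semantic similarity.",
])
print(generate("What does RAG couple together?",
               retrieve(store, "What does RAG couple together?")))
```

The advanced version described in this article replaces each of these stubs: the retriever becomes a hybrid vector/BM25 search, a ranking stage is inserted before generation, and query rewriting preprocesses the question.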
System Design Overview
The system is organized in three layers:
Algorithm layer: OCR, layout analysis, table recognition, multi‑turn query rewriting, and tokenization.
Process layer: Offline indexing (document parsing, tokenization, vector creation) and online QA (query rewriting, hybrid retrieval, ranking, generation). Underlying storage includes vector databases, Elasticsearch, and MySQL.
Management layer: Knowledge‑base administration, model versioning, and dialogue generation rules.
Offline Processing
Documents (PDF, Word) are parsed, layout‑recovered, and split into logical chunks. Each chunk is further divided by length to balance retrieval precision and context completeness. Text chunks are tokenized and embedded using two complementary vector models (BGE‑M3 and BCE) before being written to the index.
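The length-based splitting step can be sketched as a sliding window with overlap, so that context is not lost at chunk boundaries. The chunk size and overlap below are illustrative defaults, not the production settings (granularity tuning is discussed later):

```python
def split_chunk(text: str, max_len: int = 256, overlap: int = 32) -> list[str]:
    """Split a logical chunk into fixed-length windows with overlap,
    trading retrieval precision against context completeness."""
    if len(text) <= max_len:
        return [text]
    step = max_len - overlap
    return [text[i:i + max_len] for i in range(0, len(text), step)]

pieces = split_chunk("a" * 600)
print([len(p) for p in pieces])  # → [256, 256, 152]
```

Each resulting piece would then be embedded with both vector models and written to the index alongside its source-document metadata.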
Online Query Handling
When a user asks a question, multi‑turn query rewriting (treated as a relation‑extraction task using TPLinker) resolves coreferences and fills missing information. The rewritten query then drives a hybrid retrieval that combines vector similarity with BM25 full‑text search.
Hybrid Retrieval and Ranking
Hybrid retrieval merges results from vector and BM25 searches, leveraging the strengths of semantic similarity and exact keyword matching. After retrieving the top 100 candidates, a two‑stage ranking is applied:
Coarse ranking using Reciprocal Rank Fusion (RRF) to quickly narrow to the top 20.
Fine ranking with models such as ColBERT (late‑interaction dual‑tower) and a cross‑encoder re‑ranker to select the final 5 most relevant chunks.
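The coarse stage, Reciprocal Rank Fusion, fits in a few lines: each document scores the sum of `1 / (k + rank)` across the ranked lists it appears in, which rewards documents that both retrievers rank highly. The constant `k = 60` follows the common convention from the original RRF paper; whether the system described here uses the same value is an assumption.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    """Merge several ranked lists (e.g. vector and BM25 results)
    by Reciprocal Rank Fusion and return the top_n fused IDs."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

vector_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]
print(rrf_fuse([vector_hits, bm25_hits], top_n=3))  # d1 and d3 benefit from appearing in both lists
```

Because RRF uses only ranks, not raw scores, it needs no score normalization across the vector and BM25 retrievers, which is what makes it suitable as a cheap first pass before the heavier ColBERT and cross‑encoder stages.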
Knowledge filtering, implemented as a lightweight NLI classifier, removes irrelevant or unsafe content before generation.
Prompt Engineering and Generation
Selected knowledge chunks are formatted into a prompt template that separates knowledge and question sections. A two‑stage generation (FoRAG) first produces an outline and then expands it, improving answer structure and factual consistency.
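A sketch of that template structure is below. The exact wording and section markers are assumptions on my part, and the FoRAG outline‑then‑expand flow is shown as two calls to a generic `llm` callable rather than a specific API:

```python
PROMPT_TEMPLATE = """[Knowledge]
{knowledge}

[Question]
{question}

Answer strictly based on the knowledge above. If it is insufficient, say so."""

def build_prompt(chunks: list[str], question: str) -> str:
    """Format the selected chunks and the user question into separate sections."""
    knowledge = "\n".join(f"({i}) {c}" for i, c in enumerate(chunks, 1))
    return PROMPT_TEMPLATE.format(knowledge=knowledge, question=question)

def forag_answer(llm, chunks: list[str], question: str) -> str:
    """Two-stage generation in the spirit of FoRAG:
    first ask for an outline, then expand it into the final answer."""
    base = build_prompt(chunks, question)
    outline = llm(base + "\n\nFirst produce a short outline of the answer.")
    return llm(base + f"\n\nExpand this outline into a full answer:\n{outline}")

prompt = build_prompt(["RAG couples a retriever with a generator."],
                      "What does RAG couple?")
print(prompt)
```

Numbering the chunks also makes it easy to ask the model to cite which knowledge item supports each claim, a common follow-on technique.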
Model Choices and Evaluation
Segmentation granularity (e.g., 128, 256, 512 tokens) is tuned to balance retrieval recall and generation fidelity. Experiments showed that jieba and Baidu LAC produce overly fine tokenization, while texsmart is too coarse; the cutword model offers a practical middle ground. For vector embeddings, BGE‑M3 and BCE were selected for their complementary performance; newer models did not yield significant gains in the target scenario.
Key Takeaways
Building a basic RAG pipeline is straightforward, but achieving production quality requires careful attention to each component.
Effective retrieval depends on robust document parsing, multi‑turn query rewriting, and hybrid search.
Two‑stage ranking (coarse + fine) and knowledge filtering dramatically improve relevance and safety.
Prompt structuring and outline‑guided generation enhance answer accuracy and readability.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.