Building an Enterprise‑Grade RAG 2.0 System: Architecture, Challenges, and Best Practices
This article examines how Retrieval‑Augmented Generation (RAG) mitigates large‑model shortcomings such as hallucination, stale knowledge, and data‑privacy risk. It then walks through a layered enterprise‑grade RAG 2.0 design, covering offline document parsing, multi‑turn query rewriting, hybrid vector‑plus‑full‑text retrieval, two‑stage ranking, knowledge filtering, and prompt‑driven generation, and shares concrete model choices, evaluation metrics, and lessons learned.
Background
Large language models (LLMs) excel at generation but suffer from hallucinations, outdated knowledge, and data‑privacy concerns. Retrieval‑Augmented Generation (RAG) addresses these issues by grounding generation in external knowledge sources.
RAG Core Architecture
The system is organized into three vertical layers:
Algorithm layer: OCR, layout analysis, table recognition, and multi‑turn query rewriting.
Process layer: Offline ingestion (document parsing, tokenization, vector indexing) and online answering (query rewrite, hybrid retrieval, ranking, generation). Underlying stores include a vector database, Elasticsearch, and MySQL.
User‑config layer: Knowledge‑base management, model selection, and dialogue rules.
Both offline and online pipelines are illustrated in the accompanying flow diagrams.
Offline Document Processing
Documents (PDF, Word) are parsed with the DeepDoc module of RAGFlow. PDFs require layout recovery, table extraction, and reading‑order reconstruction, while Word files rely on existing structural tags. After layout recovery, data is split in two steps: structural segmentation (title, subtitle, body) followed by length‑based chunking. Chunk sizes around 512 tokens balance retrieval relevance and generation completeness.
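The two-step split described above can be sketched roughly as follows. This is a minimal illustration, not the DeepDoc implementation: the `Chunk` type, the `chunk_document` function, and the whitespace tokenizer are all assumptions made for the example, and the structural segmentation step is assumed to have already produced (heading path, body) pairs.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    heading_path: list[str]  # titles and subtitles above this chunk (hypothetical field)
    text: str


def split_by_length(tokens: list[str], max_tokens: int) -> list[list[str]]:
    """Cut a token list into consecutive windows of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def chunk_document(sections: list[tuple[list[str], str]], max_tokens: int = 512) -> list[Chunk]:
    """Step 1 (structural segmentation) is assumed done: `sections` holds
    (heading_path, body) pairs. Step 2 splits each body by length, so no
    chunk ever crosses a section boundary."""
    chunks = []
    for heading_path, body in sections:
        tokens = body.split()  # whitespace split stands in for a real tokenizer
        for window in split_by_length(tokens, max_tokens):
            chunks.append(Chunk(heading_path, " ".join(window)))
    return chunks
```

Keeping the heading path on every chunk is one way to carry the document structure forward to retrieval and generation.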
Text is tokenized and embedded using two complementary vector models—BGE‑M3 and BCE—selected after relevance benchmarking. These embeddings are written to the vector index for later retrieval.
Online Query Handling
When a user asks a question, the system performs multi‑turn query rewriting using a TPLinker‑based relation‑extraction model to resolve coreferences and fill missing information.
Hybrid retrieval then runs in parallel:
Vector search (semantic similarity, multilingual support, robust to noise).
Full‑text BM25 search (exact keyword matching, high interpretability).
The two result sets are merged, and a two‑stage ranking is applied.
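As a rough sketch of the parallel fan-out, the snippet below runs two retrievers concurrently and returns both ranked lists for later fusion. The `Retriever` signature and the `hybrid_retrieve` function are hypothetical; in a real deployment the two callables would wrap the vector database and the Elasticsearch BM25 query.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical retriever signature: (query, top_k) -> ranked list of chunk ids.
Retriever = Callable[[str, int], list[str]]


def hybrid_retrieve(query: str, vector_search: Retriever, bm25_search: Retriever,
                    top_k: int = 20) -> dict[str, list[str]]:
    """Run both retrievers at the same time and return their ranked lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        vec_future = pool.submit(vector_search, query, top_k)
        bm25_future = pool.submit(bm25_search, query, top_k)
        # result() blocks until each search finishes and re-raises any error.
        return {"vector": vec_future.result(), "bm25": bm25_future.result()}
```

Threads suit this step because both searches are I/O-bound calls to external stores, so total latency is close to that of the slower retriever rather than the sum of both.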
Two‑Stage Ranking
Coarse ranking uses Reciprocal Rank Fusion (RRF), which combines the rank positions returned by the different retrievers and therefore does not need their raw scores to be comparable. The top‑20 candidates are passed to a finer‑grained ranker.
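A minimal sketch of RRF is shown below. Each chunk receives 1 / (k + rank) from every list it appears in, and the contributions are summed. The constant k = 60 is a commonly used default and is an assumption here, since the article does not state the value used; the function name `rrf_fuse` is likewise illustrative.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, top_n=20):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
print(rrf_fuse([vector_hits, bm25_hits]))  # ['b', 'a', 'd', 'c']
```

Only rank positions enter the formula, which is why cosine similarities and BM25 scores never need to be normalized against each other.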
Fine ranking employs three models:
ColBERT: a late‑interaction dual‑tower model that computes token‑level similarities efficiently.
A cross‑encoder (interactive) model for high‑accuracy re‑ranking at the cost of latency.
A knowledge‑filter classifier (NLI‑based) that discards irrelevant chunks before generation.
RRF and ColBERT handle the larger candidate sets because they are fast: RRF needs no model inference at all, and ColBERT only has to encode the query online when document token embeddings are precomputed offline. The cross‑encoder is reserved for the final top‑5 results.
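The late-interaction scoring used by ColBERT can be sketched in a few lines. For each query token, the score takes the best-matching document token and sums those maxima (often called MaxSim). This is a simplified illustration in plain Python: the token vectors are assumed to be L2-normalized and the document vectors precomputed, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the highest similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rerank(query_vecs, candidates):
    """candidates maps chunk id -> list of token vectors (assumed precomputed)."""
    scored = [(maxsim_score(query_vecs, vecs), cid) for cid, vecs in candidates.items()]
    # Highest score first; ties broken by chunk id.
    return [cid for _, cid in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]
```

Because the document side is encoded ahead of time, the online cost is one query encoding plus these cheap similarity lookups, whereas a cross-encoder must run a full forward pass for every (query, chunk) pair.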
Generation and Prompt Engineering
After ranking, selected knowledge chunks are formatted (knowledge layout) and injected into a prompt template containing a knowledge field and a question field. The LLM then generates an answer.
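A prompt template with a knowledge field and a question field might look like the sketch below. The template wording, the numbered layout, and the function names are assumptions for illustration; the article does not give the production template.

```python
# Hypothetical template: the wording is illustrative, not the production prompt.
PROMPT_TEMPLATE = (
    "Answer the question using only the knowledge below.\n\n"
    "Knowledge:\n{knowledge}\n\n"
    "Question: {question}\n"
)


def layout_knowledge(chunks):
    """Number each chunk so the answer can refer back to its sources."""
    return "\n".join(f"[{i}] {text.strip()}" for i, text in enumerate(chunks, start=1))


def build_prompt(chunks, question):
    """Fill the knowledge and question fields of the template."""
    return PROMPT_TEMPLATE.format(
        knowledge=layout_knowledge(chunks),
        question=question.strip(),
    )


print(build_prompt(["RRF fuses ranked lists.", "BM25 matches keywords."], "What does RRF do?"))
```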
To improve answer structure and factuality, a two‑stage generation (FoRAG) is used: first a concise outline is produced, then the full answer is expanded based on that outline.
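The outline-then-expand flow can be sketched as two chained model calls. This only illustrates the control flow described above, not the FoRAG method itself; `llm` is a hypothetical callable that takes a prompt string and returns text, and the prompt wording is an assumption.

```python
from typing import Callable


def two_stage_generate(llm: Callable[[str], str], knowledge: str, question: str) -> str:
    """Stage 1 drafts a concise outline; stage 2 expands it into the answer."""
    outline_prompt = (
        "Using only the knowledge below, write a concise outline for answering the question.\n\n"
        f"Knowledge:\n{knowledge}\n\nQuestion: {question}\n"
    )
    outline = llm(outline_prompt)

    answer_prompt = (
        "Expand the outline into a complete answer, using only the knowledge below.\n\n"
        f"Knowledge:\n{knowledge}\n\nQuestion: {question}\n\nOutline:\n{outline}\n"
    )
    return llm(answer_prompt)
```

Passing the knowledge into both calls keeps the expansion step grounded in the retrieved chunks rather than in the outline alone, at the cost of one extra model call per answer.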
Practical Insights and Challenges
Chunk size trade‑off: smaller chunks improve retrieval precision but may lose context; larger chunks preserve semantics but risk mixing unrelated information.
Tokenizer granularity: jieba and Baidu LAC produce overly fine tokens and TexSmart is too coarse, while the cutword tokenizer offers a balanced granularity.
Model selection: despite newer vector models, BGE‑M3 and BCE remain sufficient for the production workload.
Latency mitigation: when hardware limits prevent heavy cross‑encoder use, lightweight models like ColBERT are adopted.
Multimodal extension: audio‑video handling is planned via future multimodal techniques.
Evaluation and Deployment
Before launch, the system undergoes document‑level testing, manual QA against reference answers, and collection of bad‑case/good‑case feedback from multiple departments. Key metrics include bad‑case resolution rate and overall answer accuracy.
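The two metrics can be computed as simple ratios. The article does not define them formally, so the definitions below are assumptions: accuracy is the share of answers judged correct by reviewers, and bad-case resolution rate is the share of reported bad cases marked as resolved. The `resolved` field name is hypothetical.

```python
def answer_accuracy(judgments):
    """judgments: list of booleans, True when a reviewer judged the answer correct."""
    return sum(judgments) / len(judgments) if judgments else 0.0


def bad_case_resolution_rate(bad_cases):
    """bad_cases: list of dicts with a boolean 'resolved' field (hypothetical schema)."""
    if not bad_cases:
        return 0.0
    resolved = sum(1 for case in bad_cases if case["resolved"])
    return resolved / len(bad_cases)


print(answer_accuracy([True, True, True, False]))  # 0.75
print(bad_case_resolution_rate([{"resolved": True}, {"resolved": False}]))  # 0.5
```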
Q&A Highlights
When is the system ready for production? After achieving acceptable bad‑case resolution and overall accuracy in internal evaluations.
How to handle incomplete context? Supplement missing layers based on document hierarchy while respecting LLM input limits.
How to reduce latency? Analyze bottlenecks and, if needed, switch to lighter ranking models such as ColBERT.
Beyond chunk size, what else can be optimized? Preserve full document structure during parsing and choose an appropriate chunk length for the target scenario.
How to deal with long tables in PDFs? Currently the whole table is fed to the LLM; finer region extraction is a future improvement.
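The hierarchy-based supplementation mentioned in the answer on incomplete context could work along these lines: prepend parent headings to a chunk, nearest parent first, until the input budget is used up. This is a sketch under assumptions; the function name, the ancestor ordering, and the whitespace token counter are all illustrative.

```python
def supplement_context(chunk_text, ancestors, max_tokens,
                       count_tokens=lambda s: len(s.split())):
    """ancestors: heading or intro strings ordered from nearest parent to document root.

    Adds as many ancestors as fit in max_tokens, preferring the nearest ones,
    then returns them root-first followed by the chunk itself."""
    budget = max_tokens - count_tokens(chunk_text)
    kept = []
    for text in ancestors:
        cost = count_tokens(text)
        if cost > budget:
            break  # stop at the first ancestor that no longer fits
        kept.append(text)
        budget -= cost
    # Reverse so the context reads top-down: root, ..., parent, chunk.
    return "\n".join(list(reversed(kept)) + [chunk_text])
```

With a real model, `count_tokens` would be replaced by that model's tokenizer so the budget matches the actual input limit.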
Conclusion
Building a production‑grade RAG system requires careful attention to every stage—from robust document parsing and query rewriting to hybrid retrieval, multi‑stage ranking, knowledge filtering, and prompt‑driven generation. The layered design, model choices, and evaluation practices described here provide a practical blueprint for enterprise AI deployments.
DataFunTalk