Building an Enterprise‑Grade RAG 2.0 System: Architecture, Challenges, and Best Practices

This article analyses the practical construction of an enterprise‑level Retrieval‑Augmented Generation (RAG) 2.0 system, covering background issues of large models, a modular architecture, layered offline/online pipelines, hybrid retrieval, ranking strategies, prompt engineering, and deployment insights drawn from China Mobile’s production experience.

DataFunTalk

Background and Motivation

Large language models (LLMs) still suffer from hallucinations, stale knowledge, and data-privacy risks, which hinder their adoption in enterprise settings. Retrieval-Augmented Generation (RAG) addresses these problems by coupling external knowledge sources with LLMs, offering better factuality, easier knowledge updates, and explainability.

Core RAG 2.0 Architecture

The modular RAG architecture (illustrated in the modular RAG diagram) is organised into the following logical layers:

Algorithm layer: OCR, layout analysis, table recognition, multi-turn query rewriting, tokenisation, etc.

Process layer: Offline ingestion (document parsing, tokenisation, vectorisation, index building) and online QA (query rewriting, hybrid retrieval, ranking, generation). Underlying stores include a vector database, Elasticsearch, and MySQL.

User-config layer: Knowledge-base management, model selection, dialogue rules.
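A minimal sketch of how these layers might compose in code; every class and method name here is an illustrative assumption, not the production API.

```python
# Sketch only: the process layer assembles algorithm-layer components
# into the two pipelines described below.
class OfflinePipeline:
    """Ingestion: parse, chunk, embed, index."""
    def __init__(self, parser, chunker, embedder, index):
        self.parser, self.chunker = parser, chunker
        self.embedder, self.index = embedder, index

    def ingest(self, path: str) -> None:
        text = self.parser.parse(path)          # OCR / layout recovery
        chunks = self.chunker.split(text)       # two-step chunking
        vectors = self.embedder.encode(chunks)  # dense encoding
        self.index.add(chunks, vectors)         # vector DB + Elasticsearch

class OnlinePipeline:
    """QA: rewrite, retrieve, rank, generate."""
    def __init__(self, rewriter, retriever, ranker, llm):
        self.rewriter, self.retriever = rewriter, retriever
        self.ranker, self.llm = ranker, llm

    def answer(self, query: str) -> str:
        q = self.rewriter.rewrite(query)          # multi-turn rewriting
        candidates = self.retriever.search(q)     # hybrid retrieval
        top_k = self.ranker.rank(q, candidates)   # RRF + fine re-ranking
        return self.llm.generate(q, top_k)        # prompted generation
```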

Offline Pipeline – Document Parsing and Indexing

Documents (PDF, Word) are first processed by OCR and layout recovery. For PDFs, page‑level image analysis restores structure and extracts tables; for Word files, structural tags are directly used. The pipeline performs two‑step chunking: (1) structural split into headings, sub‑headings, etc.; (2) length‑based split (e.g., 256‑token chunks). Over‑short chunks hurt retrieval recall, while over‑long chunks (e.g., 512 tokens) may lose coherence; a balanced size is chosen empirically.
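A minimal sketch of the two-step chunking, assuming Markdown-style headings as the structural markers and whitespace tokens in place of a real tokenizer; both are simplifying assumptions.

```python
import re

def structural_split(text: str) -> list[str]:
    """Step 1: split on headings and sub-headings."""
    sections = re.split(r"\n(?=#{1,3} )", text)
    return [s.strip() for s in sections if s.strip()]

def length_split(section: str, max_tokens: int = 256) -> list[str]:
    """Step 2: cap each section at roughly max_tokens tokens."""
    tokens = section.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

def chunk(text: str, max_tokens: int = 256) -> list[str]:
    return [piece
            for section in structural_split(text)
            for piece in length_split(section, max_tokens)]
```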

Both lexical tokenisation and dense vector encoding are applied. After benchmarking several embedding models (BGE-M3, BCE, M3E, GTE), the team selected BGE-M3 and BCE for their complementary retrieval performance.
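A sketch of the dual-model encoding via sentence-transformers; the Hugging Face model ids are assumptions based on the public BGE-M3 and BCE releases, so verify them before use.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed public checkpoints, not necessarily the production models.
bge = SentenceTransformer("BAAI/bge-m3")
bce = SentenceTransformer("maidalun1020/bce-embedding-base_v1")

def encode_chunks(chunks: list[str]) -> dict[str, np.ndarray]:
    # Index every chunk under both models so their candidate sets
    # can be merged at retrieval time.
    return {
        "bge-m3": bge.encode(chunks, normalize_embeddings=True),
        "bce": bce.encode(chunks, normalize_embeddings=True),
    }
```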

Online Pipeline – Retrieval, Ranking, and Generation

Hybrid Retrieval: combines vector similarity (semantic matching, multilingual support) with BM25 full-text search (exact keyword matching). Merging the two result sets improves both recall and precision.
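A minimal sketch of the merge, assuming unit-normalised embeddings and the rank_bm25 package; the function names are illustrative.

```python
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(query, query_vec, chunks, chunk_vecs, top_n=20):
    # Lexical path: BM25 over whitespace tokens (a real system would
    # reuse the tokenizer chosen in the offline pipeline).
    bm25 = BM25Okapi([c.split() for c in chunks])
    lex_rank = np.argsort(-bm25.get_scores(query.split()))[:top_n]

    # Semantic path: cosine similarity (dot product of unit vectors).
    sem_rank = np.argsort(-(chunk_vecs @ query_vec))[:top_n]

    # Merge the candidate sets; scoring is left to the ranking stage.
    return list(dict.fromkeys([*lex_rank.tolist(), *sem_rank.tolist()]))
```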

Two-Stage Ranking:

Coarse ranking uses Reciprocal Rank Fusion (RRF) to fuse ranked lists from different retrievers without requiring comparable raw scores (see the sketch after this list).

Fine-grained re-ranking with models such as ColBERT (late-interaction, token-level scoring) and a cross-encoder (full query-document interaction). ColBERT pre-computes document token vectors offline, enabling fast online query scoring.
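Below is a sketch of both ranking stages. RRF scores each document as score(d) = sum over retrievers of 1 / (k + rank(d)), with k = 60 as the value from the original RRF paper; maxsim shows ColBERT-style late interaction. All names are illustrative.

```python
import numpy as np

def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists; only rank positions are used, so the
    retrievers' raw scores never need to be comparable."""
    scores: dict[str, float] = {}
    for results in result_lists:              # one ranked list per retriever
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def maxsim(query_tok_vecs: np.ndarray, doc_tok_vecs: np.ndarray) -> float:
    """ColBERT-style scoring: each query token takes its best-matching
    document token, and the per-token maxima are summed."""
    sim = query_tok_vecs @ doc_tok_vecs.T     # (n_query, n_doc) similarities
    return float(sim.max(axis=1).sum())
```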

Knowledge filtering follows ranking: an NLI‑based binary classifier discards passages that are irrelevant to the query, offering a cheap plug‑in alternative to additional ranking models.
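A sketch of such a filter, using a public cross-encoder NLI checkpoint as a stand-in for the production binary classifier; the model id and the keep/discard rule are assumptions.

```python
from transformers import pipeline

# Assumed public NLI checkpoint; the production system would use its
# own binary relevance classifier trained for this task.
nli = pipeline("text-classification",
               model="cross-encoder/nli-deberta-v3-base")

def filter_passages(query: str, passages: list[str]) -> list[str]:
    kept = []
    for p in passages:
        # Premise = passage, hypothesis = query: keep only passages
        # that the classifier judges to support (entail) the query.
        result = nli({"text": p, "text_pair": query})
        if result["label"] == "entailment":
            kept.append(p)
    return kept
```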

Prompt Engineering & Generation : After ranking, the top‑k passages are formatted into a knowledge‑layout block and inserted into a prompt template with separate knowledge and question fields. To improve factual consistency, a two‑stage generation (FoRAG) first produces an outline, then expands it into the final answer.

Model Choices and Empirical Findings

Segmentation granularity: 256‑token chunks gave the best trade‑off between retrieval recall and LLM input limits.

Tokenizer comparison: jieba and Baidu LAC produced overly fine granularity; TexSmart was too coarse; the cutword model achieved a balanced token split.

Ranking models: RRF is lightweight and effective for multi‑path fusion; ColBERT offers strong performance with low latency; cross‑encoder yields the highest accuracy but incurs higher compute cost.

Evaluation & Deployment Insights (Q&A)

Key deployment metrics include bad‑case resolution rate and overall accuracy, assessed via manual QA on a curated question‑answer set. When latency becomes a bottleneck, lighter ranking models (e.g., ColBERT) are preferred. Future work includes multimodal support for image and video content.
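A minimal sketch of the two metrics, assuming a simple record layout for manually reviewed cases.

```python
def accuracy(judgements: list[bool]) -> float:
    """Share of curated QA pairs judged correct by manual review."""
    return sum(judgements) / len(judgements)

def bad_case_resolution_rate(bad_cases: list[dict]) -> float:
    """Share of previously reported bad cases that now pass review;
    each record is assumed to carry a boolean 'resolved' field."""
    return sum(c["resolved"] for c in bad_cases) / len(bad_cases)
```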

Conclusion

Building a production-grade RAG system requires careful attention to every stage, from robust document parsing and balanced chunking, through hybrid retrieval and multi-stage ranking, to prompt design and two-stage generation. The described architecture demonstrates how modular design, model benchmarking, and plug-in components (knowledge filtering, RRF) can deliver reliable, enterprise-ready AI assistants.
