Building a Production-Ready RAG Engine for Office Knowledge Retrieval
This article examines the challenges of applying large language models in enterprise settings and presents a detailed, three‑layer RAG architecture—including offline ingestion, hybrid retrieval, multi‑stage ranking, and prompt‑engineered generation—along with practical insights, model choices, and deployment Q&A.
Background
Large language models (LLMs) suffer from hallucination, outdated knowledge, and data‑privacy risks when deployed directly in business applications. Retrieval‑Augmented Generation (RAG) mitigates these issues by coupling external retrieval with LLM generation.
RAG Core Architecture
Traditional RAG comprises three stages: indexing, retrieval, and generation. Advanced RAG adds pre‑retrieval query rewriting (e.g., HyDE, which drafts a hypothetical answer and retrieves against its embedding rather than the raw query's) and post‑retrieval re‑ranking, filtering, and knowledge routing; modular RAG treats each of these as a swappable component.
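To make the HyDE step concrete, here is a minimal sketch; the `llm` and `embed` callables are hypothetical stand‑ins for whatever generation and embedding services a deployment uses, not APIs named in the talk.

```python
from typing import Callable, List

def hyde_query_embedding(
    query: str,
    llm: Callable[[str], str],           # hypothetical LLM completion function
    embed: Callable[[str], List[float]], # hypothetical embedding function
) -> List[float]:
    """HyDE: embed a hypothetical answer instead of the raw query.

    The generated passage usually lands closer to real answer chunks in
    embedding space than the short question does, improving recall.
    """
    hypothetical = llm(
        f"Write a short passage that plausibly answers this question:\n{query}"
    )
    return embed(hypothetical)
```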
System Design at China Mobile
The system is organized into three layers:
Algorithm layer: OCR, layout analysis, table recognition, multi‑turn query rewriting, tokenization.
Workflow layer: offline ingestion (document parsing, tokenization, vector creation, index building) and online QA (query rewriting, hybrid retrieval, ranking, generation). Underlying stores include a vector database, Elasticsearch, and MySQL.
Management layer: knowledge‑base, model, and dialogue configuration.
Offline Pipeline
Documents (PDF/Word) are parsed, their layout is recovered, and the text is split into logical chunks. Chunk size balances retrieval recall against generation completeness. Each chunk is tokenized, embedded with the BGE‑M3 and BCE embedding models, and written to the index.
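A sketch of the embedding step for the BGE‑M3 side, assuming the FlagEmbedding package (its documented loader for BGE‑M3); the BCE model and the index write are omitted, and the chunk texts are placeholders:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

chunks = [
    "Example chunk recovered from a parsed PDF...",
    "Another logical chunk, roughly 256-512 tokens long...",
]

# BGE-M3 returns dense vectors for the vector database; its sparse
# lexical weights can additionally back the full-text side.
output = model.encode(chunks, return_dense=True, return_sparse=True)
dense_vectors = output["dense_vecs"]        # shape: (num_chunks, 1024)
sparse_weights = output["lexical_weights"]  # per-token lexical weights
```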
Online Pipeline
User queries undergo multi‑turn rewriting with a TPLinker‑based relation‑extraction model so that follow‑up questions become self‑contained. Hybrid retrieval then combines vector search and BM25 full‑text search, and the two result lists are merged with Reciprocal Rank Fusion (RRF). A coarse‑ranking stage (RRF scores or ColBERT) keeps the top‑20 candidates, which pass through a fine‑ranking model and an NLI classifier that filters out irrelevant knowledge.
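RRF itself is a few lines; this is a minimal self‑contained sketch (k = 60 is the conventional constant, and the doc IDs are illustrative):

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of doc IDs: score(d) = sum over lists of 1/(k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse the vector-search and BM25 lists, keep top-20 for ranking.
vector_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]
top20 = reciprocal_rank_fusion([vector_hits, bm25_hits])[:20]
```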
Ranking Models
Coarse ranking uses RRF (rank‑fusion) and ColBERT (late‑interaction dual‑tower). Fine ranking employs an interaction‑based cross‑encoder for higher relevance at the cost of latency.
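ColBERT's late interaction reduces to the MaxSim operator; a sketch, assuming token embeddings are already L2‑normalized (the document side precomputed offline, the query side encoded at request time):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT late interaction: for each query token take the maximum
    similarity over all document tokens, then sum over query tokens.

    query_vecs: (num_query_tokens, dim) L2-normalized token embeddings
    doc_vecs:   (num_doc_tokens, dim)   L2-normalized token embeddings
    """
    sim = query_vecs @ doc_vecs.T        # (q_tokens, d_tokens) cosine sims
    return float(sim.max(axis=1).sum())  # MaxSim per query token, summed
```

Because the expensive document encoding happens at index time, online cost is just one query encoding plus this cheap matrix operation per candidate, which is what makes ColBERT viable as a coarse ranker.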
Generation Enhancements
After ranking, knowledge blocks are formatted and injected into a prompt template containing knowledge and question sections. A two‑stage generation (FoRAG) first produces an outline, then expands it to a full answer, improving structure and factuality.
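A minimal sketch of the outline‑then‑expand flow in the spirit of FoRAG; the prompt wording is illustrative rather than the talk's actual template, and `llm` is again a hypothetical completion callable:

```python
from typing import Callable

def two_stage_generate(question: str, knowledge: str,
                       llm: Callable[[str], str]) -> str:
    """Stage 1 drafts an outline from the knowledge; stage 2 expands it."""
    outline = llm(
        "Based only on the knowledge below, draft a bullet outline "
        f"answering the question.\n[Knowledge]\n{knowledge}\n"
        f"[Question]\n{question}\n[Outline]"
    )
    answer = llm(
        "Expand the outline into a complete answer, using only the "
        f"given knowledge.\n[Knowledge]\n{knowledge}\n"
        f"[Question]\n{question}\n[Outline]\n{outline}\n[Answer]"
    )
    return answer
```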
Practical Insights
Hybrid search improves recall and precision by leveraging semantic and lexical matching.
Chunk size selection (e.g., 256‑512 tokens) is a trade‑off between retrieval accuracy and context completeness; see the chunker sketch after this list.
Knowledge filtering reduces irrelevant content before generation.
Model choice matters: BGE‑M3 and BCE provide complementary embeddings; ColBERT encodes document tokens offline, so online scoring reduces to lightweight token‑level interaction.
System latency can be controlled by selecting lightweight ranking models when resources are limited.
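The chunk‑size bullet above can be made concrete with a minimal sliding‑window chunker; the 384‑token window and 64‑token overlap are illustrative defaults, not settings from the talk:

```python
from typing import List

def chunk_tokens(tokens: List[str], size: int = 384,
                 overlap: int = 64) -> List[List[str]]:
    """Split a token sequence into overlapping fixed-size chunks.

    A size in the 256-512 range trades retrieval precision (smaller,
    more focused chunks) against context completeness (larger chunks);
    the overlap keeps sentences that straddle a boundary retrievable.
    """
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```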
Q&A Highlights
Key deployment questions cover evaluation metrics (bad‑case resolution rate, overall accuracy), context completion strategies, latency mitigation, alternative optimization points beyond chunk size, and handling of multimodal data such as tables and audio‑video.