Building a Production-Ready RAG Engine for Office Knowledge Retrieval

This article examines the challenges of applying large language models in enterprise settings and presents a detailed, three‑layer RAG architecture—including offline ingestion, hybrid retrieval, multi‑stage ranking, and prompt‑engineered generation—along with practical insights, model choices, and deployment Q&A.

DataFunTalk

Background

Large language models (LLMs) suffer from hallucination, outdated knowledge, and data‑privacy risks when deployed directly in business applications. Retrieval‑Augmented Generation (RAG) mitigates these issues by coupling external retrieval with LLM generation.

RAG Core Architecture

Traditional RAG comprises indexing, retrieval, and generation. Advanced RAG adds pre‑retrieval query rewriting (e.g., HyDE) and post‑retrieval re‑ranking, filtering, and knowledge routing. The modular diagram illustrates these components.
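The pre-retrieval rewriting idea can be made concrete with a minimal HyDE sketch. The `generate` and `embed` callables here are assumptions standing in for an LLM call and an embedding model, not named components of the system described:

```python
# HyDE (Hypothetical Document Embeddings) sketch: instead of embedding the raw
# query, ask an LLM for a hypothetical answer passage and embed that, so the
# query vector lives in the same "answer space" as the indexed chunks.
def hyde_query_vector(query: str, generate, embed):
    """generate: str -> str (LLM call); embed: str -> list[float] (embedding model)."""
    hypothetical_passage = generate(
        f"Write a short passage that plausibly answers: {query}"
    )
    return embed(hypothetical_passage)
```

Retrieval then proceeds as usual with the returned vector; only the query side changes.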

Modular RAG architecture diagram

System Design at China Mobile

The system is organized into three layers:

Algorithm layer: OCR, layout analysis, table recognition, multi‑turn query rewriting, tokenization.

Workflow layer: offline ingestion (document parsing, tokenization, vector creation, index building) and online QA (query rewriting, hybrid retrieval, ranking, generation). Underlying stores include a vector database, Elasticsearch, and MySQL.

Management layer: knowledge‑base, model, and dialogue configuration.

Offline Pipeline

Documents (PDF/Word) are parsed with layout recovery and split into logical chunks. Chunk size balances retrieval recall and generation completeness. Text is tokenized and embedded with the BGE‑M3 and BCE vector models, then written to the index.
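The chunking step can be sketched as a sliding window with overlap. This is a simplified illustration: token counts here are whitespace tokens, whereas a production pipeline would use the embedding model's own tokenizer (e.g. BGE‑M3's):

```python
# Sliding-window chunker sketch: fixed-size chunks with overlap, so a fact that
# falls on a boundary still appears whole in at least one chunk.
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Each chunk is then embedded and written to both the vector index and the full‑text index.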

Online Pipeline

User queries undergo multi‑turn rewriting using a TPLinker‑based relation‑extraction model. Hybrid retrieval combines vector search and BM25 full‑text search; results are merged with Reciprocal Rank Fusion (RRF). A coarse‑ranking stage (e.g., RRF, ColBERT) selects the top‑20 candidates, followed by a fine‑ranking model and knowledge‑filtering via an NLI classifier.
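The RRF merge step is simple enough to show in full. This sketch follows the standard formulation, where each document scores the sum of 1/(k + rank) over the lists it appears in; k = 60 is the constant from the original RRF paper:

```python
# Reciprocal Rank Fusion: merge ranked lists (e.g. vector search and BM25
# results) into one list. Documents appearing high in multiple lists win.
def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note how a document ranked second in both lists beats one ranked first in only one list, which is exactly the behavior that makes RRF robust to the differing score scales of vector and BM25 search.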

Ranking Models

Coarse ranking uses RRF (rank‑fusion) and ColBERT (late‑interaction dual‑tower). Fine ranking employs an interaction‑based cross‑encoder for higher relevance at the cost of latency.
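The ColBERT late-interaction score can be sketched in a few lines of NumPy. The point of the design is that `doc_vecs` (per-token document embeddings) are computed offline, so only the cheap MaxSim interaction runs at query time:

```python
import numpy as np

# ColBERT-style late interaction: each query token takes its maximum
# similarity over document tokens (MaxSim); the per-token maxima are summed.
def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """query_vecs: (q, d), doc_vecs: (n, d); rows assumed L2-normalized,
    so the dot product is cosine similarity."""
    sim = query_vecs @ doc_vecs.T        # (q, n) token-to-token similarities
    return float(sim.max(axis=1).sum())  # MaxSim per query token, then sum
```

A cross-encoder, by contrast, must run the full model over every (query, document) pair online, which is why it is reserved for the fine-ranking stage over a small candidate set.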

Generation Enhancements

After ranking, knowledge blocks are formatted and injected into a prompt template containing knowledge and question sections. A two‑stage generation (FoRAG) first produces an outline, then expands it to a full answer, improving structure and factuality.
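A minimal sketch of the prompt assembly. The article specifies only the two-section layout (knowledge, then question); the instruction wording and numbering scheme here are assumptions:

```python
# Prompt-assembly sketch: number the ranked knowledge blocks and inject them
# into a template with separate knowledge and question sections.
def build_prompt(chunks: list[str], question: str) -> str:
    knowledge = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return (
        "Answer using only the knowledge below; "
        "say so if it is insufficient.\n\n"
        f"Knowledge:\n{knowledge}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

In the two-stage FoRAG variant, this prompt would first elicit an outline, and a second call would expand the outline into the full answer.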

Practical Insights

Hybrid search improves recall and precision by leveraging semantic and lexical matching.

Chunk size selection (e.g., 256‑512 tokens) is a trade‑off between retrieval accuracy and context completeness.

Knowledge filtering reduces irrelevant content before generation.

Model choice matters: BGE‑M3 + BCE provide complementary embeddings; ColBERT offers fast offline indexing with online token‑level interaction.

System latency can be controlled by selecting lightweight ranking models when resources are limited.
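The knowledge-filtering insight above can be sketched as a simple threshold on a relevance score. The `relevance` callable is a stand-in for the NLI classifier mentioned earlier, not its actual interface:

```python
# Knowledge-filtering sketch: drop retrieved chunks that an NLI-style
# classifier scores as irrelevant to the query, before prompt assembly.
def filter_chunks(query: str, chunks: list[str], relevance, threshold: float = 0.5) -> list[str]:
    """relevance: (query, chunk) -> float in [0, 1], e.g. an entailment score."""
    return [chunk for chunk in chunks if relevance(query, chunk) >= threshold]
```

The threshold trades recall against precision: raising it keeps the prompt shorter and cleaner at the risk of discarding a relevant chunk.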

Q&A Highlights

Key deployment questions cover evaluation metrics (bad‑case resolution rate, overall accuracy), context completion strategies, latency mitigation, alternative optimization points beyond chunk size, and handling of multimodal data such as tables and audio‑video.

Tags: AI, RAG, Ranking, Hybrid Search, Enterprise Knowledge Retrieval, Retrieval-Augmented Generation
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
