Building a Production‑Ready RAG Engine: Architecture, Challenges & Solutions

This article examines the practical challenges of deploying Retrieval‑Augmented Generation in enterprise settings, outlines a layered RAG architecture with offline document processing and online query handling, and details the hybrid retrieval, multi‑stage ranking, knowledge filtering, and generation techniques that improve accuracy and reduce hallucinations.


Background

Large language models (LLMs) excel at generation but suffer from hallucinations, outdated knowledge, and privacy risks, especially in enterprise applications. Retrieval‑Augmented Generation (RAG) addresses these issues by coupling a retriever with a generator to ground responses in up‑to‑date external data.

Core RAG Architecture

RAG consists of several components: a data source, preprocessing modules, a retriever, a ranker, and a generator. The architecture can be visualized as a modular pipeline where each component can be swapped or upgraded independently.
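As a rough illustration of this modularity, the sketch below composes the online components behind a single interface so that any of them can be swapped independently; the class and method names are illustrative placeholders rather than the interfaces of the system described here.

```python
# Minimal sketch of a modular RAG pipeline; names are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float = 0.0


class Retriever(Protocol):
    def retrieve(self, query: str, top_k: int) -> List[Chunk]: ...


class Ranker(Protocol):
    def rank(self, query: str, chunks: List[Chunk]) -> List[Chunk]: ...


class Generator(Protocol):
    def generate(self, query: str, chunks: List[Chunk]) -> str: ...


class RAGPipeline:
    """Composes retriever, ranker, and generator so each can be swapped independently."""

    def __init__(self, retriever: Retriever, ranker: Ranker, generator: Generator):
        self.retriever = retriever
        self.ranker = ranker
        self.generator = generator

    def answer(self, query: str, top_k: int = 20, keep: int = 5) -> str:
        candidates = self.retriever.retrieve(query, top_k=top_k)
        ranked = self.ranker.rank(query, candidates)
        return self.generator.generate(query, ranked[:keep])
```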

System Design Overview

The system is organized into three vertical layers:

Algorithm layer – OCR, layout analysis, table extraction, and query rewriting.

Process layer – offline indexing (document parsing, tokenization, vector creation) and online answering (query rewrite, hybrid retrieval, ranking, generation). Underlying storage includes vector databases, Elasticsearch, and relational databases.

Management layer – knowledge‑base administration, model selection, and dialogue rules.

Offline Pipeline

Documents (PDF, Word) are ingested, OCR‑processed, and layout‑analyzed to preserve structural information. The content is then split into hierarchical chunks: first by layout (title, section, paragraph) and then by length (e.g., 256‑token segments) to balance retrieval relevance and generation completeness. Each chunk is indexed twice: as text for BM25 full‑text search and as dense vectors (produced by the BGE‑M3 and BCE models) stored in a vector database.
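A minimal sketch of the two‑level splitting is shown below. It assumes upstream layout analysis has already produced (section title, paragraph) pairs, and it approximates the 256‑token limit with whitespace tokens; both are simplifications for illustration.

```python
# Hedged sketch of hierarchical chunking: split by layout units first, then by length.
# Assumes parsing already yields (section_title, paragraph_text) pairs; token counting
# is approximated with whitespace splitting for illustration only.
from typing import Iterable, List, Tuple


def split_by_length(text: str, max_tokens: int = 256) -> List[str]:
    """Second-level split: cap each chunk at roughly max_tokens tokens."""
    tokens = text.split()
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def hierarchical_chunks(
    sections: Iterable[Tuple[str, str]], max_tokens: int = 256
) -> List[dict]:
    """First-level split by layout (section/paragraph), then by length."""
    chunks = []
    for title, paragraph in sections:
        for piece in split_by_length(paragraph, max_tokens):
            # Keep the section title with each chunk to preserve structural context.
            chunks.append({"section": title, "text": f"{title}\n{piece}"})
    return chunks
```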

Online Pipeline

When a user query arrives, a multi‑turn query‑rewrite module (implemented with TPLinker) expands and disambiguates the request. The rewritten query triggers a hybrid retrieval step that combines dense vector search and BM25 full‑text search. Retrieved candidates are merged and de‑duplicated.
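The sketch below illustrates the merge‑and‑de‑duplicate step under the assumption that the dense and BM25 searches are exposed as two callables returning scored hits; the signatures are placeholders, and final ordering is deliberately left to the downstream ranking stage.

```python
# Sketch of the hybrid retrieval step: run dense and BM25 searches, then merge and
# de-duplicate. The two search functions stand in for a vector store and a BM25/
# Elasticsearch index; their signatures are assumptions for illustration.
from typing import Callable, List

SearchFn = Callable[[str, int], List[dict]]  # returns [{"id": ..., "text": ..., "score": ...}]


def hybrid_retrieve(
    query: str,
    dense_search: SearchFn,
    bm25_search: SearchFn,
    top_k: int = 20,
) -> List[dict]:
    """Union the two candidate lists and de-duplicate by chunk id.

    Raw scores from the two retrievers are not comparable, so final ordering
    is left to the downstream RRF / re-ranking stage.
    """
    seen = set()
    merged = []
    for hit in dense_search(query, top_k) + bm25_search(query, top_k):
        if hit["id"] not in seen:
            seen.add(hit["id"])
            merged.append(hit)
    return merged
```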

Hybrid Retrieval & Ranking

Hybrid retrieval leverages the semantic coverage of dense vectors and the precision of BM25. To improve ranking, a two‑stage approach is used:

Coarse ranking with Reciprocal Rank Fusion (RRF) fuses the candidate lists by rank rather than by raw score, so it does not require the retrievers' scores to be comparable (see the sketch after this list).

Fine‑grained re‑ranking using ColBERT (a late‑interaction dual‑encoder) or a cross‑encoder model, which evaluates token‑level interactions for higher relevance.
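A minimal RRF sketch for the coarse‑ranking stage is shown below; the smoothing constant k = 60 is the value commonly used in the literature, not a figure stated in this article.

```python
# Minimal sketch of Reciprocal Rank Fusion for the coarse-ranking stage.
# Each input is an ordered list of chunk ids from one retriever (dense or BM25).
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # RRF score: sum over retrievers of 1 / (k + rank).
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Example: fuse a dense ranking and a BM25 ranking.
fused = reciprocal_rank_fusion([["c3", "c1", "c7"], ["c1", "c9", "c3"]])
```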

Knowledge filtering, implemented as a lightweight NLI classifier, removes candidates that are unrelated to the query, further enhancing precision.
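One plausible way to implement such a filter with an off‑the‑shelf NLI cross‑encoder is sketched below; the checkpoint name, label strings, and the keep/drop criterion are all assumptions rather than details of the system described here.

```python
# Hedged sketch of the knowledge-filtering step as a lightweight NLI classifier.
# The checkpoint is an illustrative public model, and the keep/drop criterion and
# label names depend on the chosen checkpoint.
from typing import Dict, List

from transformers import pipeline

nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-base")


def filter_chunks(query: str, chunks: List[Dict]) -> List[Dict]:
    """Keep only chunks the NLI model judges as entailing the query."""
    kept = []
    for chunk in chunks:
        # Premise = retrieved chunk, hypothesis = user query.
        pred = nli([{"text": chunk["text"], "text_pair": query}])[0]
        if pred["label"] == "entailment":  # criterion is an assumption
            kept.append(chunk)
    return kept
```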

Generation

After ranking, the top knowledge chunks are formatted into a prompt template that separates a knowledge section from the question section. The combined prompt is fed to the LLM. To mitigate hallucinations, a two‑stage generation (FoRAG) first produces an outline and then expands it into a full answer, ensuring the output stays grounded in the retrieved evidence.
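The sketch below illustrates one way the prompt assembly and the outline‑then‑expand pass could look; the template wording and the call_llm placeholder are assumptions, not the production prompt.

```python
# Sketch of prompt assembly plus a two-pass, outline-then-expand generation step.
# `call_llm` is a placeholder for the actual LLM client; the wording is illustrative.
from typing import Callable, List

PROMPT_TEMPLATE = """[Knowledge]
{knowledge}

[Question]
{question}

Answer strictly based on the knowledge above. If the knowledge is insufficient, say so."""


def answer_with_outline(question: str, chunks: List[str], call_llm: Callable[[str], str]) -> str:
    knowledge = "\n\n".join(f"({i + 1}) {c}" for i, c in enumerate(chunks))
    base_prompt = PROMPT_TEMPLATE.format(knowledge=knowledge, question=question)

    # Stage 1: produce an outline grounded in the retrieved evidence.
    outline = call_llm(base_prompt + "\n\nFirst, write a short outline of the answer.")

    # Stage 2: expand the outline into the full answer, still conditioned on the knowledge.
    return call_llm(base_prompt + f"\n\nOutline:\n{outline}\n\nNow expand the outline into the full answer.")
```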

Model Choices & Evaluation

Various vector models were benchmarked (BGE‑M3, BCE, M3E, GTE). BGE‑M3 and BCE were selected for their complementary strengths. Tokenizers such as jieba, LAC, and TexSmart were compared; a medium‑granularity tokenizer (cutword) offered the best trade‑off between recall and precision. Ranking models (RRF, ColBERT, cross‑encoder) were evaluated for latency and accuracy, leading to a hybrid strategy that balances speed and quality.
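As a rough illustration of how such comparisons can be scored, the snippet below computes retrieval recall@k over a labeled evaluation set; the data format and function signature are assumptions for illustration.

```python
# Minimal sketch of comparing retrieval recall@k across embedding models or tokenizers.
# The evaluation-set format (query -> ids of relevant chunks) is an assumption.
from typing import Callable, Dict, List, Set


def recall_at_k(
    eval_set: Dict[str, Set[str]],              # query -> ids of relevant chunks
    retrieve: Callable[[str, int], List[str]],  # returns ranked chunk ids
    k: int = 10,
) -> float:
    hits = 0
    for query, relevant in eval_set.items():
        retrieved = set(retrieve(query, k))
        # Count the query as a hit if any relevant chunk appears in the top k.
        hits += bool(retrieved & relevant)
    return hits / len(eval_set)
```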

Practical Insights

Key take‑aways include the importance of fine‑grained chunking, multi‑stage ranking, and knowledge filtering to achieve reliable enterprise‑grade RAG. Production systems must monitor latency, choose lightweight ranking models when resources are limited, and continuously evaluate bad‑case resolution rates.

Q&A Highlights

Typical deployment questions cover readiness metrics (bad‑case resolution, overall accuracy), handling incomplete context (layer‑wise augmentation), latency mitigation (model selection, hardware acceleration), and extending the pipeline to multimodal data such as images or tables.

Tags: LLM, RAG, Ranking, AI engineering, Hybrid Retrieval, Knowledge Filtering, Retrieval-Augmented Generation
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
