Boost Enterprise RAG: Data Pipeline Tricks, Hybrid Search & Rerank

To make Retrieval-Augmented Generation reliable in production, this article outlines five key engineering tactics: semantic chunking with metadata injection, hybrid vector-keyword search, two-stage retrieval with reranking, query rewriting and expansion, and dynamic evaluation of retrieval results, each illustrated with concrete examples and code snippets.


1. Semantic Chunking & Metadata Injection

Many beginner RAG projects mishandle long documents such as PDFs or Word files by naively splitting them into fixed-length token chunks, which breaks logical structures like tables. In production, the recommended approach is structure-aware chunking: first convert the document to a markup format (e.g., Markdown) that preserves headings, paragraphs, and table boundaries, then split along these semantic units.
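
As a minimal sketch of this idea (assuming the document has already been converted to Markdown), the code below splits a string on its headings so that each chunk corresponds to one logical section; the MarkdownChunker class and Chunk record are illustrative, not part of any particular framework.

import java.util.ArrayList;
import java.util.List;

// Minimal structure-aware chunking sketch: split a Markdown document on its
// headings so that each chunk maps to one logical section (illustrative only).
public class MarkdownChunker {

    public record Chunk(String heading, String body) {}

    public static List<Chunk> chunkByHeading(String markdown) {
        List<Chunk> chunks = new ArrayList<>();
        String currentHeading = "(preamble)";
        StringBuilder body = new StringBuilder();
        for (String line : markdown.split("\n")) {
            if (line.startsWith("#")) {                 // a new section begins here
                if (body.length() > 0) {
                    chunks.add(new Chunk(currentHeading, body.toString().trim()));
                    body.setLength(0);
                }
                currentHeading = line.replaceFirst("^#+\\s*", "");
            } else {
                body.append(line).append('\n');         // paragraphs and tables stay whole
            }
        }
        if (body.length() > 0) {
            chunks.add(new Chunk(currentHeading, body.toString().trim()));
        }
        return chunks;
    }
}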

After chunking, each text block must be enriched with contextual metadata (e.g., source file name, section, equipment name). This metadata is stored together with the content in the vector index so that the LLM receives sufficient context during retrieval.

{"文件":"2023维修手册","章节":"发动机保养","内容":"该设备的维护周期为六个月"}

With metadata, the model can distinguish that the sentence refers to a specific device, avoiding meaningless retrieval results.
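
In code, the enrichment step can be as simple as pairing each chunk with its metadata before indexing. The EnrichedChunk record and the commented-out vectorStore.upsert call below are hypothetical placeholders for whatever index client is actually in use.

import java.util.Map;

// Hypothetical enrichment step: wrap each chunk with contextual metadata
// before it is embedded and written to the vector index.
public class MetadataInjectionExample {

    public record EnrichedChunk(String content, Map<String, String> metadata) {}

    public static void main(String[] args) {
        EnrichedChunk enriched = new EnrichedChunk(
            "The maintenance cycle for this equipment is six months",
            Map.of("file", "2023 Maintenance Manual",
                   "section", "Engine Maintenance"));

        // A real pipeline would embed enriched.content() and store the vector together
        // with enriched.metadata(), e.g. vectorStore.upsert(...) (placeholder call).
        System.out.println(enriched);
    }
}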

2. Hybrid Search

Most open‑source RAG pipelines rely solely on dense vector retrieval, which works well for semantic similarity but struggles with exact business identifiers such as error codes (e.g., Error-0x9F4A). To handle these cases, a dual‑path recall (Hybrid Search) combines dense vector search with traditional keyword search (e.g., BM25 via Elasticsearch). The two result sets are merged using Reciprocal Rank Fusion (RRF), ensuring both semantic matches and precise term matches are covered.
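
A minimal RRF sketch is shown below: each document's fused score is the sum of 1/(k + rank) over the ranked lists it appears in, with k = 60 as a commonly used default; the document IDs are made up for illustration.

import java.util.*;

// Minimal Reciprocal Rank Fusion sketch: merge two ranked lists of document IDs
// by summing 1 / (k + rank) for every list in which a document appears.
public class RrfFusion {

    public static List<String> fuse(List<String> vectorHits, List<String> keywordHits, int k) {
        Map<String, Double> scores = new HashMap<>();
        accumulate(scores, vectorHits, k);
        accumulate(scores, keywordHits, k);
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
    }

    private static void accumulate(Map<String, Double> scores, List<String> hits, int k) {
        for (int rank = 0; rank < hits.size(); rank++) {
            scores.merge(hits.get(rank), 1.0 / (k + rank + 1), Double::sum);
        }
    }

    public static void main(String[] args) {
        // "doc-err-0x9F4A" only surfaces via keyword search; RRF keeps it in the merged list.
        List<String> vector = List.of("doc-semantic-1", "doc-semantic-2", "doc-err-0x9F4A");
        List<String> keyword = List.of("doc-err-0x9F4A", "doc-semantic-2");
        System.out.println(fuse(vector, keyword, 60));
    }
}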

3. Rerank (Two‑Stage Retrieval)

Feeding a large number of retrieved documents directly into the LLM invites the “Lost in the Middle” problem, where the model overlooks key information buried in the middle of a long context and hallucinates. The standard engineering solution is a two-stage pipeline:

First stage (coarse ranking): quickly retrieve ~50 candidate documents using vector search and BM25.

Second stage (fine ranking): apply a cross‑encoder reranker (e.g., BGE‑Reranker) to score the candidates precisely and keep only the top 3‑5 documents for the LLM prompt.

This reduces token usage, latency, and cost while preserving relevance.
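
A sketch of the second stage, assuming the ~50 coarse candidates have already been fetched by vector search and BM25: the CrossEncoder interface stands in for whatever reranker (e.g., BGE-Reranker behind an inference endpoint) the system actually calls.

import java.util.Comparator;
import java.util.List;

// Two-stage retrieval sketch: a cheap coarse search supplies ~50 candidates,
// then a cross-encoder reranker keeps only the few most relevant documents.
// CrossEncoder is a hypothetical stand-in for the actual reranker model in use.
public class TwoStageRetrieval {

    public record Document(String id, String text) {}

    public interface CrossEncoder {
        double score(String query, String documentText);   // relevance of a (query, document) pair
    }

    public static List<Document> retrieve(String query,
                                          List<Document> coarseCandidates,   // output of stage 1
                                          CrossEncoder reranker,
                                          int topK) {
        // Stage 2 (fine ranking): score every candidate against the query, keep the top K.
        return coarseCandidates.stream()
                .sorted(Comparator.comparingDouble(
                        (Document d) -> reranker.score(query, d.text())).reversed())
                .limit(topK)
                .toList();
    }
}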

4. Query Rewriting & Expansion

User queries in real‑world systems are often short, colloquial, and lack context. Before sending a query to the retriever, a lightweight preprocessing layer rewrites it into a self‑contained, context‑rich phrase. A small LLM can combine recent chat history with the current question to produce the rewritten query.

// Pseudo-code for query rewriting: a small, fast LLM combines recent chat
// history with the latest question to produce a self-contained query.
String userQuery = "What should I do if I forgot my password?";
String chatHistory = "[User: How do I log in to the OA system? System: You can scan the QR code in WeCom...]";
String rewrittenQuery = fastLlm.generate(
    "Based on the chat history below, rewrite the user's latest question as a specific, self-contained query. "
        + "History: " + chatHistory + " Question: " + userQuery
);
// rewrittenQuery becomes: "How do I recover a forgotten OA system password?"
List<Document> docs = retrievalPipeline.search(rewrittenQuery);

Beyond rewriting, query expansion generates several paraphrases of the original question and searches with them in parallel, reducing the chance that ambiguous phrasing causes relevant documents to be missed.
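
A sketch of that expansion step, using hypothetical Llm and Retriever interfaces in place of the fastLlm and retrievalPipeline objects from the snippet above:

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Query expansion sketch: generate a few paraphrases of the (already rewritten)
// query, search with each one, and merge the de-duplicated hits.
// Llm and Retriever are hypothetical interfaces, not a specific library API.
public class QueryExpansion {

    public interface Llm { List<String> paraphrase(String query, int n); }
    public interface Retriever { List<String> search(String query); }   // returns document IDs

    public static List<String> expandAndSearch(String query, Llm llm, Retriever retriever) {
        List<String> variants = new ArrayList<>(llm.paraphrase(query, 3));
        variants.add(query);                             // always keep the original phrasing

        LinkedHashSet<String> merged = new LinkedHashSet<>();
        for (String variant : variants) {
            merged.addAll(retriever.search(variant));    // union of hits across phrasings
        }
        return new ArrayList<>(merged);
    }
}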

5. Dynamic Evaluation of Retrieval Results

Traditional RAG follows a linear pipeline: query → retrieval → generation. If retrieval fails, the generated answer is useless. An advanced architecture adds a feedback loop where the LLM first scores the retrieved documents to decide whether they contain answerable clues. If they do, generation proceeds; otherwise, the system either informs the user of missing knowledge or falls back to an external search engine (e.g., Bing Search API) instead of hallucinating.

If documents contain answer cues → proceed to generation.

If none are relevant → return a “no knowledge” response or invoke an external search.
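
Sketched as a single branch below, with Judge, Generator, and WebSearch as hypothetical interfaces standing in for the grader prompt, the main LLM, and the external search API:

import java.util.List;

// Feedback-loop sketch: grade the retrieved documents before generating.
// Judge, Generator, and WebSearch are hypothetical interfaces for the grader
// prompt, the main LLM, and an external search API (e.g., Bing Search).
public class SelfCheckingRag {

    public interface Judge { boolean containsAnswer(String query, List<String> docs); }
    public interface Generator { String generate(String query, List<String> docs); }
    public interface WebSearch { List<String> search(String query); }

    public static String answer(String query, List<String> retrievedDocs,
                                Judge judge, Generator generator, WebSearch webSearch) {
        if (judge.containsAnswer(query, retrievedDocs)) {
            return generator.generate(query, retrievedDocs);       // documents contain answer cues
        }
        List<String> webDocs = webSearch.search(query);            // fall back to external search
        if (!webDocs.isEmpty()) {
            return generator.generate(query, webDocs);
        }
        return "Sorry, the knowledge base does not cover this question.";  // honest "no knowledge" reply
    }
}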

These five practices—semantic chunking with metadata, hybrid search, two‑stage reranking, query rewriting/expansion, and dynamic evaluation—significantly improve the reliability and usability of enterprise‑grade RAG systems.

Tags: metadata, RAG, Retrieval-Augmented Generation, AI engineering, Query Rewriting, Hybrid Search, Reranking

Written by Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
