Engineering and Algorithm Innovations for RAG Engines in Office Applications

This article analyzes the challenges of building a Retrieval-Augmented Generation (RAG) system for office scenarios and the practical solutions adopted, covering background issues, modular architecture, offline and online pipelines, hybrid retrieval, ranking models, knowledge filtering, prompt design, and two-stage generation techniques.

DataFunTalk

Background

Large language models (LLMs) face hallucination, freshness, and data‑privacy problems when deployed in real‑world applications. Retrieval‑Augmented Generation (RAG) addresses these issues by combining external knowledge retrieval with generation.

Core RAG Architecture

The modular RAG architecture consists of data sources, data processing, a retriever, a ranker, and a generator. It is easiest to understand as the last of three stages: the traditional RAG flow (indexing → retrieval → generation), an advanced flow that adds pre-processing steps such as query rewrite and HyDE (Hypothetical Document Embeddings), and the modular flow, which adds components such as routing and knowledge guidance.

System Design Overview

Our system is layered:

Algorithm layer: OCR, multi‑turn query rewrite, tokenization, table recognition.

Process layer: offline ingestion (document parsing, tokenization, vector indexing) and online QA (query rewrite, hybrid retrieval, ranking, generation). Underlying stores include vector DB, Elasticsearch, MySQL.

User‑config layer: knowledge‑base management, model management, dialogue rules.

Offline Processing

Documents (PDF, Word) are parsed, split into logical blocks, and indexed both as text and as vectors. Chunk size has to be balanced: chunks that are too long (e.g., 512 tokens) may lose information when indexed, while chunks that are too short lack surrounding context and hurt retrieval.
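The chunk-size trade-off above can be illustrated with a minimal sketch that greedily packs consecutive logical blocks into chunks under a token budget. The function name `pack_blocks`, the default budget of 256, and the whitespace word count used in place of a real tokenizer are all assumptions made for illustration; the article does not describe the actual splitting code.

```python
def pack_blocks(blocks, max_tokens=256):
    """Greedily merge consecutive logical blocks into chunks of at most max_tokens.

    A single block longer than max_tokens is kept whole as its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for block in blocks:
        # Whitespace word count stands in for a real tokenizer (assumption).
        n = len(block.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(block)
        current_len += n
    if current:
        chunks.append("\n".join(current))
    return chunks


if __name__ == "__main__":
    paragraphs = ["a b c", "d e", "f g h i", "j"]
    for chunk in pack_blocks(paragraphs, max_tokens=5):
        print(repr(chunk))
```

Packing whole blocks rather than cutting at a fixed offset keeps paragraph boundaries intact, so `max_tokens` becomes the single knob for trading retrieval precision against context completeness.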

Online Processing

When a user query arrives, multi‑turn query rewrite (treated as a relation‑extraction task using TPLinker) enriches the query. Hybrid retrieval combines vector search (semantic similarity, multilingual, multimodal) with BM25 full‑text search. The two result sets are merged using Reciprocal Rank Fusion (RRF) to produce a unified candidate list.
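As a rough illustration of the fusion step, RRF scores each document as the sum over result lists of 1 / (k + rank), where rank is the document's 1-based position in that list. The sketch below assumes k = 60, a commonly used default; the article does not state which constant the system uses, and the function name and document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: iterable of lists, each ordered from best to worst.
    k: smoothing constant (60 is a common default, assumed here).
    top_n: optionally truncate the fused list.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused if top_n is None else fused[:top_n]


if __name__ == "__main__":
    vector_hits = ["d1", "d2", "d3"]   # hypothetical vector-search results
    bm25_hits = ["d3", "d1", "d4"]     # hypothetical BM25 results
    print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

Because RRF uses only rank positions, it needs no score normalization between the vector and BM25 result lists and no model inference.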

Ranking Models

We employ a two‑stage ranking strategy:

Coarse ranking (RRF) selects the top 20 from 100 candidates without model inference.

Fine-grained ranking uses ColBERT, a late-interaction dual-tower model, to compute token-level similarities, followed by an interaction-based cross-encoder that selects the final top 5.

ColBERT’s token‑wise vectors preserve semantic detail while remaining efficient for online queries.
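The late-interaction scoring that ColBERT is known for (often called MaxSim) can be sketched as follows: for each query token vector, take the maximum similarity against all document token vectors, then sum those maxima. This is a simplified pure-Python illustration, assuming the token vectors are already L2-normalized so that a dot product equals cosine similarity; it is not the system's implementation, and the toy vectors are made up.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query tokens of the best-matching doc token.

    query_vecs, doc_vecs: non-empty lists of equal-length float vectors,
    assumed to be L2-normalized (so dot product == cosine similarity).
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]   # two query tokens (toy 2-d vectors)
    doc_a = [[1.0, 0.0], [0.0, 1.0]]   # covers both query tokens
    doc_b = [[1.0, 0.0], [1.0, 0.0]]   # covers only the first
    ranked = sorted(
        {"doc_a": doc_a, "doc_b": doc_b}.items(),
        key=lambda kv: maxsim_score(query, kv[1]),
        reverse=True,
    )
    print([name for name, _ in ranked])
```

Document token vectors can be computed offline and stored, so only the query has to be encoded at request time, which is what keeps this scoring cheap enough for online use.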

Knowledge Filtering

A lightweight NLI‑based binary classifier filters out irrelevant knowledge after ranking, offering a cost‑effective plug‑in compared to training additional ranking models.
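The plug-in nature of this filter can be sketched as a thin wrapper around any relevance scorer. In the sketch below, `relevance_fn` stands for the NLI-based classifier, whose model and interface the article does not specify; `toy_overlap` is a deliberately trivial word-overlap placeholder so the example runs, not an NLI model, and the 0.5 threshold is an assumption.

```python
def filter_knowledge(query, chunks, relevance_fn, threshold=0.5):
    """Keep only chunks whose relevance score for the query meets the threshold.

    relevance_fn(query, chunk) -> float in [0, 1]; in a real system this would
    wrap the binary classifier's positive-class probability.
    """
    return [c for c in chunks if relevance_fn(query, c) >= threshold]


def toy_overlap(query, chunk):
    """Placeholder scorer: fraction of query words that appear in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


if __name__ == "__main__":
    ranked_chunks = [
        "To reset the vpn password open the portal",
        "Cafeteria menu for Friday",
    ]
    print(filter_knowledge("reset vpn password", ranked_chunks, toy_overlap))
```

Because the filter only consumes the ranker's output, it can be added or removed without retraining any of the ranking models.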

Prompt Engineering and Generation

Selected knowledge chunks are formatted into a “knowledge” section of a prompt template, combined with the user question, and fed to the LLM. To improve answer structure, we adopt a two‑stage FoRAG approach: first generate an outline, then expand it into the final answer, ensuring alignment with the query and retrieved context.
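A minimal sketch of the prompt assembly and the outline-then-expand flow might look like the following. The template wording, the function names, and the `llm` callable (any function mapping a prompt string to a completion string) are assumptions for illustration; the article does not give the actual prompt or model interface.

```python
PROMPT_TEMPLATE = (
    "Answer the question using only the knowledge below.\n\n"
    "Knowledge:\n{knowledge}\n\n"
    "Question: {question}\n"
)


def build_prompt(question, chunks):
    """Format the selected chunks into a numbered knowledge section plus the question."""
    knowledge = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(knowledge=knowledge, question=question)


def two_stage_answer(question, chunks, llm):
    """Stage 1 asks for an outline; stage 2 expands that outline into the answer."""
    base = build_prompt(question, chunks)
    outline = llm(base + "\nWrite a short outline of the answer.")
    return llm(
        base
        + "\nOutline:\n"
        + outline
        + "\nExpand the outline into the final answer."
    )


if __name__ == "__main__":
    def echo_llm(prompt):
        # Stand-in model (assumption): reports the prompt length instead of answering.
        return f"<completion for {len(prompt)}-char prompt>"

    print(two_stage_answer("How do I book a room?", ["Rooms are booked in the portal."], echo_llm))
```

Both stages receive the same knowledge section, so the expansion step stays tied to the retrieved context rather than to the outline alone.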

Evaluation and Insights

Key observations include:

Document parsing quality directly impacts RAG performance; PDF requires OCR and layout recovery, while Word retains structural tags.

Segmentation granularity (e.g., 128 vs. 512 tokens) trades off retrieval recall against context completeness.

Hybrid retrieval outperforms pure vector or pure BM25 search by leveraging semantic breadth and exact matching.

RRF provides a fast, model-free fusion method for coarse ranking, whereas ColBERT balances speed and accuracy for fine-grained ranking.

Knowledge filtering reduces hallucinations by discarding mismatched passages.

Summary

Building a production-grade RAG system for office knowledge bases requires careful engineering at every layer: robust document parsing, effective query rewrite, hybrid retrieval, multi-stage ranking, knowledge filtering, and structured prompt design. Together, these components yield more comprehensive retrieval, more accurate ranking, and more precise answers.

Q&A Highlights

Q1: Launch criteria focus on bad‑case resolution rate and overall accuracy measured through manual QA.

Q2: Context gaps are filled by aggregating a chunk's sibling and parent levels in the document hierarchy while respecting model input limits.

Q3: Latency is mitigated by selecting lightweight rankers (e.g., ColBERT) when hardware is constrained.

Q4: Chunk size and parsing fidelity are primary optimization levers.

Q5: Multimodal support (image/audio) is planned for future extensions.

Q6: Current approach feeds entire tables to the LLM; precise region extraction remains an open challenge.
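The context-filling idea from Q2 can be sketched as follows: start from the retrieved chunk, add its parent heading if it fits, then add the nearest siblings that fit until the input budget is used up. All names here are hypothetical, the whitespace word count is a stand-in for the model's tokenizer, and the nearest-first order is an assumption; the article does not describe the actual aggregation policy.

```python
def count_tokens(text):
    # Whitespace word count stands in for the model's tokenizer (assumption).
    return len(text.split())


def expand_context(parent_title, siblings, hit_index, max_tokens):
    """Return the hit chunk plus parent title and nearby siblings within max_tokens.

    siblings: chunks under the same parent, in document order.
    hit_index: position of the retrieved chunk within siblings.
    """
    budget = max_tokens - count_tokens(siblings[hit_index])
    use_title = count_tokens(parent_title) <= budget
    if use_title:
        budget -= count_tokens(parent_title)
    chosen = {hit_index}
    for dist in range(1, len(siblings)):
        added = False
        for i in (hit_index - dist, hit_index + dist):
            if 0 <= i < len(siblings):
                cost = count_tokens(siblings[i])
                if cost <= budget:
                    chosen.add(i)
                    budget -= cost
                    added = True
        if not added:
            break  # nothing at this distance fits, so stop expanding
    parts = [parent_title] if use_title else []
    parts += [siblings[i] for i in sorted(chosen)]  # keep document order
    return "\n".join(parts)


if __name__ == "__main__":
    section = ["a a", "b b b", "c", "d d d d"]
    print(expand_context("Title", section, hit_index=1, max_tokens=7))
```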

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
