How to Build an Enterprise‑Grade Intelligent Document QA System with Everything plus RAG

This article walks through the need for fast, accurate answers from massive document collections, compares plain keyword search and pure LLM chat, and presents a hybrid Retrieval‑Augmented Generation solution built with open‑source components, detailing architecture, hybrid retrieval, prompt engineering, deployment, performance tuning, and common pitfalls.

Mingyi World Elasticsearch

Dec 20, 2025

Technical staff often need to locate precise answers across large collections of manuals, API references, notes, code, and configuration files. A hybrid Retrieval‑Augmented Generation (RAG) pipeline combines fast keyword search with semantic vector search, ensuring answers are both quick and grounded in the source documents.

RAG workflow

Retrieval : find relevant passages from the document store.

Augmentation : feed the retrieved passages to the LLM as context.

Generation : let the LLM produce an answer using only that context.
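
The three stages can be sketched as one small function. This is a minimal illustration rather than the article's implementation: `retrieve` and `generate` are hypothetical callables standing in for the search engine and the LLM client.

```python
from typing import Callable, List


def answer_question(
    question: str,
    retrieve: Callable[[str], List[str]],  # hypothetical: returns relevant passages
    generate: Callable[[str], str],        # hypothetical: sends a prompt to the LLM
) -> str:
    """Run retrieve -> augment -> generate for a single question."""
    # 1. Retrieval: fetch candidate passages from the document store.
    passages = retrieve(question)
    if not passages:
        return "No relevant information in the documents"

    # 2. Augmentation: place the passages in the prompt as numbered context.
    context = "\n".join(f"[Block{i}] {p}" for i, p in enumerate(passages, start=1))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

    # 3. Generation: the LLM answers from the supplied context.
    return generate(prompt)
```

Passing the retriever and the LLM call as parameters keeps the orchestration testable without a running search engine or model.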

System architecture

Backend : Python + Flask (lightweight HTTP service).

Search engine : Elasticsearch 9.x or Easysearch 2.0+ (supports BM25 and dense‑vector retrieval).

Vector model : Sentence‑Transformers (open‑source embedding model).

LLM : DeepSeek or Ollama (local, Chinese‑friendly models).

Database : MySQL (user management).

Hybrid retrieval: BM25 + vector

Pure BM25 returns fast exact-match results but ignores semantics. Pure dense-vector search captures semantic similarity but is slower. The system merges the two result lists with Reciprocal Rank Fusion (RRF), which scores each document from its 1-based rank in each list, damped by a constant k:

# Core formula
RRF_score = 1/(k + rank_BM25) + 1/(k + rank_vector)

# Example (k=60)
Document A: rank_BM25=1, rank_vector=3 → score≈0.0323
Document B: rank_BM25=5, rank_vector=1 → score≈0.0318

A document found by only one retriever contributes just one of the two terms, so documents ranked well in both lists score highest in the fused ranking.
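
The fusion step can be sketched in a few lines of Python. The function name `rrf_fuse` and the filler document IDs are assumptions made for this example; the input lists are built so that A and B hold the same ranks as in the worked example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each list is ordered best-first and ranks are 1-based. A document that is
    missing from a list contributes no term for that list.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# A is 1st in BM25 and 3rd in vector search; B is 5th in BM25 and 1st in
# vector search. The "x*" entries are filler documents.
bm25_hits = ["A", "x1", "x2", "x3", "B"]
vector_hits = ["B", "x4", "A"]
fused = dict(rrf_fuse([bm25_hits, vector_hits]))
print(round(fused["A"], 4), round(fused["B"], 4))  # 0.0323 0.0318
```

Because RRF uses ranks only, the BM25 and vector similarity scores never need to be normalized onto a common scale.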

Comparison of retrieval methods

BM25 : fast, exact keyword match; does not understand semantics; ideal for code, API names, or exact terms.

Vector search : semantic similarity, finds related concepts; slower computation; ideal for concept search.
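
As an illustration of the two retrieval modes, the request bodies below follow the Elasticsearch search API: a `match` query for BM25 and a top-level `knn` clause for dense-vector search. The field names `content` and `vector` and the `num_candidates` heuristic are assumptions for this sketch; the article does not show its actual queries, and Easysearch may use a different kNN syntax.

```python
from typing import Any, Dict, List


def bm25_body(query: str, size: int = 10) -> Dict[str, Any]:
    """Keyword search: a `match` query is scored with BM25 by default."""
    return {"size": size, "query": {"match": {"content": query}}}


def knn_body(query_vector: List[float], size: int = 10) -> Dict[str, Any]:
    """Semantic search: approximate kNN over a dense_vector field."""
    return {
        "size": size,
        "knn": {
            "field": "vector",
            "query_vector": query_vector,
            "k": size,
            # Candidates examined per shard; larger is more accurate but slower.
            "num_candidates": max(100, size * 10),
        },
    }
```

Each body would be sent to the `_search` endpoint separately, and the two hit lists would then be fused with RRF.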

Query rewriting

Short user queries often retrieve few results. The system automatically expands a query into multiple variants and searches each variant before fusing the results.

Original query: "how to use the system"
Variants:
1. "how to use the system detailed explanation"
2. "what is how to use the system"
3. "how to use the system steps"
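
A template-based expansion along these lines might look as follows. The templates are English renderings of the three example variants, and `expand_query` is a hypothetical name; the article does not list its full template set.

```python
from typing import List, Sequence

# "{q}" is replaced by the original query. Illustrative templates only.
TEMPLATES = ("{q} detailed explanation", "what is {q}", "{q} steps")


def expand_query(query: str, templates: Sequence[str] = TEMPLATES) -> List[str]:
    """Return the original query followed by its de-duplicated variants."""
    query = query.strip()
    variants = [query] + [t.format(q=query) for t in templates]
    seen, unique = set(), []
    for variant in variants:
        if variant not in seen:
            seen.add(variant)
            unique.append(variant)
    return unique


print(expand_query("how to use the system"))
```

Each returned string is searched separately, and the per-variant hit lists are then fused into a single ranking.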

Prompt engineering

The prompt forces the LLM to answer strictly from the retrieved context and to cite the source block.

prompt = f"""
You are a document‑question answering assistant. Answer the question strictly based on the following context:

[Block1] {doc1_content}
[Block2] {doc2_content}
[Block3] {doc3_content}

Rules:
1. Use only the above information.
2. If the answer is not in the context, reply "No relevant information in the documents".
3. Cite the source block, e.g., "According to [Block1]...".

Question: {user_question}

Answer:
"""

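In practice the number of retrieved blocks varies, so the prompt is usually assembled from a list. The helper below is a sketch under that assumption; `build_prompt` is a hypothetical name, and the wording simply reuses the prompt text shown above.

```python
from typing import List


def build_prompt(question: str, blocks: List[str]) -> str:
    """Assemble the grounded-answer prompt for any number of retrieved blocks."""
    context = "\n".join(
        f"[Block{i}] {text.strip()}" for i, text in enumerate(blocks, start=1)
    )
    return (
        "You are a document-question answering assistant. "
        "Answer the question strictly based on the following context:\n\n"
        f"{context}\n\n"
        "Rules:\n"
        "1. Use only the above information.\n"
        '2. If the answer is not in the context, reply "No relevant information in the documents".\n'
        '3. Cite the source block, e.g., "According to [Block1]...".\n\n'
        f"Question: {question}\n\n"
        "Answer:"
    )
```

Numbering the blocks in code keeps the citation labels in the answer aligned with the order of the retrieved passages.
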
Disk‑scanning feature (Everything‑like)

The system can scan local disks, automatically index supported file types (30+ formats such as TXT, MD, PDF, DOCX, XLSX, PPTX, PY, JS, JAVA, CPP, GO, PHP, TS, RS, JSON, YAML, XML, INI, CONF, ENV, HTML, CSS, SCSS, SH, BAT, PS1), and perform incremental updates by checking file modification timestamps.

# Incremental update logic
if file.mtime > last_index_time:
    re_index(file)   # file changed, re‑index
else:
    skip(file)       # unchanged, skip
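
A runnable version of this check, combined with the directory walk, could look like the sketch below. The extension set is a small illustrative subset, and `changed_files` is a hypothetical helper name, not the project's actual function.

```python
import os
from typing import Iterator, Set

# A small subset of the supported extensions, for illustration only.
INDEXABLE: Set[str] = {".txt", ".md", ".pdf", ".docx", ".py", ".json", ".yaml"}


def changed_files(root: str, last_index_time: float) -> Iterator[str]:
    """Yield indexable files under `root` modified after `last_index_time`.

    `last_index_time` is a Unix timestamp, e.g. time.time() recorded when the
    previous scan started.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() not in INDEXABLE:
                continue  # unsupported format
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) > last_index_time:
                    yield path  # new or modified since the last run
            except OSError:
                continue  # file vanished or is unreadable; skip it
```

Note that a timestamp comparison only detects new and modified files; removing deleted files from the index needs a separate pass that compares indexed paths with what is on disk.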

Deployment steps

Install Python dependencies: pip install -r requirements.txt

Start Elasticsearch 9.x or Easysearch 2.0+.

Install MySQL 5.7+ and run python init_database.py to create tables.

Launch the service: python app.py and open http://localhost:16666.

Upload documents via the UI or trigger a disk scan; the system indexes files automatically.

Performance optimizations

Batch queries with Elasticsearch _msearch API.

Cache identical queries to avoid recomputation.

Frontend pagination to limit result payload.

Bulk indexing using _bulk (100 documents per batch).

Asynchronous background indexing for uploaded files.

Incremental updates process only modified files.

Batch vector generation, e.g., model.encode(texts, batch_size=32), speeds up embedding by >10×.
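
To illustrate the bulk-indexing item, the sketch below groups documents into `_bulk` request bodies of 100 documents each. The `_bulk` API expects newline-delimited JSON with an action line before every document; the index name `docs` used in the usage note and the helper name `bulk_payloads` are assumptions for this example.

```python
import json
from typing import Dict, Iterable, Iterator, List


def bulk_payloads(docs: Iterable[Dict], index: str, batch_size: int = 100) -> Iterator[str]:
    """Yield newline-delimited JSON bodies for the _bulk API, one per batch."""
    lines: List[str] = []
    count = 0
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc, ensure_ascii=False))        # document source
        count += 1
        if count == batch_size:
            yield "\n".join(lines) + "\n"  # a _bulk body must end with a newline
            lines, count = [], 0
    if lines:
        yield "\n".join(lines) + "\n"      # final partial batch
```

For example, `bulk_payloads(docs, "docs")` over 250 documents yields three bodies, each of which would be posted to `_bulk` with the `application/x-ndjson` content type.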

Common pitfalls and fixes

Vector dimension mismatch

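The dims value of the dense_vector field in the Elasticsearch mapping must equal the output dimension of the embedding model; otherwise vectors are rejected at index time. Example for all‑MiniLM‑L6‑v2 (384‑dim):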

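from sentence_transformers import SentenceTransformer

# Verify model dimension
model = SentenceTransformer('all-MiniLM-L6-v2')
print(model.get_sentence_embedding_dimension())  # 384

# Correct mapping: field definitions go under "properties"
mapping = {
    "properties": {
        "vector": {"type": "dense_vector", "dims": 384}
    }
}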
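
A cheap guard is to compare vector length with the mapped dimension before indexing, so a model swap fails fast with a clear message. The sketch assumes the mapping nests its fields under `properties`, as Elasticsearch index mappings do; `check_vector_dims` is a hypothetical helper.

```python
from typing import Any, Dict, Sequence


def check_vector_dims(mapping: Dict[str, Any], field: str, vector: Sequence[float]) -> None:
    """Raise ValueError if `vector` does not match the mapped `dims` of `field`."""
    expected = mapping["properties"][field]["dims"]
    if len(vector) != expected:
        raise ValueError(f"field '{field}' expects {expected} dims, got {len(vector)}")


example_mapping = {"properties": {"vector": {"type": "dense_vector", "dims": 384}}}
check_vector_dims(example_mapping, "vector", [0.0] * 384)  # passes silently
```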

PDF text extraction garbled

Replace PyPDF2 with pdfplumber for better Chinese support:

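import pdfplumber

# pdf_path: path to the PDF file to extract
with pdfplumber.open(pdf_path) as pdf:
    # extract_text() may return None or "" for pages without a text layer
    pages = [page.extract_text() or "" for page in pdf.pages]
text = "\n".join(pages)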

Long‑document retrieval loss

Use overlapping chunking to preserve context:

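# Older LangChain releases expose this class as langchain.text_splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50, length_function=len)
chunks = splitter.split_text(text)  # text: the extracted document string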
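
To make the effect of the overlap concrete, here is a simplified fixed-window chunker in plain Python. It is not what RecursiveCharacterTextSplitter does internally (that class also prefers to split on separators such as paragraph breaks); it only illustrates how chunk_size and chunk_overlap interact.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With the default settings each chunk repeats the last 50 characters of its predecessor, so a sentence that straddles a boundary is still found intact in at least one chunk, provided it is shorter than the overlap.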

Key takeaways

RAG consists of three simple steps: retrieve, augment, generate.

Hybrid retrieval (BM25 + vector) leverages the speed of keyword matching and the semantic power of dense vectors.

Well‑crafted prompts prevent hallucinations and keep answers grounded in the retrieved context.

Engineering concerns—performance tuning, error handling, and user experience—are essential for a production‑ready system.
