Build a Zero‑Cost Open‑Source RAG Smart Document Q&A System from Scratch
This guide walks through building an open‑source Retrieval‑Augmented Generation (RAG) system that scans and indexes local files in the spirit of the Everything desktop search tool, combines BM25 keyword search with vector search in Elasticsearch, and answers questions with a local LLM, covering architecture, core techniques, deployment steps, performance tweaks, and common pitfalls.
Why a RAG‑enabled document Q&A system?
Technical professionals often face massive piles of documentation—API manuals, project notes, code and config files—making it hard to locate precise answers quickly. Traditional full‑text search (e.g., Elasticsearch) is fast but lacks semantic understanding, while pure LLM chat (ChatGPT, Claude, etc.) can hallucinate and ignore the actual documents. Combining both approaches yields fast, accurate, and reliable answers.
What is RAG?
RAG (Retrieval‑Augmented Generation) works like an open‑book exam: the AI first retrieves relevant passages from a document store, then generates an answer based on that context. The three core steps are:
Retrieval: locate relevant chunks in the document library.
Augmentation: feed the retrieved chunks to the LLM as context.
Generation: let the LLM produce a response grounded in the supplied context.
System Architecture
The "Everything plus" system follows a classic three‑layer design:
Backend: Python + Flask (lightweight HTTP API).
Search Engine: Elasticsearch 9.x or Easysearch 2.0+ (supports vector search).
Vector Model: Sentence‑Transformers (open‑source embedding model).
LLM: DeepSeek, or local models served via Ollama (Chinese‑friendly options).
Database: MySQL for user management.
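To make the layering concrete, here is a minimal sketch of how a request could flow through the stack; the endpoint path, index name, and model choice are illustrative assumptions, not the project's actual code:

from flask import Flask, request, jsonify
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

app = Flask(__name__)
model = SentenceTransformer("all-MiniLM-L6-v2")   # hypothetical choice; any embedding model works
es = Elasticsearch("http://localhost:9200")

@app.route("/api/ask", methods=["POST"])          # hypothetical route name
def ask():
    question = request.json["question"]
    query_vector = model.encode(question).tolist()
    # Dense (kNN) retrieval; the full system also runs a BM25 match query
    # and fuses the two result lists with RRF (see below).
    hits = es.search(
        index="docs",                             # hypothetical index name
        knn={"field": "vector", "query_vector": query_vector,
             "k": 5, "num_candidates": 50},
    )["hits"]["hits"]
    context = [hit["_source"]["content"] for hit in hits]
    return jsonify({"context": context})          # the LLM generation step is omitted here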
Core Techniques
Hybrid retrieval combines BM25 keyword matching with dense‑vector similarity to leverage the strengths of both methods. The results are merged using Reciprocal Rank Fusion (RRF):
# Core formula (k is a smoothing constant, commonly 60)
RRF_score = 1/(k + rank1) + 1/(k + rank2)
# Examples with k = 60
rank1=1, rank2=3 → score ≈ 0.0323
rank1=5, rank2=1 → score ≈ 0.0318
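A minimal sketch of this fusion in Python, assuming each retriever returns document IDs in rank order (the function name is illustrative):

def rrf_fuse(bm25_ids, vector_ids, k=60):
    # Sum reciprocal-rank contributions from both ranked lists.
    scores = {}
    for ranked_list in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

Query rewriting expands a short user query into multiple variants to improve recall (a sketch follows the prompt below), and a carefully crafted prompt forces the LLM to answer strictly from the retrieved context, reducing hallucinations.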
prompt = f"""
You are a document assistant. Answer only using the following context:
[Block1] {doc1_content}
[Block2] {doc2_content}
[Block3] {doc3_content}
Rules:
1. Use only the above information.
2. If missing, say "No relevant info in documents".
3. Cite sources, e.g., "According to [Block1]...".
Question: {user_question}
Answer:
"""Key Features
Inspired by the Everything desktop search tool, the system can scan local disks, automatically index supported file types (30+ formats, including TXT, PDF, DOCX, code files, JSON, etc.), and perform intelligent incremental updates to avoid re‑processing unchanged files.
# Incremental update example
if file.mtime > last_index_time:
    re_index(file)   # changed since the last run: re-extract, re-embed, re-index
else:
    skip(file)       # unchanged: keep the existing index entry

Deployment Steps
Install Python dependencies: pip install -r requirements.txt
Start Elasticsearch/Easysearch (version 9.0 or compatible).
Install MySQL 5.7+ and run python init_database.py to create tables.
Run the application: python app.py and open http://localhost:16666.
Upload documents via the web UI or trigger a disk scan; the system indexes files automatically.
Ask questions; the system returns answers with cited source blocks (an example request follows).
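Once the server is running, a question can also be posted programmatically; the /api/ask route below matches the hypothetical endpoint sketched earlier and may differ in the actual project:

import requests

resp = requests.post(
    "http://localhost:16666/api/ask",   # hypothetical route on the documented port
    json={"question": "Which config file sets the indexing schedule?"},
    timeout=30,
)
print(resp.json())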
Performance Optimizations
Batch queries using the Elasticsearch _msearch API.
Cache identical queries.
Frontend pagination to limit result size.
Bulk indexing with _bulk (100 docs per batch).
Asynchronous background indexing for uploads.
Batch vector generation (e.g., model.encode(texts, batch_size=32)) speeds up embedding by >10×; a combined sketch follows this list.
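A sketch combining the last two items, batch embedding plus bulk indexing; the index name, field names, and model choice are assumptions:

from elasticsearch import Elasticsearch, helpers
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("all-MiniLM-L6-v2")

def index_chunks(chunks):
    # One batched encode call instead of one call per chunk.
    vectors = model.encode(chunks, batch_size=32)
    actions = (
        {"_index": "docs",
         "_source": {"content": text, "vector": vec.tolist()}}
        for text, vec in zip(chunks, vectors)
    )
    helpers.bulk(es, actions, chunk_size=100)  # 100 docs per _bulk request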
Common Pitfalls & Solutions
Vector dimension mismatch
# Wrong mapping (512 dims) vs model (384 dims)
mapping = {"vector": {"type": "dense_vector", "dims": 512}}
# Correct mapping
mapping = {"vector": {"type": "dense_vector", "dims": 384}}PDF text extraction garbled
Replace PyPDF2 with pdfplumber for better Chinese support.
import pdfplumber

with pdfplumber.open(pdf_path) as pdf:
    text = ""
    for page in pdf.pages:
        text += page.extract_text() or ""  # extract_text() returns None for image-only pages

Long‑document retrieval loss
Use overlapping chunking (e.g., 500‑char chunks with 50‑char overlap) to preserve context.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50, length_function=len)
chunks = splitter.split_text(text)

Conclusion
The Everything plus RAG system demonstrates that building a production‑grade, open‑source document Q&A service is straightforward: combine fast BM25 search, semantic vector search, RRF fusion, query rewriting, and prompt engineering. Proper engineering—incremental indexing, batch processing, caching, and robust error handling—makes the solution practical and reliable for everyday documentation needs.