Building an Elasticsearch‑Powered RAG Q&A System: Theory and Full Code Walkthrough
This article walks through the principles of Retrieval‑Augmented Generation (RAG) and provides a complete Python implementation using Elasticsearch, covering document chunking, semantic embedding, bulk indexing, hybrid BM25‑vector search, RRF result fusion, prompt design, LLM invocation, and a practical demo.
RAG definition
RAG (Retrieval‑Augmented Generation) retrieves relevant passages from a document store, feeds them as context to a large language model (LLM), and lets the LLM generate an answer grounded in the retrieved context.
End‑to‑end pipeline (5 stages)
1. Query rewriting
Short user queries are expanded into multiple variants to improve recall. Example variants for “部署系统”:
部署系统
部署系统 详细步骤
部署系统 说明文档
什么是 部署系统
如何 部署系统
2. Document chunking
Documents are split with RecursiveCharacterTextSplitter using a chunk size of 500 characters and an overlap of 50 characters, preserving semantic continuity.
3. Semantic embedding
Each chunk is encoded with the lightweight all-MiniLM-L6-v2 model (384‑dimensional vectors). Model URL: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
4. Bulk indexing into Elasticsearch
# Create index
mapping = {
'mappings': {
'properties': {
'content': {
'type': 'text',
'analyzer': 'ik_max_word' # Chinese tokenizer
},
'embedding': {
'type': 'dense_vector',
'dims': 384,
'index': True, # enable fast vector search
'similarity': 'cosine'
},
'file_path': {'type': 'keyword'},
'chunk_id': {'type': 'integer'}
}
}
}Setting index: True is required; otherwise vector search is very slow.
5. Hybrid retrieval (BM25 + vector)
BM25 provides fast exact‑term matching, while kNN vector search captures semantic similarity. The two ranked lists are merged with Reciprocal Rank Fusion (RRF):
RRF_score(doc) = Σ 1 / (k + rank_i)where rank_i is the document’s rank in the i‑th list and k defaults to 60. RRF requires no training, boosts documents appearing in both lists, and is robust to differing score scales.
Prompt construction
PROMPT_TEMPLATE = """你是一个专业的问答助手。请严格根据以下上下文回答问题。
【重要规则】
1. 只能使用提供的上下文信息,不能编造
2. 如果上下文中没有相关信息,明确回答"无法从文档中找到相关信息"
3. 引用信息时标注来源,格式:[块1] [块2]
4. 答案要详细、准确、逻辑清晰
【上下文】
{context}
【问题】
{question}
【回答】
"""The template enforces strict use of retrieved context, explicit no‑answer handling, and source citation.
LLM invocation
The demo uses DeepSeek; key parameters are: temperature=0.3 to reduce randomness max_tokens=1000 to limit answer length
End‑to‑end example
Test query: “如何配置 Elasticsearch 的分词器?”
result = rag_query("如何配置 Elasticsearch 的分词器")Sample output shows the original question, expanded query variants, number of retrieved chunks, prompt length, and a concise answer with cited sources and no hallucination.
Key takeaways
Chunk size 500 chars and overlap 50 chars balance granularity and context continuity.
Embedding dimension 384 matches all-MiniLM-L6-v2 output.
Bulk indexing with dense_vector fields and index=True enables fast semantic search.
Hybrid BM25 + vector retrieval combines precise term matching with semantic recall.
RRF provides a simple, effective fusion without score normalization.
Prompt design with strict constraints and source citation prevents hallucination.
LLM parameters (temperature, max_tokens) control answer stability and length.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Mingyi World Elasticsearch
The leading WeChat public account for Elasticsearch fundamentals, advanced topics, and hands‑on practice. Join us to dive deep into the ELK Stack (Elasticsearch, Logstash, Kibana, Beats).
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
