Artificial Intelligence 9 min read

Building an Elasticsearch‑Powered RAG Q&A System: Theory and Full Code Walkthrough

This article walks through the principles of Retrieval‑Augmented Generation (RAG) and provides a complete Python implementation using Elasticsearch, covering document chunking, semantic embedding, bulk indexing, hybrid BM25‑vector search, RRF result fusion, prompt design, LLM invocation, and a practical demo.

Mingyi World Elasticsearch

Dec 28, 2025

Building an Elasticsearch‑Powered RAG Q&A System: Theory and Full Code Walkthrough

RAG definition

RAG (Retrieval‑Augmented Generation) retrieves relevant passages from a document store, feeds them as context to a large language model (LLM), and lets the LLM generate an answer grounded in the retrieved context.

End‑to‑end pipeline (5 stages)

1. Query rewriting

Short user queries are expanded into multiple variants to improve recall. Example variants for “部署系统”:

部署系统

部署系统详细步骤

部署系统说明文档

什么是部署系统

如何部署系统

2. Document chunking

Documents are split with RecursiveCharacterTextSplitter using a chunk size of 500 characters and an overlap of 50 characters, preserving semantic continuity.

3. Semantic embedding

Each chunk is encoded with the lightweight all-MiniLM-L6-v2 model (384‑dimensional vectors). Model URL: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

4. Bulk indexing into Elasticsearch

# Create index
mapping = {
  'mappings': {
    'properties': {
      'content': {
        'type': 'text',
        'analyzer': 'ik_max_word'  # Chinese tokenizer
      },
      'embedding': {
        'type': 'dense_vector',
        'dims': 384,
        'index': True,  # enable fast vector search
        'similarity': 'cosine'
      },
      'file_path': {'type': 'keyword'},
      'chunk_id': {'type': 'integer'}
    }
  }
}

Setting index: True is required; otherwise vector search is very slow.

5. Hybrid retrieval (BM25 + vector)

BM25 provides fast exact‑term matching, while kNN vector search captures semantic similarity. The two ranked lists are merged with Reciprocal Rank Fusion (RRF):

RRF_score(doc) = Σ 1 / (k + rank_i)

where rank_i is the document’s rank in the i‑th list and k defaults to 60. RRF requires no training, boosts documents appearing in both lists, and is robust to differing score scales.

Prompt construction

PROMPT_TEMPLATE = """你是一个专业的问答助手。请严格根据以下上下文回答问题。

【重要规则】
1. 只能使用提供的上下文信息，不能编造
2. 如果上下文中没有相关信息，明确回答"无法从文档中找到相关信息"
3. 引用信息时标注来源，格式：[块1] [块2]
4. 答案要详细、准确、逻辑清晰

【上下文】
{context}

【问题】
{question}

【回答】
"""

The template enforces strict use of retrieved context, explicit no‑answer handling, and source citation.

LLM invocation

The demo uses DeepSeek; key parameters are: temperature=0.3 to reduce randomness max_tokens=1000 to limit answer length

End‑to‑end example

Test query: “如何配置 Elasticsearch 的分词器？”

result = rag_query("如何配置 Elasticsearch 的分词器")

Sample output shows the original question, expanded query variants, number of retrieved chunks, prompt length, and a concise answer with cited sources and no hallucination.

Key takeaways

Chunk size 500 chars and overlap 50 chars balance granularity and context continuity.

Embedding dimension 384 matches all-MiniLM-L6-v2 output.

Bulk indexing with dense_vector fields and index=True enables fast semantic search.

Hybrid BM25 + vector retrieval combines precise term matching with semantic recall.

RRF provides a simple, effective fusion without score normalization.

Prompt design with strict constraints and source citation prevents hallucination.

LLM parameters (temperature, max_tokens) control answer stability and length.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Python Prompt Engineering Elasticsearch RAG Retrieval-Augmented Generation Hybrid Search RRF

Written by

Mingyi World Elasticsearch

The leading WeChat public account for Elasticsearch fundamentals, advanced topics, and hands‑on practice. Join us to dive deep into the ELK Stack (Elasticsearch, Logstash, Kibana, Beats).

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

RAG definition

End‑to‑end pipeline (5 stages)

1. Query rewriting

2. Document chunking

3. Semantic embedding

4. Bulk indexing into Elasticsearch

5. Hybrid retrieval (BM25 + vector)

Prompt construction

LLM invocation

End‑to‑end example

Key takeaways

Mingyi World Elasticsearch

How this landed with the community

Was this worth your time?

0 Comments

5. Hybrid retrieval (BM25 + vector)