
How I Doubled RAG Accuracy with These Optimizations

This article walks through a complete RAG pipeline, identifying common pitfalls from document preprocessing to prompt construction, and provides concrete Python and Java examples, chunking strategies, embedding tweaks, hybrid retrieval, reranking, advanced techniques, and evaluation methods to reliably double retrieval accuracy.

IT Services Circle

Jun 20, 2026


Why RAG often fails

Most teams treat RAG as a simple linear pipeline (document → split → embed → feed the LLM) and stop there. An error in any one stage propagates through the rest of the chain, so improving accuracy requires looking at the entire workflow.

1. Document parsing

Real‑world corpora contain scanned PDFs, Word tables, PPT slides and headers/footers. If these are fed raw, the LLM receives garbage.

Scanned PDFs : run OCR. The example below uses PaddleOCR (better for Chinese than Tesseract).

import fitz  # PyMuPDF
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang='ch')

def pdf_to_text_with_ocr(pdf_path):
    doc = fitz.open(pdf_path)
    full_text = []
    for page in doc:
        text = page.get_text()
        if len(text.strip()) < 50:  # likely a scanned page
            pix = page.get_pixmap()
            img_path = f"temp_page_{page.number}.png"
            pix.save(img_path)
            result = ocr.ocr(img_path, cls=True)
            page_text = ''
            # result[0] can be None when no text is detected on the page
            for line in result[0] or []:
                page_text += line[1][0] + '\n'  # line[1] is (text, confidence)
            full_text.append(page_text)
        else:
            full_text.append(text)
    return '\n'.join(full_text)

Word/PDF tables : extract them to Markdown so the model can understand the structure. Example using pandas and python‑docx.

import pandas as pd
from docx import Document

def extract_tables_from_docx(docx_path):
    doc = Document(docx_path)
    tables_md = []
    for table in doc.tables:
        data = []
        for row in table.rows:
            data.append([cell.text for cell in row.cells])
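        # First row is treated as the header when the table has more than one row
        df = pd.DataFrame(data[1:], columns=data[0]) if len(data) > 1 else pd.DataFrame(data)
        tables_md.append(df.to_markdown(index=False))  # to_markdown needs the tabulate package
    return '\n\n'.join(tables_md)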

Metadata attachment : add a few key fields (e.g., source_file, section, page, created_at) to each chunk. This enables source‑based filtering and citation.

chunk_metadata = {
    "source_file": "2025_sales_report.pdf",
    "section": "Chapter 3 – East China Sales",
    "page": 23,
    "created_at": "2025-12-01"
}

Keep metadata lightweight: a handful of fields like those above is enough.
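As a rough illustration of how such metadata can travel with each chunk, the sketch below wraps text and metadata in a small dataclass and shows source-based filtering plus a citation string. The `Chunk` class and the helpers `attach_metadata`, `filter_by_source` and `cite` are hypothetical names introduced here, and the sample texts are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A piece of text plus the metadata describing where it came from."""
    text: str
    metadata: dict = field(default_factory=dict)


def attach_metadata(texts, source_file, section, page):
    """Wrap raw chunk texts so every chunk carries the same source fields."""
    return [
        Chunk(text=t, metadata={"source_file": source_file, "section": section, "page": page})
        for t in texts
    ]


def filter_by_source(chunks, source_file):
    """Keep only the chunks that came from the given file."""
    return [c for c in chunks if c.metadata.get("source_file") == source_file]


def cite(chunk):
    """Build a human-readable citation from the chunk's metadata."""
    m = chunk.metadata
    return f"{m['source_file']}, {m['section']}, p. {m['page']}"


# Made-up sample texts, for illustration only.
chunks = attach_metadata(
    ["First sample passage.", "Second sample passage."],
    source_file="2025_sales_report.pdf",
    section="Chapter 3: East China Sales",
    page=23,
)
chunks += attach_metadata(["Unrelated wiki text."], "wiki.md", "Intro", 1)

sales_only = filter_by_source(chunks, "2025_sales_report.pdf")
print(len(sales_only))      # 2
print(cite(sales_only[0]))  # 2025_sales_report.pdf, Chapter 3: East China Sales, p. 23
```

In a real vector store the same fields would usually be passed as the per-document metadata, so the store can apply the filter itself.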

2. Document splitting

Chunk size is a “linchpin”. Too small loses context; too large hurts retrieval precision and blows up token usage.

2.1 Splitting method comparison

Fixed length – simple but can cut sentences in half.

By sentence – preserves whole sentences but may produce tiny chunks.

By paragraph – keeps semantic units but long paragraphs reduce precision.

Recursive splitting – paragraph → sentence → fixed length; balances semantics and control.

Semantic splitting – uses embedding similarity to cut where topic changes; best quality but expensive.
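To make the idea behind semantic splitting concrete, here is a minimal sketch that starts a new chunk whenever two adjacent sentences fall below a similarity threshold. The `toy_embed` function is a bag-of-words stand-in, not a real embedding model, and the threshold of 0.2 is an arbitrary assumption; in practice you would pass in your embedding model's encoder and tune the threshold.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in embedding: a bag-of-words count vector (NOT a real model)."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_split(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk whenever adjacent sentences are less similar than `threshold`."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "The refund policy covers all orders.",
    "Refund requests for orders take five days.",
    "Our office cat is named Biscuit.",
]
for chunk in semantic_split(sentences):
    print(chunk)
```

The first two sentences share vocabulary and stay together; the third has nothing in common with them and becomes its own chunk. The cost mentioned above comes from calling a real embedding model once per sentence.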

2.2 Practical recipe (LangChain)

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,          # target characters per chunk
    chunk_overlap=200,      # 20‑30% overlap to preserve context
    separators=["\n\n", "\n", "。", "！", "？", "；", "，", " ", ""]
)
chunks = splitter.split_text(long_document)

Empirically, chunk_size 500‑1000 and overlap 20‑30% works well for most Chinese/English docs. Overlap prevents loss of definitions (e.g., a Spring transaction description split across chunks would otherwise miss the REQUIRED part).
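The effect of overlap can be seen with a plain fixed-length splitter. The sketch below is a simplified stand-in written for this illustration (it is not LangChain's algorithm), and the sample sentence and sizes are chosen only so that the phrase "is REQUIRED" straddles a chunk boundary.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-length character windows; consecutive windows share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


text = "The default propagation is REQUIRED in Spring."
phrase = "is REQUIRED"

no_overlap = split_with_overlap(text, chunk_size=30, chunk_overlap=0)
with_overlap = split_with_overlap(text, chunk_size=30, chunk_overlap=8)

print(any(phrase in c for c in no_overlap))    # False: the phrase is cut in two
print(any(phrase in c for c in with_overlap))  # True: the second chunk contains it whole
```

Overlap only protects spans shorter than the overlap itself, which is one reason the ratio is worth tuning per corpus.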

3. Embedding (vectorisation)

Swapping generic models (e.g., text‑embedding‑ada‑002, bge‑large‑zh, m3e) usually yields <5% gain because they do not understand domain‑specific terminology such as SKU, PO, BOM.

3.1 Effective solutions

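Fine-tune the embedding model on a small labelled set (300-1000 positive/negative sentence pairs). In the author's example, fine-tuning with sentence-transformers raised domain recall from 68% to 82%.

# Minimal sketch. The base model is an assumption; pick one that matches your
# corpus language, and write the training pairs in that language too.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

train_data = [
    ("The user requests a refund", "The customer applies to return goods", 1),  # positive
    ("The user requests a refund", "System maintenance notice", 0),             # negative
    # ... more pairs
]
model = SentenceTransformer("BAAI/bge-large-zh")
examples = [InputExample(texts=[a, b], label=y) for a, b, y in train_data]
loader = DataLoader(examples, shuffle=True, batch_size=16)
model.fit(train_objectives=[(loader, losses.ContrastiveLoss(model))], epochs=1)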

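Hybrid retrieval (BM25 + vector) : retrieve candidates with keyword search (e.g., rank_bm25) and with dense vectors, then merge the two sets before reranking.

from rank_bm25 import BM25Okapi

# Build the BM25 index. Whitespace tokenisation only suits space-delimited
# languages; Chinese text needs a word segmenter instead.
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

bm25_results = bm25.get_top_n(query.split(), corpus, n=20)     # list of str
vector_results = vector_store.similarity_search(query, k=20)   # list of Document
vector_texts = [doc.page_content for doc in vector_results]
# Merge both lists as plain text, dropping duplicates but keeping order
final_candidates = list(dict.fromkeys(bm25_results + vector_texts))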

Hybrid search typically adds 15‑20% recall for proprietary identifiers.
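A plain union of the two result lists throws away the rank each retriever assigned. One common way to keep that signal is reciprocal rank fusion (RRF), which is a technique swapped in here for illustration and not something the pipeline above uses. The sketch assumes each retriever returns an ordered list of document IDs; the IDs and the constant `k=60` (a commonly used default) are assumptions.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_n=None):
    """Fuse several ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=lambda d: scores[d], reverse=True)
    return fused[:top_n] if top_n else fused


bm25_ranking = ["doc_sku", "doc_a", "doc_b"]      # keyword search, best first
vector_ranking = ["doc_a", "doc_c", "doc_sku"]    # dense search, best first

print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
# ['doc_a', 'doc_sku', 'doc_c', 'doc_b']
```

Documents found by both retrievers rise to the top, while a document that only one retriever ranks highly still survives into the candidate set.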

Query rewriting : transform a noisy, colloquial question into a retrieval‑friendly statement.

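def rewrite_query(original_query):
    prompt = f"""
    Rewrite the user's natural-language question into a form suited to retrieval:
    1. Extract the key entities (e.g. order number, product name, time range)
    2. Phrase it as a declarative statement
    3. Drop filler words and irrelevant information
    Original question: {original_query}
    Rewritten query:
    """
    return llm.invoke(prompt)

# Example
# "Why hasn't that return order from last month been processed yet?" → "2025-05-01 return order processing status"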

4. Retrieval & reranking

Recall quantity matters less than ranking quality. Vector search sorts results by embedding similarity, which is only a proxy for how well a passage actually answers the query. A reranker (cross‑encoder) scores each query-passage pair jointly and reorders the top‑k candidates.

4.1 Model comparison

Cohere Rerank – best accuracy, free tier, API‑based.

BGE‑Reranker – open‑source, can run locally; good for data‑sensitive scenarios.

Cross‑Encoder (e.g., BAAI/bge‑reranker‑v2‑m3 ) – highest precision but slower; suited for small candidate sets.

ColBERT – balances speed and accuracy for large‑scale retrieval.

4.2 Rerank code example

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('BAAI/bge-reranker-v2-m3')

def rerank_results(query, candidates, top_k=5):
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)
    scored = list(zip(candidates, scores))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in scored[:top_k]]

retrieved_docs = vector_store.similarity_search(query, k=20)
reranked_docs = rerank_results(query, retrieved_docs, top_k=5)

Tip : never rerank the whole corpus – first fetch a modest top_k (20‑50) then rerank.

5. Prompt engineering (context construction)

The final answer quality is dictated by the prompt fed to the LLM.

5.1 Basic vs advanced templates

Basic prompt (often mediocre):

prompt = f"""
Answer the question based on the following materials:
{''.join(chunks)}

Question: {query}
"""

Improved prompt adds source IDs, few‑shot examples and a “don’t hallucinate” rule.

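prompt = f"""
Answer the user's question based on the reference materials below. Each reference is numbered [1], [2], and so on; cite the source number in your answer.

Reference materials:
[1] From "MySQL Performance Optimization Guide", Chapter 3: index design principles…
[2] From the internal company wiki "Order System Design Document": CREATE TABLE statement for the orders table…
[3] From "2025 Tech Weekly": slow query optimization case study…

User question: {query}

Give the answer, and mark the source after each key piece of information (e.g. [1]).
"""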

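The numbered reference block in the improved prompt is hard-coded for readability. A minimal sketch of generating it from retrieved chunks follows; it assumes each chunk is a dict with hypothetical `source` and `text` keys, and the helper names are introduced here for illustration.

```python
def format_references(chunks):
    """Number each chunk and prefix it with its source so the model can cite [1], [2], ..."""
    lines = []
    for i, chunk in enumerate(chunks, start=1):
        source = chunk.get("source", "unknown source")
        lines.append(f"[{i}] From {source}: {chunk['text']}")
    return "\n".join(lines)


def build_cited_prompt(query, chunks):
    """Assemble a prompt that asks for answers annotated with source numbers."""
    return (
        "Answer the user's question using the reference materials below. "
        "Cite the source number after each key statement, e.g. [1].\n\n"
        f"Reference materials:\n{format_references(chunks)}\n\n"
        f"User question: {query}\n"
    )


# Placeholder chunks, for illustration only.
chunks = [
    {"source": "MySQL Performance Optimization Guide, Chapter 3", "text": "Index design principles..."},
    {"source": "internal wiki, Order System Design Document", "text": "CREATE TABLE statement for the orders table..."},
]
print(build_cited_prompt("How should I index the orders table?", chunks))
```

Keeping the numbering in one helper means the citation markers in the answer can be mapped back to chunk metadata after the LLM call.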
Few‑shot example:

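prompt = f"""
Follow the format of the examples below when answering.

Example 1:
Question: How do I check the MySQL version?
Answer: Run SELECT VERSION(); to see the MySQL version.

Example 2:
Question: How do I configure the log level in Spring Boot?
Answer: Set logging.level.root to INFO in application.yml.

Now answer:
Question: {query}
Reference materials: {chunks}
Answer:
"""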

Never‑answer‑if‑unknown rule (critical to avoid hallucination):

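prompt = f"""
Important rule: if the reference materials contain no explicit information, reply "The available materials do not answer this question" and do not make up an answer.

Reference materials: {chunks}

Question: {query}
Answer:
"""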

5.2 Handling token overflow

When the concatenated chunks exceed the model’s context window, truncate by relevance (already sorted by rerank) or summarise long chunks.

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
max_prompt_tokens = 3500  # reserve space for answer
selected = []
total = 0
for chunk in reranked_chunks:
    n = len(enc.encode(chunk.page_content))
    if total + n > max_prompt_tokens:
        break
    selected.append(chunk)
    total += n

Summarisation (one‑sentence) can be done with a short LLM call:

def summarize_chunk(text):
    prompt = f"Summarise the following content in one sentence:\n{text}"
    return llm.invoke(prompt)
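The two strategies can be combined: keep chunks in relevance order, and when one does not fit the remaining budget, try a summarised version before giving up on it. The sketch below is one possible arrangement, not the author's implementation. `pack_context` is a hypothetical helper, and the word counter and first-sentence "summariser" are toy stand-ins for a real tokenizer and an LLM call.

```python
def pack_context(chunks, max_tokens, count_tokens, summarize=None):
    """Select chunks in relevance order until the token budget is used up.

    chunks       -- texts already sorted by relevance (best first)
    count_tokens -- callable returning the token count of a string
    summarize    -- optional callable, tried on a chunk that does not fit
    """
    selected, total = [], 0
    for text in chunks:
        n = count_tokens(text)
        if total + n > max_tokens and summarize is not None:
            text = summarize(text)   # try a shorter version
            n = count_tokens(text)
        if total + n > max_tokens:
            continue                 # still too large: skip it
        selected.append(text)
        total += n
    return selected, total


# Toy stand-ins so the sketch runs without an LLM or tokenizer.
def word_count(s):
    return len(s.split())


def first_sentence(s):
    return s.split(".")[0] + "."


chunks = [
    "Indexes speed up reads. They slow down writes. Choose them carefully.",
    "Slow queries are logged. The log threshold is configurable. Review it weekly.",
]
selected, used = pack_context(chunks, max_tokens=16, count_tokens=word_count, summarize=first_sentence)
print(selected)  # first chunk in full, second chunk reduced to its first sentence
print(used)      # 15
```

Unlike the truncation loop above, this version skips an oversized chunk instead of stopping, so a short lower-ranked chunk can still use the remaining budget. Whether that is desirable depends on how much you trust the ranking.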

5.3 Multi‑turn conversation

Include recent dialogue history (e.g., last 3 turns) in both the retrieval query and the final prompt.

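def build_context_with_history(query, history, retrieved_chunks):
    # Keep only the last three turns to limit prompt size
    history_str = "\n".join(
        f"User: {h['user']}\nAssistant: {h['assistant']}" for h in history[-3:]
    )
    prompt = f"""
    Conversation history:
    {history_str}

    Current user question: {query}

    Reference materials:
    {retrieved_chunks}

    Answer the question using both the conversation history and the reference materials.
    """
    return prompt

# Query rewriting with history
def query_with_history(current_query, history):
    prompt = f"""
    Conversation history: {history}
    Latest question: {current_query}
    Merge the history and the latest question into one complete query.
    """
    return llm.invoke(prompt)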

6. Advanced techniques

6.1 Self‑Query Retriever

Let the LLM parse the user question into structured filters (e.g., year=2025, region='East China') and then perform metadata‑aware retrieval.

from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_fields = [
    AttributeInfo(name="year", description="Year the document belongs to", type="integer"),
    AttributeInfo(name="region", description="Region: East China / South China / North China", type="string"),
    AttributeInfo(name="doc_type", description="Document type", type="string")
]

retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vector_store,
    document_contents="Company sales data and technical documents",
    metadata_field_info=metadata_fields
)

# Example: query "East China sales data for 2025" → filter {"year": 2025, "region": "East China"}

6.2 Multi‑path retrieval & fusion

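import re
from langchain_core.documents import Document

def multi_path_retrieve(query):
    results = []
    # Path 1: dense vector
    results.extend(vector_store.similarity_search(query, k=10))
    # Path 2: BM25 keyword (bm25_search is assumed to return Document objects)
    results.extend(bm25_search(query, k=10))
    # Path 3: direct DB lookup for IDs like PO-123456789
    if re.fullmatch(r"PO-\d{9}", query):
        # db is a placeholder handle; use a parameterised query, whose
        # placeholder syntax depends on the database driver
        rows = db.query("SELECT * FROM orders WHERE order_id = ?", (query,))
        results.extend(Document(page_content=str(row)) for row in rows)
    # Deduplicate by text, then rerank with rerank_results from section 4.2
    unique = {doc.page_content: doc for doc in results}
    return rerank_results(query, list(unique.values()), top_k=5)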

6.3 HyDE (Hypothetical Document Embedding)

For very abstract queries, first let the LLM generate a hypothetical answer, embed that text, and use it to retrieve real documents.

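def hyde_retrieve(query):
    hypothetical = llm.invoke(f"Answer the following question with a detailed paragraph:\n{query}")
    return vector_store.similarity_search(hypothetical, k=10)

Effective for short queries (<5 words), but if the hypothetical answer is a hallucination it can pull retrieval toward the wrong documents.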

6.4 Window retrieval

When a relevant chunk is found, also fetch its surrounding chunks to provide context.

def retrieve_with_window(chunk_id, chunks_list, window=2):
    start = max(0, chunk_id - window)
    end = min(len(chunks_list), chunk_id + window + 1)
    return chunks_list[start:end]
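When several hits sit close together, their windows overlap and the same neighbour would be sent to the LLM twice. A small sketch of merging the windows first is shown below; `merge_windows` is a hypothetical helper that follows the same index arithmetic as `retrieve_with_window`.

```python
def merge_windows(hit_indices, total_chunks, window=2):
    """Expand each hit index by `window` neighbours and merge overlapping ranges.

    Returns a sorted list of (start, end) pairs with `end` exclusive.
    """
    ranges = sorted(
        (max(0, i - window), min(total_chunks, i + window + 1)) for i in hit_indices
    )
    merged = []
    for start, end in ranges:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


chunks_list = [f"chunk-{i}" for i in range(10)]
for start, end in merge_windows([3, 4, 9], len(chunks_list), window=1):
    print(chunks_list[start:end])
# ['chunk-2', 'chunk-3', 'chunk-4', 'chunk-5']
# ['chunk-8', 'chunk-9']
```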

6.5 Separate indexes per domain

Instead of a single massive index, maintain specialised indexes (e.g., tech_doc, customer_service, product_manual) and route the query after intent classification.

def classify_intent(query):
    # lightweight classifier returning 'tech_doc', 'customer_service', or 'product_manual'
    ...

def route_query(query):
    intent = classify_intent(query)
    if intent == 'tech_doc':
        return tech_vectorstore.similarity_search(query)
    elif intent == 'customer_service':
        return service_vectorstore.similarity_search(query)
    else:
        return product_vectorstore.similarity_search(query)
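The classifier body is left open above. As one possible starting point, the sketch below routes by keyword overlap; the keyword lists are invented for illustration, and a production system would more likely use a small trained classifier or an LLM call.

```python
# Hypothetical keyword lists; replace them with terms from your own domains.
INTENT_KEYWORDS = {
    "tech_doc": ["api", "error", "exception", "deploy", "config"],
    "customer_service": ["refund", "return", "complaint", "order", "delivery"],
}


def classify_intent(query, default="product_manual"):
    """Pick the intent whose keyword list matches the most words in the query."""
    words = set(query.lower().split())
    best_intent, best_hits = default, 0
    for intent, keywords in INTENT_KEYWORDS.items():
        hits = len(words & set(keywords))
        if hits > best_hits:
            best_intent, best_hits = intent, hits
    return best_intent


print(classify_intent("How do I fix this deploy error"))   # tech_doc
print(classify_intent("Where is my refund"))               # customer_service
print(classify_intent("How long does the battery last"))   # product_manual
```

Falling back to a default index when nothing matches keeps the router from failing on unseen phrasing, at the cost of occasionally searching the wrong domain.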

7. Production‑grade Java RAG pipeline (Spring AI + Chroma)

package com.example.rag;

// Written against the Spring AI 1.0 API (EmbeddingModel, SearchRequest.builder(),
// Document.getText()); earlier milestone releases use different names.
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.ai.vectorstore.filter.Filter;
import org.springframework.ai.vectorstore.filter.FilterExpressionBuilder;
import org.springframework.stereotype.Service;
import java.util.*;

@Service
public class RAGPipeline {
    private final VectorStore vectorStore;
    private final EmbeddingModel embeddingModel;
    private final ChatClient chatClient;
    private final RerankerService reranker;   // application-defined, not shown here
    private final QueryRewriter queryRewriter;

    public RAGPipeline(VectorStore vectorStore, EmbeddingModel embeddingModel,
                       ChatClient.Builder chatBuilder, RerankerService reranker,
                       QueryRewriter queryRewriter) {
        this.vectorStore = vectorStore;
        this.embeddingModel = embeddingModel;
        this.chatClient = chatBuilder.build();
        this.reranker = reranker;
        this.queryRewriter = queryRewriter;
    }

    /** Full RAG question-answering flow. */
    public String ask(String userQuestion) {
        return askWithHistory(userQuestion, Collections.emptyList());
    }

    public String askWithHistory(String userQuestion, List<Map<String, String>> history) {
        // 1. Rewrite the query (including multi-turn context)
        String rewritten = queryRewriter.rewriteWithHistory(userQuestion, history);
        // 2. Multi-path retrieval
        List<Document> retrieved = multiPathRetrieve(rewritten);
        // 3. Rerank
        List<Document> reranked = reranker.rerank(rewritten, retrieved, 5);
        // 4. Build a prompt that carries source information
        String prompt = buildPrompt(reranked, userQuestion, history);
        // 5. Call the LLM
        return chatClient.prompt(prompt).call().content();
    }

    private List<Document> multiPathRetrieve(String query) {
        Set<String> seen = new HashSet<>();
        List<Document> all = new ArrayList<>();
        // Vector search
        SearchRequest vectorRequest = SearchRequest.builder().query(query).topK(20).build();
        for (Document d : vectorStore.similaritySearch(vectorRequest)) {
            if (seen.add(d.getId())) all.add(d);
        }
        // Keyword search (placeholder)
        for (Document d : keywordSearch(query, 20)) {
            if (seen.add(d.getId())) all.add(d);
        }
        // Exact ID match via a metadata filter
        if (query.matches("\\d+")) {
            Filter.Expression f = new FilterExpressionBuilder().eq("id", query).build();
            SearchRequest idRequest = SearchRequest.builder()
                    .query(query).topK(10).filterExpression(f).build();
            for (Document d : vectorStore.similaritySearch(idRequest)) {
                if (seen.add(d.getId())) all.add(d);
            }
        }
        return all;
    }

    private String buildPrompt(List<Document> docs, String question, List<Map<String, String>> history) {
        StringBuilder ctx = new StringBuilder();
        int idx = 1;
        for (Document d : docs) {
            // Metadata values are Objects, so convert them explicitly
            String src = String.valueOf(d.getMetadata().getOrDefault("source", "unknown source"));
            String page = String.valueOf(d.getMetadata().getOrDefault("page", ""));
            ctx.append(String.format("[%d] Source: %s", idx++, src));
            if (!page.isEmpty()) ctx.append(", page ").append(page);
            ctx.append("\n").append(d.getText()).append("\n\n");
        }
        StringBuilder hist = new StringBuilder();
        if (!history.isEmpty()) {
            hist.append("Conversation history:\n");
            for (Map<String, String> turn : history) {
                hist.append("User: ").append(turn.get("user")).append("\n");
                hist.append("Assistant: ").append(turn.get("assistant")).append("\n");
            }
            hist.append("\n");
        }
        return String.format(
            "You are a professional Q&A assistant. Answer the user's question based on the reference materials below.\n\n"
            + "Important rule: if the reference materials contain no explicit information, reply "
            + "\"The available materials do not answer this question\" and do not make up an answer.\n\n"
            + "%s\nReference materials:\n%s\nUser question: %s\nAnswer:",
            hist.toString(), ctx.toString(), question);
    }

    private List<Document> keywordSearch(String query, int topK) {
        // Can be backed by Elasticsearch / Lucene; returns an empty list as a placeholder
        return new ArrayList<>();
    }
}

The accompanying QueryRewriter component rewrites queries with recent dialogue context.

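package com.example.rag;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Component;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

@Component
public class QueryRewriter {
    private final ChatClient chatClient;
    public QueryRewriter(ChatClient.Builder builder) { this.chatClient = builder.build(); }

    public String rewriteWithHistory(String query, List<Map<String, String>> history) {
        if (history.isEmpty()) return rewrite(query);
        String historyText = history.stream()
            .map(t -> "User: " + t.get("user") + "\nAssistant: " + t.get("assistant"))
            .collect(Collectors.joining("\n"));
        String prompt = String.format(
            "Conversation history:\n%s\nLatest user question: %s\n"
            + "Merge the conversation history and the latest question into one complete retrieval query. "
            + "Output only the rewritten query.",
            historyText, query);
        return chatClient.prompt(prompt).call().content();
    }

    public String rewrite(String query) {
        // Simple rule: strip common Chinese modal particles
        return query.replaceAll("啊|呢|吧|哦|呀", "").trim();
    }
}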

This Java implementation covers multi‑path retrieval, reranking, source‑annotated prompting and multi‑turn handling. RerankerService is not shown and keywordSearch is a placeholder, so both must be supplied before the class is integrated into a Spring Boot project.

8. Evaluation

8.1 Core metrics

Recall : relevant docs retrieved ÷ total relevant docs.

Precision : relevant docs retrieved ÷ total retrieved docs.

MRR (Mean Reciprocal Rank) : average of 1 ÷ rank of the first correct document.

Answer Accuracy : human judgment of the final answer.
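The three retrieval metrics above translate directly into a few lines of code. The sketch below assumes documents are identified by string IDs and that `retrieved` is ordered best first; the function names and sample IDs are illustrative.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    relevant = set(relevant)
    return len(relevant & set(retrieved)) / len(relevant) if relevant else 0.0


def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved_set = set(retrieved)
    return len(retrieved_set & set(relevant)) / len(retrieved_set) if retrieved_set else 0.0


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


retrieved = ["doc_9", "doc_123", "doc_7", "doc_456"]
relevant = ["doc_123", "doc_456"]
print(recall(retrieved, relevant))            # 1.0
print(precision(retrieved, relevant))         # 0.5
print(reciprocal_rank(retrieved, relevant))   # 0.5
```

Answer accuracy is deliberately absent: it depends on human judgment and cannot be computed from document IDs alone.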

8.2 Building a test set

Create a JSON file with 100‑200 entries or more, each containing question, ground_truth, and relevant_docs (a list of document IDs). Example:

[
  {
    "question": "How do I configure the database connection pool in Spring Boot?",
    "ground_truth": "Set the spring.datasource.hikari.* parameters in application.yml",
    "relevant_docs": ["doc_123", "doc_456"]
  },
  ...
]

Run the full RAG pipeline on this set after each change and record the metrics. This quantitative feedback prevents “psychological” improvements.
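A minimal sketch of such an evaluation run is shown below. It assumes the retrieval step is exposed as a callable `retrieve(question, k)` returning document IDs best first; that signature, the inline test set and the canned retriever are all placeholders for illustration.

```python
import json


def evaluate(test_set, retrieve, k=5):
    """Run `retrieve(question, k)` on every entry and average recall@k and MRR."""
    recalls, reciprocal_ranks = [], []
    for entry in test_set:
        relevant = set(entry["relevant_docs"])
        retrieved = retrieve(entry["question"], k)[:k]
        hits = {doc_id for doc_id in retrieved if doc_id in relevant}
        recalls.append(len(hits) / len(relevant) if relevant else 0.0)
        first = next((i for i, d in enumerate(retrieved, start=1) if d in relevant), None)
        reciprocal_ranks.append(1.0 / first if first else 0.0)
    n = len(test_set)
    return {
        "recall@k": sum(recalls) / n if n else 0.0,
        "mrr": sum(reciprocal_ranks) / n if n else 0.0,
    }


# Inline JSON stands in for the test-set file; load a real one with json.load().
test_set = json.loads("""
[
  {"question": "q1", "ground_truth": "a1", "relevant_docs": ["doc_1", "doc_2"]},
  {"question": "q2", "ground_truth": "a2", "relevant_docs": ["doc_3"]}
]
""")

# Fake retriever with canned results, only to make the sketch runnable.
canned = {"q1": ["doc_2", "doc_9", "doc_1"], "q2": ["doc_8", "doc_3"]}


def fake_retrieve(question, k):
    return canned[question][:k]


print(evaluate(test_set, fake_retrieve, k=3))  # {'recall@k': 1.0, 'mrr': 0.75}
```

Storing the returned dictionary alongside a description of each change gives a simple history of which optimisation moved which metric.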

8.3 Continuous improvement checklist

Log bad cases and annotate why they failed.

Enrich the knowledge base with missing information.

Adjust chunking strategy if a particular chunk causes errors.

Refine prompt templates based on observed failures.

9. Practical iteration order

Document cleaning – OCR, table extraction, metadata attachment (biggest ROI, often >50% gain).

Chunking – switch to recursive splitting with 20‑30% overlap; tune chunk_size to 500‑1000.

Hybrid retrieval – add BM25 keyword layer; expect 15‑20% recall boost.

Reranking – apply a cross‑encoder on the top‑k candidates; adds ~10% answer accuracy.

Prompt polishing – source IDs, few‑shot examples, “don’t answer if unknown” rule.

If further gains are needed, experiment with Self‑Query Retriever , HyDE , window retrieval and domain‑specific indexes .

Remember to keep a labelled test set and track the four metrics after each iteration – that’s the only reliable way to know you truly improved the system.


