How I Doubled RAG Accuracy with Targeted Optimizations

This article walks through a comprehensive, step‑by‑step analysis of why RAG pipelines often underperform and presents concrete optimizations—including OCR preprocessing, table extraction, metadata enrichment, recursive chunking, embedding fine‑tuning, hybrid vector‑keyword retrieval, reranking, prompt templates, and a production‑grade Java implementation—backed by code snippets, benchmark figures, and evaluation metrics.

Su San Talks Tech

Preface

Many practitioners assume that a RAG pipeline is simply "split documents → embed → feed to a large model" and stop there. In practice the accuracy often falls far short of expectations, leaving users confused.

1. Where RAG Goes Wrong

Any failure in a single stage—document parsing, chunking, embedding, retrieval, reranking, or prompt construction—can collapse the whole system. Teams frequently waste effort swapping embedding models and see less than 5% improvement because the real bottleneck lies elsewhere.

2. Document Parsing

Feeding "clean data" versus "garbage" makes a huge difference. Real‑world corpora contain scanned PDFs, Word tables, PPT slides, and documents with headers/footers that break naïve parsers.

Scanned PDFs: require OCR; without it no text is extracted.

Word tables: plain‑text conversion loses row/column relationships.

PPT slides: bullet points become fragmented chunks.

Headers/footers: introduce meaningless tokens like "Page 3 of 10".
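The header/footer problem in the list above can be reduced before chunking. The following is a minimal sketch, not a complete parser: it assumes you already have one text string per page, and the `min_ratio` threshold and the page-number pattern are illustrative choices rather than anything prescribed by the article.

```python
import re
from collections import Counter

# Matches lines such as "Page 3 of 10", "3 / 10" or a bare page number.
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s*(of|/)\s*\d+)?\s*$", re.IGNORECASE)

def strip_headers_footers(pages, min_ratio=0.6):
    """Drop page-number lines and lines that repeat on most pages.

    pages: list of page texts. min_ratio is a hypothetical threshold: a line
    seen on at least this fraction of pages is treated as a header/footer.
    """
    page_lines = [[ln.strip() for ln in p.splitlines() if ln.strip()] for p in pages]
    # Count each distinct line once per page
    counts = Counter(ln for lines in page_lines for ln in set(lines))
    threshold = max(2, int(len(pages) * min_ratio))
    repeated = {ln for ln, c in counts.items() if c >= threshold}
    cleaned = []
    for lines in page_lines:
        kept = [ln for ln in lines if ln not in repeated and not PAGE_NUMBER.match(ln)]
        cleaned.append("\n".join(kept))
    return cleaned
```

Note that the page-number pattern also drops lines consisting only of a number, so it may need tightening for documents with numeric-only body lines.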

2.1 Solution 1 – OCR First

Run OCR on scanned PDFs. PaddleOCR yields better Chinese results than Tesseract.

import fitz  # PyMuPDF
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang='ch')

def pdf_to_text_with_ocr(pdf_path):
    doc = fitz.open(pdf_path)
    full_text = []
    for page in doc:
        text = page.get_text()
        if len(text.strip()) < 50:  # likely a scan
            pix = page.get_pixmap()
            img_path = f"temp_page_{page.number}.png"
            pix.save(img_path)
            result = ocr.ocr(img_path, cls=True)
            page_text = ''
            for line in result[0] or []:  # result[0] is None when OCR finds no text
                page_text += line[1][0] + '\n'
            full_text.append(page_text)
        else:
            full_text.append(text)
    return '\n'.join(full_text)

2.2 Solution 2 – Table Extraction

Convert tables to Markdown instead of plain text so the model can understand the structure.

import pandas as pd
from docx import Document

def extract_tables_from_docx(docx_path):
    doc = Document(docx_path)
    tables_markdown = []
    for table in doc.tables:
        data = []
        for row in table.rows:
            row_data = [cell.text for cell in row.cells]
            data.append(row_data)
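        # First row is treated as the header; single-row tables get default column names
        df = pd.DataFrame(data[1:], columns=data[0]) if len(data) > 1 else pd.DataFrame(data)
        tables_markdown.append(df.to_markdown(index=False))  # to_markdown needs the tabulate package
    return '\n\n'.join(tables_markdown)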

2.3 Solution 3 – Metadata Enrichment

Attach source file name, section title, and creation date to each chunk. This enables source‑based filtering during retrieval.

chunk_metadata = {
    "source_file": "2025_sales_report.pdf",
    "section": "Chapter 3 – East China Sales",
    "page": 23,
    "created_at": "2025-12-01"
}
Avoid overloading metadata: keep 2‑3 key fields. If metadata is prepended to the chunk text before embedding, too many fields dilute the chunk's own content in the vector.
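One way to keep metadata useful without bloating the embedded text is to store all fields for filtering but prefix only a couple of them to the text that gets embedded. This is a sketch under that assumption; the `Chunk` class and both helper names are hypothetical and not part of any library.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def text_for_embedding(chunk, keys=("source_file", "section")):
    """Prefix only a few key fields; the remaining metadata stays filter-only."""
    header = " | ".join(str(chunk.metadata[k]) for k in keys if k in chunk.metadata)
    return f"{header}\n{chunk.text}" if header else chunk.text

def filter_by_source(chunks, source_file):
    """Source-based filtering: keep only chunks that came from one file."""
    return [c for c in chunks if c.metadata.get("source_file") == source_file]
```

In a real vector store the filter would usually be pushed down to the store's own metadata filter rather than applied in Python.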

3. Document Splitting

Chunking is the "Achilles' heel" of RAG. Too small loses context; too large hurts relevance and token budget.

3.1 Splitting Strategies Comparison

Fixed length : simple but may cut sentences in half.

Sentence split : preserves sentence integrity but may produce tiny chunks.

Paragraph split : keeps semantic units, but long paragraphs hurt retrieval.

Recursive split : hierarchical (paragraph → sentence → fixed length) and balances flexibility with semantics.

Semantic split : uses embedding similarity to decide split points; accurate but costly.
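To make the "paragraph → sentence → fixed length" idea behind recursive splitting concrete, here is a simplified plain-Python sketch. It is not LangChain's implementation: it measures size in characters, ignores overlap, and the default separators are an assumption.

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", "。", " ")):
    """Split on the coarsest separator first; fall back to finer ones, then to fixed length."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: fixed-length cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], ""
    for part in text.split(sep):
        candidate = f"{buffer}{sep}{part}" if buffer else part
        if len(candidate) <= chunk_size:
            buffer = candidate  # still fits: keep merging neighbouring parts
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(part) <= chunk_size:
            buffer = part
        else:
            # This part alone is too long: retry with the next, finer separator
            chunks.extend(recursive_split(part, chunk_size, finer))
    if buffer:
        chunks.append(buffer)
    return chunks
```

Short neighbouring paragraphs are merged into one chunk, while an oversized paragraph is broken down further, which is the balance the comparison above describes.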

3.2 Recommended Approaches

Solution 1 – RecursiveCharacterTextSplitter (LangChain) is the current best practice.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=200,
    separators=["\n\n", "\n", "。", "!", "?", ";", ",", " ", ""]
)
chunks = splitter.split_text(long_document)

Chunk overlap is critical. For example, if a paragraph on Spring transaction propagation is cut without overlap, the chunk that starts at REQUIRES_NEW no longer carries the definition of REQUIRED that it is being contrasted with.

Empirical rule: chunk_size 500‑1000, overlap 20‑30% of chunk_size.
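The arithmetic behind the rule is simple: each new chunk starts `chunk_size - overlap` characters after the previous one. A minimal character-based sketch (the function name is hypothetical, and real splitters also respect separators):

```python
def sliding_chunks(text, chunk_size=800, overlap=200):
    """Fixed-size chunks where each chunk repeats the last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    stride = chunk_size - overlap  # 800 - 200 = 600 with the defaults
    # Stop once the remaining tail is already covered by the previous chunk
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), stride)]
```

With the defaults, a 2,000-character document yields three chunks starting at offsets 0, 600, and 1200, and the overlap is 25% of the chunk size.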

Solution 2 – MarkdownHeaderTextSplitter works well for structured markdown documents, preserving the heading hierarchy as metadata.

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [("#", "h1"), ("##", "h2"), ("###", "h3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
splits = splitter.split_text(markdown_text)
Avoid one‑size‑fits‑all chunk size: use different strategies per document type (FAQ → question split, long article → recursive, codebase → function split).
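For the "codebase → function split" case, a minimal sketch for Python sources can rely on the standard `ast` module. It assumes Python 3.8+ (for `end_lineno`) and only keeps top-level functions and classes; module-level statements and decorators are left out for brevity.

```python
import ast

def split_python_by_function(source):
    """Return one chunk per top-level function or class, with its name as metadata."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            body = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append({"name": node.name, "text": body})
    return chunks
```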

4. Embedding (Vectorisation)

Swapping models (e.g., text-embedding-ada-002 → bge-large-zh → m3e) yields limited gains because generic embeddings do not understand domain‑specific terminology.

Key capabilities of an embedding model:

Language coverage : Chinese, English, multilingual.

Sentence length : some models excel at short texts (<256 tokens), others handle up to 8192 tokens.

If your corpus is full of domain terms like SKU, PO, BOM, generic embeddings will be ineffective.

4.1 Effective Approach – Fine‑tune the Embedding Model

Collect 300‑1000 labeled sentence pairs (positive = semantically similar, negative = different) and fine‑tune with sentence‑transformers. In our tests, fine‑tuning bge-large-zh on 300 pairs raised domain retrieval accuracy from 68% to 82%.

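# Minimal fine‑tuning sketch with sentence‑transformers (classic fit API)
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

train_data = [
    ("用户要求退款", "客户申请退货", 1.0),  # positive: refund request vs. return request
    ("用户要求退款", "系统维护公告", 0.0),  # negative: refund request vs. maintenance notice
    # ... more pairs
]
model = SentenceTransformer("BAAI/bge-large-zh")
examples = [InputExample(texts=[a, b], label=label) for a, b, label in train_data]
loader = DataLoader(examples, shuffle=True, batch_size=16)
model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))], epochs=1)
model.save("bge-large-zh-domain")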

4.2 Hybrid Retrieval – Keyword + Vector

Pure vector search struggles with exact identifiers such as PO‑202400123. Combine BM25 (or Elasticsearch) keyword search with vector search, then merge the candidates.

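import jieba  # Chinese word segmentation; str.split() would leave a Chinese sentence as a single token
from rank_bm25 import BM25Okapi

tokenized_corpus = [jieba.lcut(doc) for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "订单PO-202400123的物流状态"
bm25_results = bm25.get_top_n(jieba.lcut(query), corpus, n=20)  # list of str
vector_results = [d.page_content for d in vector_store.similarity_search(query, k=20)]
# Merge on text and keep first-seen order; set() would scramble the ranking
final_candidates = list(dict.fromkeys(bm25_results + vector_results))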

Hybrid retrieval typically improves accuracy by 15‑20 percentage points on queries containing identifiers, numbers, or code snippets.
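Simply unioning the two result lists discards rank information. A common alternative, which is a different technique from the merge shown in this section, is Reciprocal Rank Fusion (RRF). The sketch below works on lists of document IDs; the function name and the `top_n` default are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, top_n=20):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant commonly used for RRF.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in fused[:top_n]]
```

Documents that rank well in both the keyword list and the vector list rise to the top, without having to normalise BM25 scores against cosine similarities.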

4.3 Query Rewriting

Rewrite noisy user questions into a more "document‑friendly" form before retrieval.

def rewrite_query(original_query):
    prompt = f"""
    将用户的自然语言问题改写成适合检索的形式,要求:
    1. 提取关键实体(如订单号、产品名、时间范围)
    2. 用陈述句表达
    3. 去掉语气词和无关信息
    原问题:{original_query}
    改写结果:
    """
    return llm.invoke(prompt)

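Example: "上个月那个退货单咋还没处理啊" ("Why hasn't that return order from last month been processed yet?") → "2025年5月退货单处理状态" ("Return order processing status, May 2025"). Resolving "last month" to a concrete date only works if the prompt also tells the model the current date.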

5. Retrieval and Rerank

Recall quantity matters less than ranking quality. Even if you retrieve 20 chunks, only the top few should be relevant; otherwise the LLM gets noisy context.

5.1 Why Rerank?

Vector similarity orders by embedding distance, not by actual relevance to the user query. A cross‑encoder reranker (e.g., BAAI/bge-reranker-v2-m3) re‑scores the retrieved list.

5.2 Mainstream Rerank Models

Cohere Rerank : best performance, paid API.

BGE‑Reranker : open‑source, can be deployed locally.

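Cross‑Encoder : the general architecture (BGE‑Reranker is one example); it scores each query‑document pair jointly, which gives the highest precision but is slower.

ColBERT : late‑interaction model that balances latency and accuracy for large‑scale retrieval.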

5.3 Code Example

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('BAAI/bge-reranker-v2-m3')

def rerank_results(query, candidates, top_k=5):
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)
    scored = list(zip(candidates, scores))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [doc for doc, score in scored[:top_k]]

retrieved_docs = vector_store.similarity_search(query, k=20)
reranked_docs = rerank_results(query, retrieved_docs, top_k=5)
Performance tip: only rerank the top‑k (e.g., 20‑50) candidates; reranking the entire corpus is prohibitively slow.

6. Context Construction

The prompt you feed the LLM determines the final answer quality.

6.1 Basic vs Advanced Prompt Templates

Simple concatenation works but yields mediocre results. Better prompts include source identifiers, few‑shot examples, and a "don't hallucinate" rule.

# Basic prompt
prompt = f"""
根据以下资料回答问题:
{''.join(chunks)}
问题:{query}
"""
# Advanced prompt with citations
prompt = f"""
请根据以下参考资料回答用户问题。每个参考资料都有编号[1]、[2]等。回答时请引用来源编号。

参考资料:
[1] 来自《MySQL性能优化指南》第3章:索引设计原则...
[2] 来自公司内部Wiki《订单系统设计文档》:订单表建表语句...

用户问题:{query}

请给出答案,并在每个关键信息后面标注来源(例如[1])。
"""
# Add "don't know" rule
prompt = f"""
重要规则:如果参考资料中没有明确的信息,请直接说“根据现有资料无法回答该问题”,不要编造答案。

参考资料:{chunks}

问题:{query}

答案:
"""
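The citation prompt above hard-codes its two references. A sketch of building the numbered reference list from retrieved chunks follows; it assumes each chunk is a dict with a `text` key and optional `source` and `section` keys (a hypothetical shape), and it uses an English prompt for readability.

```python
def build_cited_prompt(query, chunks):
    """Number each chunk so the model can cite it as [1], [2], ..."""
    refs = []
    for i, chunk in enumerate(chunks, start=1):
        origin = chunk.get("source", "unknown source")
        if chunk.get("section"):
            origin += f", {chunk['section']}"
        refs.append(f"[{i}] From {origin}: {chunk['text']}")
    references = "\n".join(refs)
    return (
        "Answer the question using only the numbered references below. "
        "Cite the reference number after each key fact, e.g. [1]. "
        "If the references do not contain the answer, say so instead of guessing.\n\n"
        f"References:\n{references}\n\n"
        f"Question: {query}\n\nAnswer:"
    )
```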

6.2 Handling Token Overflow

If the combined token count exceeds the model's context window (e.g., 4096 tokens), truncate or summarize.

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
max_prompt_tokens = 3500  # reserve space for answer
selected_chunks = []
total_tokens = 0
for chunk in reranked_chunks:
    tokens = len(enc.encode(chunk.page_content))
    if total_tokens + tokens > max_prompt_tokens:
        break
    selected_chunks.append(chunk)
    total_tokens += tokens

def summarize_chunk(chunk_text):
    # Fallback for a chunk that is too long: ask the LLM for a one‑sentence summary
    summary_prompt = f"请用一句话概括以下内容:\n{chunk_text}"
    return llm.invoke(summary_prompt)
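Truncation and summarisation can be combined in one pass. The sketch below is one possible policy, not the only one: it takes the token counter and the summariser as parameters (both hypothetical callables, e.g. a tiktoken-based counter and an LLM call), summarises a chunk that does not fit, and skips it if even the summary is too long.

```python
def fit_to_budget(chunks, count_tokens, summarize, max_tokens=3500):
    """Keep chunks in rank order within a token budget.

    chunks: list of chunk texts, best first.
    count_tokens: callable str -> int.
    summarize: callable str -> str, used only when a chunk does not fit.
    """
    selected, used = [], 0
    for text in chunks:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            text = summarize(text)
            cost = count_tokens(text)
            if used + cost > max_tokens:
                continue  # even the summary does not fit; try the next chunk
        selected.append(text)
        used += cost
    return selected
```

Unlike a hard `break`, this lets a short lower-ranked chunk still use the remaining budget.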

6.3 Multi‑turn Conversation

Include recent dialogue history in both the query rewrite step and the final prompt.

def build_context_with_history(query, history, retrieved_chunks):
    history_str = "\n".join(
        [f"用户:{h['user']}\n助手:{h['assistant']}" for h in history[-3:]]
    )
    prompt = f"""
对话历史:
{history_str}

当前用户问题:{query}

参考资料:
{retrieved_chunks}

请结合对话历史和参考资料回答问题。
"""
    return prompt

7. Advanced Techniques

7.1 Self‑Query Retriever

Let the LLM parse the user question into a structured query with filters (e.g., year, region) and then retrieve with metadata.

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

metadata_fields = [
    AttributeInfo(name="year", description="文档所属年份", type="integer"),
    AttributeInfo(name="region", description="区域:华东/华南/华北", type="string"),
    AttributeInfo(name="doc_type", description="文档类型:销售报告/技术文档", type="string")
]
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vector_store,
    document_contents="公司销售数据和技术文档",
    metadata_field_info=metadata_fields,
)

Query "2025年华东区的销售数据" ("2025 sales data for East China") becomes query="销售数据" ("sales data") with filter {"year": 2025, "region": "华东"}.
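What the structured filter does at retrieval time can be illustrated without any library: candidates are first narrowed by exact metadata match, then ranked. This is a plain-Python sketch with hypothetical helper names; `score` stands in for vector similarity, and a real vector store applies the filter internally.

```python
def matches(metadata, filters):
    """True when every filter key has exactly the requested value."""
    return all(metadata.get(key) == value for key, value in filters.items())

def filtered_search(docs, filters, score, k=5):
    """docs: list of (text, metadata) pairs; score: callable text -> float."""
    candidates = [(text, meta) for text, meta in docs if matches(meta, filters)]
    candidates.sort(key=lambda item: score(item[0]), reverse=True)
    return candidates[:k]
```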

7.2 Multi‑Path Retrieval & Fusion

Combine vector search, BM25, and even direct DB lookup, then deduplicate before reranking.

import re

def multi_path_retrieve(query):
    results = []
    # Path 1: vector
    results.extend(vector_store.similarity_search(query, k=10))
    # Path 2: BM25 (simulated)
    results.extend(keyword_search(query, k=10))
    # Path 3: direct DB lookup when the whole query is an order ID such as PO-123456789
    if re.fullmatch(r'PO-\d{9}', query):
        # The regex limits the input to a fixed format; prefer a parameterised
        # query anyway if your DB client supports one.
        db_result = db.query(f"SELECT * FROM orders WHERE order_id='{query}'")
        results.append(db_result)
    # Deduplicate & rerank
    return deduplicate_and_rerank(results)
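The `deduplicate_and_rerank` helper is not defined in the article. As a sketch of its deduplication half, the function below drops results whose normalised text has already been seen; the name and the normalisation (whitespace collapse plus lowercasing) are assumptions, and the surviving list would then go to a reranker.

```python
import hashlib

def deduplicate(docs, get_text=lambda d: d):
    """Keep the first occurrence of each distinct text, preserving order.

    get_text extracts the text from a result, e.g. lambda d: d.page_content.
    """
    seen, unique = set(), []
    for doc in docs:
        normalized = " ".join(get_text(doc).split()).lower()
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```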

7.3 HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer, embed it, and use it to retrieve real documents. Works best for very short queries.

def hyde_retrieve(query):
    hypothetical_doc = llm.invoke(f"请回答以下问题,写一段详细的答案:\n{query}")
    real_docs = vector_store.similarity_search(hypothetical_doc, k=10)
    return real_docs
Warning: HyDE can introduce hallucinations; enable only for queries shorter than five words.

7.4 Window Retrieval

When a relevant chunk is found, also fetch its surrounding N chunks to preserve context.

def retrieve_with_window(chunk_id, chunks_list, window_size=2):
    start = max(0, chunk_id - window_size)
    end = min(len(chunks_list), chunk_id + window_size + 1)
    return chunks_list[start:end]
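When several hits fall close together, their windows overlap and the same neighbouring chunk would be sent to the LLM twice. A small sketch that merges the windows first (the function name is hypothetical):

```python
def merged_windows(hit_ids, total_chunks, window_size=2):
    """Return sorted (start, end) index ranges, end exclusive, with overlapping windows merged."""
    ranges = sorted(
        (max(0, i - window_size), min(total_chunks, i + window_size + 1)) for i in hit_ids
    )
    merged = []
    for start, end in ranges:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous window: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Each resulting range can then be sliced from the chunk list as `chunks_list[start:end]`.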

7.5 Domain‑Specific Indexes

Separate indexes per domain (technical docs, customer service, product manuals) and route queries based on intent classification.

def route_query(query):
    intent = classify_intent(query)  # returns 'tech_doc', 'customer_service', or 'product_manual'
    if intent == 'tech_doc':
        return tech_vectorstore.similarity_search(query)
    elif intent == 'customer_service':
        return service_vectorstore.similarity_search(query)
    else:
        return product_vectorstore.similarity_search(query)
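`classify_intent` is left undefined above. In practice it would likely be an LLM call or a small trained classifier; the keyword-count stand-in below is only a sketch to show the contract (query in, one of three labels out), and the keyword lists are invented for illustration.

```python
# Hypothetical keyword lists; replace with a trained classifier or an LLM call.
INTENT_KEYWORDS = {
    "tech_doc": ("api", "error", "deploy", "config"),
    "customer_service": ("refund", "return", "complaint", "delivery"),
}

def classify_intent(query, default="product_manual"):
    """Pick the intent whose keywords occur most often; fall back to the default."""
    text = query.lower()
    hits = {
        intent: sum(word in text for word in words)
        for intent, words in INTENT_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else default
```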

8. Production‑Grade Java Implementation

The following Spring AI + Chroma example shows a full RAG pipeline in Java, covering multi‑path retrieval, reranking, metadata handling, and prompt construction.

package com.example.rag;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.ai.vectorstore.filter.Filter;
import org.springframework.ai.vectorstore.filter.FilterExpressionBuilder;
import org.springframework.stereotype.Service;
import java.util.*;

// Type and method names follow the Spring AI 1.0 milestone API (ChatClient.Builder,
// EmbeddingModel, SearchRequest.query(...)); adjust them to the version you run.
@Service
public class RAGPipeline {
    private final VectorStore vectorStore;
    private final EmbeddingModel embeddingModel;
    private final ChatClient chatClient;
    private final RerankerService reranker;
    private final QueryRewriter queryRewriter;

    public RAGPipeline(VectorStore vectorStore, EmbeddingModel embeddingModel,
                       ChatClient.Builder chatBuilder, RerankerService reranker,
                       QueryRewriter queryRewriter) {
        this.vectorStore = vectorStore;
        this.embeddingModel = embeddingModel;
        this.chatClient = chatBuilder.build();
        this.reranker = reranker;
        this.queryRewriter = queryRewriter;
    }

    public String ask(String userQuestion) {
        return askWithHistory(userQuestion, Collections.emptyList());
    }

    public String askWithHistory(String userQuestion, List<Map<String, String>> history) {
        // 1. Query rewrite (including history)
        String rewrittenQuery = queryRewriter.rewriteWithHistory(userQuestion, history);
        // 2. Multi‑path retrieval
        List<Document> retrieved = multiPathRetrieve(rewrittenQuery);
        // 3. Rerank
        List<Document> reranked = reranker.rerank(rewrittenQuery, retrieved, 5);
        // 4. Build prompt with metadata
        String prompt = buildPrompt(reranked, userQuestion, history);
        // 5. Call LLM
        return chatClient.prompt(prompt).call().content();
    }

    private List<Document> multiPathRetrieve(String query) {
        Set<String> seenIds = new HashSet<>();
        List<Document> all = new ArrayList<>();
        // Path 1: vector
        List<Document> vecResults = vectorStore.similaritySearch(SearchRequest.query(query).withTopK(20));
        for (Document doc : vecResults) {
            if (seenIds.add(doc.getId())) all.add(doc);
        }
        // Path 2: keyword (simulated)
        List<Document> kwResults = keywordSearch(query, 20);
        for (Document doc : kwResults) {
            if (seenIds.add(doc.getId())) all.add(doc);
        }
        // Path 3: numeric ID shortcut
        if (query.matches("\\d+")) {
            Filter.Expression filter = new FilterExpressionBuilder().eq("id", query).build();
            List<Document> idResults = vectorStore.similaritySearch(
                SearchRequest.query(query).withFilterExpression(filter).withTopK(10));
            for (Document doc : idResults) {
                if (seenIds.add(doc.getId())) all.add(doc);
            }
        }
        return all;
    }

    private String buildPrompt(List<Document> docs, String question, List<Map<String, String>> history) {
        StringBuilder context = new StringBuilder();
        int idx = 1;
        for (Document doc : docs) {
            // Metadata values are Objects, so convert them explicitly
            String source = String.valueOf(doc.getMetadata().getOrDefault("source", "未知来源"));
            String page = String.valueOf(doc.getMetadata().getOrDefault("page", ""));
            context.append(String.format("[%d] 来源:%s", idx++, source));
            if (!page.isEmpty()) context.append(" 第" + page + "页");
            context.append("\n").append(doc.getContent()).append("\n\n");
        }
        StringBuilder historyStr = new StringBuilder();
        if (!history.isEmpty()) {
            historyStr.append("对话历史:\n");
            for (Map<String, String> turn : history) {
                historyStr.append("用户:").append(turn.get("user")).append("\n");
                historyStr.append("助手:").append(turn.get("assistant")).append("\n");
            }
            historyStr.append("\n");
        }
        return String.format(
            "你是一个专业的问答助手。请基于以下参考资料回答用户问题。\n\n" +
            "重要规则:\n" +
            "1. 如果参考资料中没有明确信息,请直接回答\"根据现有资料无法回答该问题\"。\n" +
            "2. 回答时请在每个关键信息后标注来源编号,例如[1]。\n" +
            "3. 答案要准确、简洁、有条理。\n\n" +
            "%s\n参考资料:\n%s\n用户问题:%s\n\n答案:\n",
            historyStr.toString(), context.toString(), question);
    }

    private List<Document> keywordSearch(String query, int topK) {
        // Simplified placeholder – replace with Elasticsearch/Lucene in production
        return new ArrayList<>();
    }
}

The companion QueryRewriter component handles both simple cleaning and history‑aware rewriting.

@Component
public class QueryRewriter {
    private final ChatClient chatClient;

    public QueryRewriter(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    public String rewriteWithHistory(String query, List<Map<String, String>> history) {
        if (history.isEmpty()) return rewrite(query);
        String historyText = history.stream()
            .map(turn -> "用户:" + turn.get("user") + "\n助手:" + turn.get("assistant"))
            .collect(Collectors.joining("\n"));
        String prompt = String.format(
            "对话历史:\n%s\n最新用户问题:%s\n请把历史信息和最新问题融合成一个完整的查询语句,只输出改写后的查询。",
            historyText, query);
        return chatClient.prompt(prompt).call().content();
    }

    public String rewrite(String query) {
        // Simple rule‑based cleaning
        return query.replaceAll("啊|呢|吧|哦|呀", "").trim();
    }
}

This Java service demonstrates multi‑path retrieval, reranking, metadata‑aware prompting, and query rewriting—all ready to be integrated into a Spring Boot application.

9. Evaluation

9.1 Metrics

Recall : retrieved relevant docs / total relevant docs.

Precision : retrieved relevant docs / total retrieved docs.

MRR (Mean Reciprocal Rank) : average of 1 / rank of first correct answer.

Answer Accuracy : human‑judged correctness of the final answer.
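The three retrieval metrics above can be computed in a few lines. The sketch below assumes retrieved results and ground truth are both given as lists of document IDs; the function names are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are relevant."""
    relevant = set(relevant)
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(doc in relevant for doc in top) / len(top)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```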

9.2 Building a Test Set

Create a labeled dataset of 100‑200 queries with ground‑truth answers and relevant document IDs.

[
  {
    "question": "如何配置Spring Boot的数据库连接池?",
    "ground_truth": "在application.yml中设置spring.datasource.hikari.*相关参数",
    "relevant_docs": ["doc_123", "doc_456"]
  }
]

Run the pipeline on this set after each change and compare metric shifts to ensure real improvement.
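A regression run over such a test set might look like the sketch below. It assumes the JSON structure shown in this section, that every case lists at least one relevant document, and that `retrieve` is your pipeline's retrieval step returning a ranked list of document IDs (a hypothetical interface). The inline test set is shortened placeholder data.

```python
import json

# Placeholder data in the same shape as the test set described above
TEST_SET = json.loads("""
[
  {"question": "q1", "relevant_docs": ["doc_123", "doc_456"]},
  {"question": "q2", "relevant_docs": ["doc_9"]}
]
""")

def evaluate(test_cases, retrieve, k=5):
    """Average recall@k and MRR over all test cases."""
    recalls, reciprocal_ranks = [], []
    for case in test_cases:
        relevant = set(case["relevant_docs"])
        ranked = retrieve(case["question"])
        recalls.append(len(set(ranked[:k]) & relevant) / len(relevant))
        first_hit = next((i for i, d in enumerate(ranked, start=1) if d in relevant), None)
        reciprocal_ranks.append(1.0 / first_hit if first_hit else 0.0)
    n = len(test_cases)
    return {"recall@k": sum(recalls) / n, "mrr": sum(reciprocal_ranks) / n}
```

Running `evaluate` before and after a change, with the same test set and the same `k`, gives the metric shift to compare.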

9.3 Continuous Iteration

Log bad cases where the system fails.

Enrich the knowledge base with missing information.

Adjust chunking strategy for problematic documents.

Refine prompt templates based on observed failures.

10. Conclusion

Improving RAG accuracy requires systematic work across the entire pipeline rather than focusing on a single component. Our recommended iteration order is:

Document cleaning (OCR, table extraction, metadata).

Optimise chunking (recursive split, appropriate overlap).

Introduce hybrid retrieval (vector + BM25).

Add reranking (cross‑encoder or open‑source BGE‑Reranker).

Polish prompts (source citations, few‑shot examples, "don't hallucinate" rule).

Optional advanced tricks (HyDE, Self‑Query, multi‑path fusion) for marginal gains.

Quantitative evaluation with a solid test set is essential; without data‑driven evidence, perceived improvements are meaningless. By following the detailed steps and code examples above, you can reliably lift RAG accuracy from the 60‑70% range to 85%+ and, with advanced techniques, even beyond 90%.

Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
