Boost RAG Accuracy to 94%: 11 Proven Strategies and How to Combine Them

After struggling with naive RAG that delivered only 60% accuracy, the author outlines eleven advanced strategies—including context-aware chunking, query expansion, re‑ranking, multi‑query, knowledge graphs, and agent‑based retrieval—that together raise performance to 94%, and provides detailed implementation examples, trade‑offs, and a step‑by‑step deployment roadmap.

dbaplus Community

Why Naive RAG Fails

Naive (plain) RAG splits documents into fixed‑size chunks, uses a single query perspective, lacks relevance filtering, and provides only limited context. This typically yields ~60% answer accuracy and many irrelevant or hallucinated responses.

Fixed‑size chunking cuts sentences in half, losing semantic flow.

Single query view misses alternative phrasings.

No relevance filter returns the nearest but not the most relevant matches.

Limited context: short fragments lack a complete picture.
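The fixed-size failure mode is easy to demonstrate: a naive character splitter cuts wherever the count lands, often mid-word, so no single chunk carries the complete statement. A minimal sketch (the sample text is illustrative):

```python
# Naive fixed-size splitting with no regard for semantic boundaries.
text = (
    "EBITDA excludes interest, taxes, depreciation, and amortization. "
    "It is widely used to compare operating profitability across firms."
)

def naive_chunks(text: str, size: int = 60) -> list[str]:
    """Split text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

for chunk in naive_chunks(text):
    print(repr(chunk))
# Each fragment ends wherever the character count happens to land,
# often mid-word, which degrades the embedding of both halves.
```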

Eleven Effective RAG Strategies

Strategy 1: Context‑Aware Chunking

Chunk documents by semantic boundaries and document structure (headings, tables, paragraphs) instead of raw character count.

from docling.chunking import HybridChunker
from transformers import AutoTokenizer

class SmartChunker:
    def __init__(self, max_tokens=512):
        self.tokenizer = AutoTokenizer.from_pretrained(
            "sentence-transformers/all-MiniLM-L6-v2"
        )
        self.chunker = HybridChunker(
            tokenizer=self.tokenizer,
            max_tokens=max_tokens,
            merge_peers=True  # merge tiny adjacent chunks
        )

    def chunk_document(self, document):
        # Preserve titles, tables, etc.
        chunks = list(self.chunker.chunk(dl_doc=document))
        contextualized = []
        for chunk in chunks:
            contextualized.append(self.chunker.contextualize(chunk=chunk))
        return contextualized

Pros: free, fast, retains hierarchy, works with any embedding model.

Cons: slightly slower than naive splitting; requires proper document parsing.

When to use: default strategy for most production pipelines.

Strategy 2: Contextual Retrieval

Before embedding, prepend each chunk with a short LLM‑generated summary that explains its relationship to the whole document.

async def enrich_chunk(chunk: str, document: str, title: str) -> str:
    """Add a contextual prefix using an LLM"""
    prompt = f"""
Title: {title}
{document[:4000]}
{chunk}

Provide brief context (1-2 sentences) explaining how this chunk relates to the full document. Format: "This chunk is from [title] and discusses [explanation]."
"""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=150,
    )
    context = response.choices[0].message.content.strip()
    return f"{context}\n\n{chunk}"

Pros: reduces retrieval failures by 35‑49% (Anthropic), makes chunks self‑contained, works for vector and keyword search.

Cons: adds an LLM call per chunk (higher cost, slower ingestion, larger index).

When to use: critical documents where accuracy outweighs cost (legal, medical, financial).

Strategy 3: Re‑ranking

Two‑stage retrieval: fast vector search returns 20‑50 candidates, then a cross‑encoder re‑scores them for higher relevance.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

async def search_with_reranking(query: str, limit: int = 5) -> list:
    # Stage 1: fast vector retrieval (4× candidates)
    candidate_limit = min(limit * 4, 20)
    query_embedding = await embedder.embed_query(query)
    candidates = await db.query(
        "SELECT content, metadata FROM chunks ORDER BY embedding <=> $1 LIMIT $2",
        query_embedding, candidate_limit,
    )
    # Stage 2: cross‑encoder re‑ranking
    pairs = [[query, row['content']] for row in candidates]
    scores = reranker.predict(pairs)
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:limit]
    return [doc for doc, _ in reranked]

Pros: significantly improves precision, captures more candidates without overwhelming the LLM, can fix vector‑search errors.

Cons: slower than pure vector search, requires extra compute.

When to use: when precision is more important than latency (research‑grade QA).

Strategy 4: Query Expansion

Use an LLM to turn a short user query into a richer, more detailed version.

async def expand_query(query: str) -> str:
    """Expand a terse query into a detailed version"""
    system_prompt = """You are a query‑expansion assistant. Expand the query by adding context, related terms, and clarification while preserving the original intent. Keep it a single coherent question. Expand 2‑3×."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Expand this query: {query}"},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content.strip()

Pros: improves retrieval accuracy, handles vague queries, fast single‑call expansion.

Cons: adds latency, may over‑specify simple queries, modest cost increase.

When to use: chatbots or search interfaces where users often ask short, ambiguous questions.

Strategy 5: Multi‑Query RAG

Generate 3‑4 paraphrases of the original question, search them in parallel, then deduplicate results.

async def search_with_multi_query(query: str, limit: int = 5) -> list:
    variations_prompt = f"""Generate 3 different phrasings for this query:

\"{query}\"

Return only the queries, one per line."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": variations_prompt}],
        temperature=0.7,
    )
    queries = [query] + response.choices[0].message.content.strip().split('\n')
    # Parallel search for all queries
    tasks = []
    for q in queries:
        qe = await embedder.embed_query(q)
        tasks.append(db.fetch("SELECT * FROM match_chunks($1::vector, $2)", qe, limit))
    results_lists = await asyncio.gather(*tasks)
    # Deduplicate by chunk_id, keep highest similarity
    seen = {}
    for results in results_lists:
        for row in results:
            cid = row['chunk_id']
            if cid not in seen or row['similarity'] > seen[cid]['similarity']:
                seen[cid] = row
    return sorted(seen.values(), key=lambda x: x['similarity'], reverse=True)[:limit]

Pros: higher recall for ambiguous queries, captures different perspectives, parallel execution stays fast.

Cons: up to 4× database queries, higher API cost, possible redundant results.

When to use: exploratory questions where multiple interpretations exist.

Strategy 6: Agent‑Based RAG

An AI agent equipped with several retrieval tools (vector search, full‑document fetch, SQL query) decides at runtime which tool to invoke.

from pydantic_ai import Agent

agent = Agent(
    'openai:gpt-4o',
    system_prompt='You are a RAG assistant with multiple retrieval tools. Choose the appropriate tool for each query.'
)

# tool_plain registers tools that take no RunContext argument
@agent.tool_plain
async def search_knowledge_base(query: str, limit: int = 5) -> str:
    qe = await embedder.embed_query(query)
    results = await db.match_chunks(qe, limit)
    return format_results(results)

@agent.tool_plain
async def retrieve_full_document(document_title: str) -> str:
    result = await db.query(
        "SELECT title, content FROM documents WHERE title ILIKE %s",
        f"%{document_title}%"
    )
    return f"**{result['title']}**\n\n{result['content']}"

@agent.tool_plain
async def sql_query(question: str) -> str:
    return execute_safe_sql(question)

Pros: highly flexible, can combine heterogeneous data sources, adapts to query complexity.

Cons: more complex to implement, behavior less predictable, higher latency due to multi‑step reasoning.

When to use: heterogeneous environments with documents, databases, and APIs.

Strategy 7: Self‑Reflective RAG

After an initial retrieval, the system grades relevance; if the score is low, it refines the query and searches again.

async def search_with_self_reflection(query: str, limit: int = 5, max_iterations: int = 2) -> dict:
    """Self‑correcting search loop"""
    for iteration in range(max_iterations):
        results = await vector_search(query, limit)
        grade_prompt = f"""Query: {query}
Retrieved docs: {format_docs_for_grading(results)}
Rate relevance 1‑5 (numeric only)."""
        grade_resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": grade_prompt}],
            temperature=0,
        )
        grade = int(grade_resp.choices[0].message.content.strip().split()[0])
        if grade >= 3:
            return {"results": results, "iterations": iteration + 1, "final_query": query}
        if iteration < max_iterations - 1:
            refine_prompt = f"""The query \"{query}\" returned low‑relevance results. Suggest an improved query that might retrieve better documents. Respond with only the new query."""
            refined_resp = await client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": refine_prompt}],
                temperature=0.5,
            )
            query = refined_resp.choices[0].message.content.strip()
    return {"results": results, "iterations": max_iterations, "final_query": query}

Pros: self‑correction, can recover from poor initial results.

Cons: highest latency (multiple LLM calls), most expensive, may still fail on extremely hard queries.

When to use: mission‑critical applications where answer correctness outweighs latency.

Strategy 8: Knowledge‑Graph Augmented Retrieval

Combine vector similarity with graph‑database traversal to capture explicit entity relationships.

from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType

graphiti = Graphiti("neo4j://localhost:7687", "neo4j", "password")

async def ingest_document(text: str, source: str):
    await graphiti.add_episode(
        name=source,
        episode_body=text,
        source=EpisodeType.text,
        source_description=f"Document: {source}"
    )

async def search_knowledge_graph(query: str) -> str:
    results = await graphiti.search(query=query, num_results=5)
    formatted = []
    for r in results:
        formatted.append(
            f"Entity: {r.node.name}\nType: {r.node.type}\nRelations: {r.relationships}"
        )
    return "\n---\n".join(formatted)

Pros: captures relationships missed by pure text similarity, reduces hallucinations, ideal for highly interconnected data.

Cons: requires Neo4j (or similar) infrastructure, setup is complex, slower and more costly, needs entity extraction.

When to use: domains where entity relationships are crucial (medical networks, financial systems, research databases).

Strategy 9: Hierarchical RAG

Create parent‑child chunk relationships; child chunks are searched for precision, parent chunks are returned for context.

def ingest_hierarchical(document: str, title: str):
    # Parent: large 2000‑char blocks
    parent_chunks = [document[i:i+2000] for i in range(0, len(document), 2000)]
    for pid, parent in enumerate(parent_chunks):
        metadata = {"heading": f"{title} - Part {pid}"}
        db.execute(
            "INSERT INTO parent_chunks (id, content, metadata) VALUES (%s, %s, %s)",
            (pid, parent, json.dumps(metadata))
        )
        # Child: smaller 500‑char blocks
        child_chunks = [parent[j:j+500] for j in range(0, len(parent), 500)]
        for child in child_chunks:
            embedding = get_embedding(child)
            db.execute(
                "INSERT INTO child_chunks (content, embedding, parent_id) VALUES (%s, %s, %s)",
                (child, embedding, pid)
            )

async def hierarchical_search(query: str) -> str:
    query_emb = get_embedding(query)
    results = await db.query(
        """SELECT p.content, p.metadata FROM child_chunks c
           JOIN parent_chunks p ON c.parent_id = p.id
           ORDER BY c.embedding <=> $1 LIMIT 3""",
        query_emb,
    )
    formatted = []
    for content, metadata in results:
        meta = json.loads(metadata)
        formatted.append(f"[{meta['heading']}]\n{content}")
    return "\n\n".join(formatted)

Pros: balances precision and context, reduces noise, natural for structured documents.

Cons: requires parent‑child indexing, more complex to maintain, needs careful hierarchy design.

When to use: technical manuals, legal contracts, research papers with clear sections.

Strategy 10: Late‑Chunking

Embed the entire document first, then pool token embeddings for each chunk, preserving full‑document context in every chunk embedding.

def late_chunk(text: str, chunk_size=512) -> list:
    """Chunk after full-document embedding.

    transformer_embed, tokenize, detokenize, and mean_pool are placeholders
    for a long-context embedding model's token-level API.
    """
    # Step 1: embed the whole document (up to the model's max tokens)
    full_doc_token_embeddings = transformer_embed(text)  # token-level embeddings
    # Step 2: define chunk boundaries
    tokens = tokenize(text)
    chunk_boundaries = range(0, len(tokens), chunk_size)
    # Step 3: pool token embeddings for each chunk
    chunks_with_embeddings = []
    for start in chunk_boundaries:
        end = start + chunk_size
        chunk_text = detokenize(tokens[start:end])
        chunk_embedding = mean_pool(full_doc_token_embeddings[start:end])
        chunks_with_embeddings.append((chunk_text, chunk_embedding))
    return chunks_with_embeddings

Pros: retains full‑document context, leverages long‑context models, better semantic understanding.

Cons: needs a long‑context embedding model, more complex implementation, limited by model’s max token count.

When to use: dense technical docs or contracts where every chunk needs the whole context.

Strategy 11: Domain‑Fine‑Tuned Embeddings

Train a sentence‑transformer on domain‑specific query‑document pairs to improve handling of specialized terminology.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def prepare_training_data():
    # sentence-transformers expects InputExample pairs, not raw tuples
    return [
        InputExample(texts=["What is EBITDA?", "EBITDA (Earnings Before Interest, Taxes, Depreciation, and Amortization) ..."]),
        InputExample(texts=["Explain capital expenditures", "Capital expenditures (CapEx) refer to ..."]),
        # ... thousands more pairs
    ]

def fine_tune_model():
    model = SentenceTransformer('all-MiniLM-L6-v2')
    train_examples = prepare_training_data()
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=3,
        warmup_steps=100,
    )
    model.save('./fine_tuned_model')
    return model

embedding_model = SentenceTransformer('./fine_tuned_model')

Pros: typically improves accuracy by 5‑10%, better handles domain jargon, smaller fine‑tuned models can outperform larger generic ones.

Cons: requires labeled training data, training time and resources, needs periodic re‑training as terminology evolves.

When to use: domains where generic embeddings perform poorly (medical, legal, finance, specialized tech).

Combining Strategies for Maximum Impact

Single strategies improve results, but combining them yields transformative gains. Three proven stacks are presented.

Stack 1 – Production‑Ready (Best Overall)

Context‑aware chunking

Re‑ranking

Query expansion

Agent‑based RAG

Achieves ~92% accuracy with ~1.2 s latency at roughly $0.003 per query.
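The Stack 1 flow (expand the query, over-fetch candidates, then re-rank) can be sketched end to end. The helpers below (`expand_query`, `vector_search`, `rerank`) are stand-ins; a real deployment wires in the LLM expansion from Strategy 4, a vector index, and the cross-encoder from Strategy 3:

```python
def expand_query(query: str) -> str:
    # Placeholder for the LLM expansion in Strategy 4.
    return f"{query} definition examples related terms"

def vector_search(query: str, limit: int) -> list[dict]:
    # Placeholder for a vector-index lookup; returns scored candidates.
    corpus = [
        {"content": "EBITDA measures operating profitability.", "similarity": 0.82},
        {"content": "CapEx refers to capital expenditures.", "similarity": 0.74},
        {"content": "Unrelated weather report.", "similarity": 0.41},
    ]
    return sorted(corpus, key=lambda c: c["similarity"], reverse=True)[:limit]

def rerank(query: str, candidates: list[dict], limit: int) -> list[dict]:
    # Placeholder for the cross-encoder in Strategy 3, which scores
    # each (query, document) pair jointly.
    return sorted(candidates, key=lambda c: c["similarity"], reverse=True)[:limit]

def stack1_retrieve(query: str, limit: int = 2) -> list[str]:
    expanded = expand_query(query)                   # Strategy 4
    candidates = vector_search(expanded, limit * 4)  # over-fetch candidates
    top = rerank(expanded, candidates, limit)        # Strategy 3
    return [c["content"] for c in top]

print(stack1_retrieve("What is EBITDA?"))
```

The agent layer (Strategy 6) would sit above `stack1_retrieve`, choosing it as one tool among several.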

Stack 2 – High‑Accuracy (Critical Applications)

Contextual retrieval

Multi‑query RAG

Re‑ranking

Self‑reflective RAG

Delivers ~96% accuracy, 2.5 s latency, cost ~ $0.008 per query – ideal for medical, legal, or financial use cases where errors are costly.

Stack 3 – Domain‑Expert (Specialized Fields)

Fine‑tuned embeddings

Contextual retrieval

Knowledge‑graph augmentation

Re‑ranking

Provides ~94% accuracy on domain queries with ~1.8 s latency at ~$0.005 per query after the initial training investment.

Implementation Roadmap

Adopt strategies incrementally.

Stage 1 – Foundations (Week 1): Replace fixed‑size chunking with context‑aware chunking, set up basic vector search with proper embeddings, benchmark baseline accuracy.

Stage 2 – Quick Wins (Weeks 2‑3): Add re‑ranking and query expansion, measure improvements.

Stage 3 – Advanced (Weeks 4‑6): Introduce multi‑query or agent‑based RAG, implement self‑reflection for critical queries, start fine‑tuning embeddings if needed.

Stage 4 – Specialization (Month 2+): Deploy contextual retrieval for high‑value docs, integrate knowledge graphs, continuously fine‑tune embeddings.

Common Pitfalls and How to Avoid Them

Using all strategies at once: leads to complexity, debugging difficulty, and high cost. Start with a minimal stack and iterate.

Skipping baseline measurements: without metrics you cannot prove improvements. Always record accuracy and latency before and after changes.

Fixed‑size chunking: destroys semantic flow. Adopt context‑aware or hierarchical chunking.

Ignoring re‑ranking: vector similarity ≠ relevance. Implement cross‑encoder re‑ranking early.

Neglecting query preprocessing: vague queries fail. At minimum add query expansion.

Relying on a single retrieval method: one size does not fit all. Use an agent or multi‑strategy approach for flexibility.
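For the baseline-measurement pitfall above, even a tiny harness suffices. A sketch, assuming a small hand-labeled eval set that pairs each query with the chunk id it should retrieve (`retrieve` is whichever pipeline variant is under test; all names here are illustrative):

```python
def hit_rate(eval_set: list[tuple[str, str]], retrieve, k: int = 5) -> float:
    """Fraction of queries whose expected chunk appears in the top-k results."""
    hits = sum(
        1 for query, expected_id in eval_set
        if expected_id in retrieve(query, k)
    )
    return hits / len(eval_set)

# Example with a stub retriever that always returns the same ids:
eval_set = [("q1", "c1"), ("q2", "c2"), ("q3", "c9")]
stub = lambda query, k: ["c1", "c2", "c3"]
print(hit_rate(eval_set, stub))  # 2 of 3 queries hit
```

Record this number for the naive pipeline first, then re-run after each change (re-ranking, query expansion, ...) so every improvement is proven rather than assumed.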

Future Trends in RAG

Smaller, faster models: next‑gen embedding models promise 10× speed with comparable accuracy.

Multimodal RAG: retrieval of images, tables, and charts alongside text for richer context.

Sparse retrieval learning: models like SPLADE combine neural and sparse representations for efficient search.

Conclusion

Building a production‑ready RAG system is less about using the flashiest components and more about systematically addressing the failure modes of naive RAG. Starting with context‑aware chunking and re‑ranking, then layering additional strategies as needed, yields a reliable pipeline that can be measured, tuned, and scaled.

References

GitHub repository – Advanced RAG Strategies: https://github.com/coleam00/ottomator-agents/tree/main/all-rag-strategies

Anthropic Contextual Retrieval research: https://www.anthropic.com/news/contextual-retrieval

Pinecone Re‑ranking guide: https://www.pinecone.io/learn/series/rag/rerankers/

Docling – Context‑aware chunking: https://github.com/DS4SD/docling

Graphiti – Knowledge‑graph RAG: https://github.com/getzep/graphiti

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
