Boost RAG Accuracy to 94%: 11 Proven Strategies and How to Combine Them
After struggling with naive RAG that delivered only 60% accuracy, the author outlines eleven advanced strategies—including context-aware chunking, query expansion, re‑ranking, multi‑query, knowledge graphs, and agent‑based retrieval—that together raise performance to 94%, and provides detailed implementation examples, trade‑offs, and a step‑by‑step deployment roadmap.
Why Naive RAG Fails
Naive (plain) RAG splits documents into fixed‑size chunks, uses a single query perspective, lacks relevance filtering, and provides only limited context. This typically yields ~60% answer accuracy and many irrelevant or hallucinated responses. A minimal sketch of this baseline follows the list below.
Fixed‑size chunking cuts sentences in half, losing semantic flow.
A single query perspective misses alternative phrasings.
Without a relevance filter, the nearest matches are returned even when they are not the most relevant.
Limited context: short fragments lack a complete picture.
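For contrast, here is a minimal sketch of the fixed‑size splitting this baseline relies on (the 500‑character size and 50‑character overlap are illustrative assumptions, not values from the article):

# Naive fixed-size chunking: splits on raw character count, so chunks
# routinely start or end mid-sentence and lose the surrounding structure.
def naive_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

Each chunk is then embedded and retrieved independently, which is exactly what the strategies below improve on.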
Eleven Effective RAG Strategies
Strategy 1: Context‑Aware Chunking
Chunk documents by semantic boundaries and document structure (headings, tables, paragraphs) instead of raw character count.
from docling.chunking import HybridChunker
from transformers import AutoTokenizer

class SmartChunker:
    def __init__(self, max_tokens=512):
        self.tokenizer = AutoTokenizer.from_pretrained(
            "sentence-transformers/all-MiniLM-L6-v2"
        )
        self.chunker = HybridChunker(
            tokenizer=self.tokenizer,
            max_tokens=max_tokens,
            merge_peers=True,  # merge tiny adjacent chunks
        )

    def chunk_document(self, document):
        # Preserve titles, tables, etc.
        chunks = list(self.chunker.chunk(dl_doc=document))
        contextualized = []
        for chunk in chunks:
            contextualized.append(self.chunker.contextualize(chunk=chunk))
        return contextualized

Pros: free, fast, retains hierarchy, works with any embedding model.
Cons: slightly slower than naive splitting; requires proper document parsing.
When to use: default strategy for most production pipelines.
Strategy 2: Contextual Retrieval
Before embedding, prepend each chunk with a short LLM‑generated summary that explains its relationship to the whole document.
from openai import AsyncOpenAI

client = AsyncOpenAI()  # shared async client used throughout the examples

async def enrich_chunk(chunk: str, document: str, title: str) -> str:
    """Add a contextual prefix using an LLM"""
    prompt = f"""
Title: {title}
{document[:4000]}
{chunk}
Provide brief context (1-2 sentences) explaining how this chunk relates to the full document. Format: "This chunk is from [title] and discusses [explanation]."
"""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=150,
    )
    context = response.choices[0].message.content.strip()
    return f"{context}\n{chunk}"

Pros: reduces retrieval failures by 35‑49% (Anthropic), makes chunks self‑contained, works for vector and keyword search.
Cons: adds an LLM call per chunk (higher cost, slower ingestion, larger index).
When to use: critical documents where accuracy outweighs cost (legal, medical, financial).
Strategy 3: Re‑ranking
Two‑stage retrieval: fast vector search returns 20‑50 candidates, then a cross‑encoder re‑scores them for higher relevance.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

async def search_with_reranking(query: str, limit: int = 5) -> list:
    # Stage 1: fast vector retrieval (4x candidates)
    candidate_limit = min(limit * 4, 20)
    query_embedding = await embedder.embed_query(query)
    candidates = await db.query(
        # pgvector-style cosine-distance ordering assumed for the vector column
        "SELECT content, metadata FROM chunks ORDER BY embedding <=> $1 LIMIT $2",
        query_embedding, candidate_limit,
    )
    # Stage 2: cross-encoder re-ranking
    pairs = [[query, row['content']] for row in candidates]
    scores = reranker.predict(pairs)
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:limit]
    return [doc for doc, _ in reranked]

Pros: significantly improves precision, captures more candidates without overwhelming the LLM, can fix vector‑search errors.
Cons: slower than pure vector search, requires extra compute.
When to use: when precision is more important than latency (research‑grade QA).
Strategy 4: Query Expansion
Use an LLM to turn a short user query into a richer, more detailed version.
async def expand_query(query: str) -> str:
    """Expand a terse query into a detailed version"""
    system_prompt = """You are a query-expansion assistant. Expand the query by adding context, related terms, and clarification while preserving the original intent. Keep it a single coherent question. Expand 2-3x."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Expand this query: {query}"},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content.strip()

Pros: improves retrieval accuracy, handles vague queries, fast single‑call expansion.
Cons: adds latency, may over‑specify simple queries, modest cost increase.
When to use: chatbots or search interfaces where users often ask short, ambiguous questions.
Strategy 5: Multi‑Query RAG
Generate 3‑4 paraphrases of the original question, search them in parallel, then deduplicate results.
import asyncio

async def search_with_multi_query(query: str, limit: int = 5) -> list:
    variations_prompt = f"""Generate 3 different phrasings for this query:
"{query}"
Return only the queries, one per line."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": variations_prompt}],
        temperature=0.7,
    )
    queries = [query] + response.choices[0].message.content.strip().split('\n')
    # Parallel search for all queries
    tasks = []
    for q in queries:
        qe = await embedder.embed_query(q)
        tasks.append(db.fetch("SELECT * FROM match_chunks($1::vector, $2)", qe, limit))
    results_lists = await asyncio.gather(*tasks)
    # Deduplicate by chunk_id, keep highest similarity
    seen = {}
    for results in results_lists:
        for row in results:
            cid = row['chunk_id']
            if cid not in seen or row['similarity'] > seen[cid]['similarity']:
                seen[cid] = row
    return sorted(seen.values(), key=lambda x: x['similarity'], reverse=True)[:limit]

Pros: higher recall for ambiguous queries, captures different perspectives, parallel execution stays fast.
Cons: up to 4× database queries, higher API cost, possible redundant results.
When to use: exploratory questions where multiple interpretations exist.
Strategy 6: Agent‑Based RAG
An AI agent equipped with several retrieval tools (vector search, full‑document fetch, SQL query) decides at runtime which tool to invoke.
from pydantic_ai import Agent

agent = Agent(
    'openai:gpt-4o',
    system_prompt='You are a RAG assistant with multiple retrieval tools. Choose the appropriate tool for each query.'
)

@agent.tool_plain  # tool_plain: these tools do not need the agent's run context
async def search_knowledge_base(query: str, limit: int = 5) -> str:
    qe = await embedder.embed_query(query)
    results = await db.match_chunks(qe, limit)
    return format_results(results)

@agent.tool_plain
async def retrieve_full_document(document_title: str) -> str:
    result = await db.query(
        "SELECT title, content FROM documents WHERE title ILIKE $1",
        f"%{document_title}%"
    )
    return f"**{result['title']}**\n{result['content']}"

@agent.tool_plain
async def sql_query(question: str) -> str:
    # execute_safe_sql is a placeholder for a guarded text-to-SQL helper
    return execute_safe_sql(question)

Pros: highly flexible, can combine heterogeneous data sources, adapts to query complexity.
Cons: more complex to implement, behavior less predictable, higher latency due to multi‑step reasoning.
When to use: heterogeneous environments with documents, databases, and APIs.
Strategy 7: Self‑Reflective RAG
After an initial retrieval, the system grades relevance; if the score is low, it refines the query and searches again.
async def search_with_self_reflection(query: str, limit: int = 5, max_iterations: int = 2) -> dict:
    """Self-correcting search loop"""
    for iteration in range(max_iterations):
        results = await vector_search(query, limit)
        grade_prompt = f"""Query: {query}
Retrieved docs: {format_docs_for_grading(results)}
Rate relevance 1-5 (numeric only)."""
        grade_resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": grade_prompt}],
            temperature=0,
        )
        grade = int(grade_resp.choices[0].message.content.strip().split()[0])
        if grade >= 3:
            return {"results": results, "iterations": iteration + 1, "final_query": query}
        if iteration < max_iterations - 1:
            refine_prompt = f"""The query "{query}" returned low-relevance results. Suggest an improved query that might retrieve better documents. Respond with only the new query."""
            refined_resp = await client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": refine_prompt}],
                temperature=0.5,
            )
            query = refined_resp.choices[0].message.content.strip()
    return {"results": results, "iterations": max_iterations, "final_query": query}

Pros: self‑correction, can recover from poor initial results.
Cons: highest latency (multiple LLM calls), most expensive, may still fail on extremely hard queries.
When to use: mission‑critical applications where answer correctness outweighs latency.
Strategy 8: Knowledge‑Graph Augmented Retrieval
Combine vector similarity with graph‑database traversal to capture explicit entity relationships.
from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType

graphiti = Graphiti("neo4j://localhost:7687", "neo4j", "password")

async def ingest_document(text: str, source: str):
    await graphiti.add_episode(
        name=source,
        episode_body=text,
        source=EpisodeType.text,
        source_description=f"Document: {source}"
    )

async def search_knowledge_graph(query: str) -> str:
    results = await graphiti.search(query=query, num_results=5)
    formatted = []
    for r in results:
        formatted.append(
            f"Entity: {r.node.name}\nType: {r.node.type}\nRelations: {r.relationships}"
        )
    return "\n---\n".join(formatted)

Pros: captures relationships missed by pure text similarity, reduces hallucinations, ideal for highly interconnected data.
Cons: requires Neo4j (or similar) infrastructure, setup is complex, slower and more costly, needs entity extraction.
When to use: domains where entity relationships are crucial (medical networks, financial systems, research databases).
Strategy 9: Hierarchical RAG
Create parent‑child chunk relationships; child chunks are searched for precision, parent chunks are returned for context.
import json

def ingest_hierarchical(document: str, title: str):
    # Parent: large 2000-char blocks
    parent_chunks = [document[i:i+2000] for i in range(0, len(document), 2000)]
    for pid, parent in enumerate(parent_chunks):
        metadata = {"heading": f"{title} - Part {pid}"}
        db.execute(
            "INSERT INTO parent_chunks (id, content, metadata) VALUES (%s, %s, %s)",
            (pid, parent, json.dumps(metadata))
        )
        # Child: smaller 500-char blocks linked back to their parent
        child_chunks = [parent[j:j+500] for j in range(0, len(parent), 500)]
        for child in child_chunks:
            embedding = get_embedding(child)
            db.execute(
                "INSERT INTO child_chunks (content, embedding, parent_id) VALUES (%s, %s, %s)",
                (child, embedding, pid)
            )

async def hierarchical_search(query: str) -> str:
    query_emb = get_embedding(query)
    results = await db.query(
        # search the precise child chunks, return their parent chunks for context;
        # pgvector-style cosine-distance ordering assumed
        """SELECT p.content, p.metadata FROM child_chunks c
           JOIN parent_chunks p ON c.parent_id = p.id
           ORDER BY c.embedding <=> $1 LIMIT 3""",
        query_emb,
    )
    formatted = []
    for content, metadata in results:
        meta = json.loads(metadata)
        formatted.append(f"[{meta['heading']}]\n{content}")
    return "\n".join(formatted)

Pros: balances precision and context, reduces noise, natural for structured documents.
Cons: requires parent‑child indexing, more complex to maintain, needs careful hierarchy design.
When to use: technical manuals, legal contracts, research papers with clear sections.
Strategy 10: Late‑Chunking
Embed the entire document first, then pool token embeddings for each chunk, preserving full‑document context in every chunk embedding.
def late_chunk(text: str, chunk_size=512) -> list:
    """Chunk after full-document embedding.

    transformer_embed, tokenize, detokenize and mean_pool are placeholders for a
    long-context embedding model's token-level API.
    """
    # Step 1: embed the whole document (up to the model's max tokens)
    full_doc_token_embeddings = transformer_embed(text)  # token-level embeddings
    # Step 2: define chunk boundaries
    tokens = tokenize(text)
    chunk_boundaries = range(0, len(tokens), chunk_size)
    # Step 3: pool token embeddings for each chunk
    chunks_with_embeddings = []
    for start in chunk_boundaries:
        end = start + chunk_size
        chunk_text = detokenize(tokens[start:end])
        chunk_embedding = mean_pool(full_doc_token_embeddings[start:end])
        chunks_with_embeddings.append((chunk_text, chunk_embedding))
    return chunks_with_embeddings

Pros: retains full‑document context, leverages long‑context models, better semantic understanding.
Cons: needs a long‑context embedding model, more complex implementation, limited by model’s max token count.
When to use: dense technical docs or contracts where every chunk needs the whole context.
Strategy 11: Domain‑Fine‑Tuned Embeddings
Train a sentence‑transformer on domain‑specific query‑document pairs to improve handling of specialized terminology.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def prepare_training_data():
    pairs = [
        ("What is EBITDA?", "EBITDA (Earnings Before Interest, Taxes, Depreciation, and Amortization) ..."),
        ("Explain capital expenditures", "Capital expenditures (CapEx) refer to ..."),
        # ... thousands more pairs
    ]
    # MultipleNegativesRankingLoss expects InputExample objects pairing a query with its positive passage
    return [InputExample(texts=[query, passage]) for query, passage in pairs]

def fine_tune_model():
    model = SentenceTransformer('all-MiniLM-L6-v2')
    train_examples = prepare_training_data()
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=3,
        warmup_steps=100,
    )
    model.save('./fine_tuned_model')
    return model

embedding_model = SentenceTransformer('./fine_tuned_model')

Pros: typically improves accuracy by 5‑10%, better handles domain jargon, smaller fine‑tuned models can outperform larger generic ones.
Cons: requires labeled training data, training time and resources, needs periodic re‑training as terminology evolves.
When to use: domains where generic embeddings perform poorly (medical, legal, finance, specialized tech).
Combining Strategies for Maximum Impact
Single strategies improve results, but combining them yields transformative gains. Three proven stacks are presented.
Stack 1 – Production‑Ready (Best Overall)
Context‑aware chunking
Re‑ranking
Query expansion
Agent‑based RAG
Achieves ~92% accuracy with ~1.2 s latency at roughly $0.003 per query.
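As a rough illustration of how Stack 1 chains together, here is a hedged sketch that reuses the expand_query and search_with_reranking helpers defined above; answer_with_stack1 and its prompt wording are assumptions, not code from the original repository.

async def answer_with_stack1(question: str) -> str:
    # Strategy 4: expand the terse question into a richer query
    expanded = await expand_query(question)
    # Strategies 1 + 3: retrieve context-aware chunks, then re-rank them
    chunks = await search_with_reranking(expanded, limit=5)
    context = "\n\n".join(row['content'] for row in chunks)
    # In the full stack an agent (Strategy 6) would choose between this and other tools
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()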
Stack 2 – High‑Accuracy (Critical Applications)
Contextual retrieval
Multi‑query RAG
Re‑ranking
Self‑reflective RAG
Delivers ~96% accuracy, 2.5 s latency, cost ~ $0.008 per query – ideal for medical, legal, or financial use cases where errors are costly.
Stack 3 – Domain‑Expert (Specialized Fields)
Fine‑tuned embeddings
Contextual retrieval
Knowledge‑graph augmentation
Re‑ranking
Provides ~94% accuracy on domain queries with ~1.8 s latency at ~$0.005 per query after the initial training investment.
Implementation Roadmap
Adopt strategies incrementally.
Stage 1 – Foundations (Week 1): Replace fixed‑size chunking with context‑aware chunking, set up basic vector search with proper embeddings, benchmark baseline accuracy.
Stage 2 – Quick Wins (Weeks 2‑3): Add re‑ranking and query expansion, measure improvements.
Stage 3 – Advanced (Weeks 4‑6): Introduce multi‑query or agent‑based RAG, implement self‑reflection for critical queries, start fine‑tuning embeddings if needed.
Stage 4 – Specialization (Month 2+): Deploy contextual retrieval for high‑value docs, integrate knowledge graphs, continuously fine‑tune embeddings.
Common Pitfalls and How to Avoid Them
Using all strategies at once: leads to complexity, debugging difficulty, and high cost. Start with a minimal stack and iterate.
Skipping baseline measurements: without metrics you cannot prove improvements. Always record accuracy and latency before and after changes; a minimal measurement sketch follows this list.
Fixed‑size chunking: destroys semantic flow. Adopt context‑aware or hierarchical chunking.
Ignoring re‑ranking: vector similarity ≠ relevance. Implement cross‑encoder re‑ranking early.
Neglecting query preprocessing: vague queries fail. At minimum add query expansion.
Relying on a single retrieval method: one size does not fit all. Use an agent or multi‑strategy approach for flexibility.
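A before/after harness can be very small. The sketch below is an assumption rather than code from the source: eval_set holds question/expected-answer pairs, and correctness is judged by a crude substring check that a real pipeline would replace with an LLM or human grader.

import time

async def benchmark(search_fn, eval_set: list) -> dict:
    """Record retrieval accuracy and latency for one pipeline configuration."""
    correct, latencies = 0, []
    for item in eval_set:  # item = {"question": ..., "expected": ...}
        start = time.perf_counter()
        results = await search_fn(item["question"])
        latencies.append(time.perf_counter() - start)
        retrieved_text = " ".join(row['content'] for row in results)
        if item["expected"].lower() in retrieved_text.lower():
            correct += 1
    return {
        "accuracy": correct / len(eval_set),
        "avg_latency_s": sum(latencies) / len(latencies),
    }

Run it against the baseline, then again after each stage of the roadmap, and keep the numbers alongside the per-query cost figures quoted for the three stacks.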
Future Trends in RAG
Smaller, faster models: next‑gen embedding models promise 10× speed with comparable accuracy.
Multimodal RAG: retrieval of images, tables, and charts alongside text for richer context.
Sparse retrieval learning: models like SPLADE combine neural and sparse representations for efficient search.
Conclusion
Building a production‑ready RAG system is less about using the flashiest components and more about systematically addressing the failure modes of naive RAG. Starting with context‑aware chunking and re‑ranking, then layering additional strategies as needed, yields a reliable pipeline that can be measured, tuned, and scaled.
References
GitHub repository – Advanced RAG Strategies: https://github.com/coleam00/ottomator-agents/tree/main/all-rag-strategies
Anthropic Contextual Retrieval research: https://www.anthropic.com/news/contextual-retrieval
Pinecone Re‑ranking guide: https://www.pinecone.io/learn/series/rag/rerankers/
Docling – Context‑aware chunking: https://github.com/DS4SD/docling
Graphiti – Knowledge‑graph RAG: https://github.com/graphiti-ai/graphiti