20 Cutting‑Edge RAG Optimization Techniques: From Semantic Chunking to Self‑RAG
This article systematically presents twenty practical RAG (Retrieval‑Augmented Generation) optimization methods—covering semantic chunking, chunk‑size evaluation, context‑enhanced retrieval, query transformation, re‑ranking, feedback loops, multimodal and graph RAG, hierarchical retrieval, HyDE, Self‑RAG and reinforcement‑learning‑enhanced RAG—each with clear Python code examples, advantages, limitations and ideal use‑cases.
Method 1: Semantic Chunking
Core idea: split text at sentence‑level similarity breakpoints instead of fixed character counts. When adjacent sentences have similarity below a threshold, a chunk boundary is created to preserve semantic integrity.
# ========== Method 1: Semantic Chunking ==========
import numpy as np

# Compute breakpoints based on similarity between adjacent sentences
def compute_breakpoints(similarities, method="percentile", threshold=90):
    """Calculate breakpoint indices where similarity drops sharply.
    method: percentile / standard_deviation / interquartile
    """
    breakpoints = []
    if method == "percentile":
        threshold_value = np.percentile(similarities, 100 - threshold)
        for i, sim in enumerate(similarities):
            if sim < threshold_value:
                breakpoints.append(i)
    elif method == "standard_deviation":
        mean_sim = np.mean(similarities)
        std_sim = np.std(similarities)
        threshold_value = mean_sim - std_sim
        for i, sim in enumerate(similarities):
            if sim < threshold_value:
                breakpoints.append(i)
    elif method == "interquartile":
        q1 = np.percentile(similarities, 25)
        q3 = np.percentile(similarities, 75)
        iqr = q3 - q1
        threshold_value = q1 - 1.5 * iqr
        for i, sim in enumerate(similarities):
            if sim < threshold_value:
                breakpoints.append(i)
    return breakpoints

def split_into_chunks(sentences, breakpoints):
    """Split a list of sentences into chunks according to breakpoints."""
    chunks = []
    start = 0
    for bp in breakpoints:
        chunks.append(" ".join(sentences[start:bp+1]))
        start = bp + 1
    if start < len(sentences):
        chunks.append(" ".join(sentences[start:]))
    return chunks

Advantages: avoids arbitrary character limits and keeps semantic coherence. Limitations: requires precomputed similarity scores and a small annotated or automatically built test set for threshold tuning; computing breakpoints adds overhead.
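The functions above assume that a list of adjacent-sentence similarities already exists. A minimal sketch of how that list could be produced, assuming the create_embeddings and cosine_similarity helpers used elsewhere in this article (the naive sentence splitting here is only for illustration):

# --- Sketch: adjacent-sentence similarities (assumes create_embeddings / cosine_similarity helpers) ---
def adjacent_sentence_similarities(text):
    """Split text into sentences and score each adjacent pair."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]  # naive sentence split
    embeddings = [create_embeddings(s) for s in sentences]
    similarities = [cosine_similarity(embeddings[i], embeddings[i + 1])
                    for i in range(len(embeddings) - 1)]
    return sentences, similarities

# Usage: tie the pieces together
# sentences, sims = adjacent_sentence_similarities(extracted_text)
# chunks = split_into_chunks(sentences, compute_breakpoints(sims, method="percentile", threshold=90))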
Method 2: Chunk‑Size Evaluation
Core idea: evaluate multiple chunk sizes (e.g., 128, 256, 512 tokens) by indexing, retrieving and generating, then compare faithfulness and relevance metrics to pick the optimal size.
# ========== Method 2: Chunk‑Size Evaluation ==========
chunk_sizes = [128, 256, 512]
text_chunks_dict = {size: chunk_text(extracted_text, size, size // 5) for size in chunk_sizes}

for size in chunk_sizes:
    chunks = text_chunks_dict[size]
    embeddings = create_embeddings(chunks)
    for question in test_questions:
        results = semantic_search(question, chunks, embeddings)
        context = results[0][0]
        response = generate_response(question, context)
        faithfulness = evaluate_faithfulness(response, context)
        relevancy = evaluate_relevancy(response, question)

Advantages: data‑driven selection of the best chunk length. Limitations: requires a labeled test set and incurs extra compute for multiple indexing and generation passes.
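The snippet relies on a chunk_text helper that is not shown. A minimal sliding-window version consistent with how it is called in this method (text, chunk size, overlap) might look like the following; the character-based splitting is an assumption for illustration:

# --- Sketch: a simple chunk_text helper (hypothetical; matches the call signature used above) ---
def chunk_text(text, chunk_size, overlap):
    """Split text into fixed-size chunks with the given overlap."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks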
Method 3: Context‑Enhanced Retrieval
Core idea: after retrieving the most relevant chunk, also fetch its immediate neighbours (previous and next chunks) to provide surrounding context and avoid “isolated fragment” problems.
# ========== Method 3: Context‑Enhanced Retrieval ==========
def context_enriched_search(query, text_chunks, embeddings, k=1, context_size=1):
    """Retrieve top‑k chunks and expand each with context_size surrounding chunks on both sides."""
    query_embedding = create_embeddings(query)
    similarities = [cosine_similarity(query_embedding, emb) for emb in embeddings]
    sorted_indices = np.argsort(similarities)[::-1]
    results = []
    for idx in sorted_indices[:k]:
        start = max(0, idx - context_size)
        end = min(len(text_chunks), idx + context_size + 1)
        context_window = " ".join(text_chunks[start:end])
        results.append({"text": context_window, "similarity": similarities[idx], "center_chunk": int(idx)})
    return results

Advantages: simple, requires no index changes, and improves answers that need surrounding information. Limitations: may introduce irrelevant neighbouring text, increasing token usage.
Method 4: Context‑Chunk Header Extraction (CCH)
Core idea: generate a concise descriptive title for each chunk using an LLM, embed both the chunk text and its title, and during retrieval combine text‑similarity and title‑similarity scores.
# ========== Method 4: Context‑Chunk Header Extraction (CCH) ==========
def generate_chunk_header(chunk):
    """Use an LLM to create a short title summarising the chunk."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": "Generate a concise, descriptive title for the following text."},
                  {"role": "user", "content": chunk}]
    )
    return response.choices[0].message.content.strip()

def search_with_headers(query, chunks, text_embeddings, header_embeddings):
    """Retrieve by averaging the similarity of text and title embeddings."""
    query_embedding = create_embeddings(query)
    results = []
    for i in range(len(chunks)):
        sim_text = cosine_similarity(query_embedding, text_embeddings[i])
        sim_header = cosine_similarity(query_embedding, header_embeddings[i])
        avg_similarity = (sim_text + sim_header) / 2
        results.append({"text": chunks[i], "similarity": avg_similarity})
    results.sort(key=lambda x: x["similarity"], reverse=True)
    return results

Advantages: titles capture high‑level topics, improving retrieval for conceptual queries. Limitations: extra LLM calls increase cost and latency.
Method 5: Document‑Enhanced RAG (Question Generation)
Core idea: for each chunk, generate several possible questions it can answer; store both the chunk and its generated questions in the vector store. At query time, the system can match the query against either the original chunk or the synthetic questions.
# ========== Method 5: Document‑Enhanced RAG (Question Generation) ==========
def generate_questions(text_chunk, num_questions=5):
    """Ask an LLM to produce N questions that the chunk can answer."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": "You are an expert question generator."},
                  {"role": "user", "content": f"Based on the following text, generate {num_questions} questions that the text can answer.\n{text_chunk}"}]
    )
    questions = response.choices[0].message.content.strip().split("\n")
    return [q.strip() for q in questions if q.strip()]

# Indexing example
vector_store = SimpleVectorStore()
for chunk in chunks:
    chunk_emb = create_embeddings(chunk)
    vector_store.add_item(chunk, chunk_emb, metadata={"type": "chunk"})
    for q in generate_questions(chunk):
        q_emb = create_embeddings(q)
        vector_store.add_item(q, q_emb, metadata={"type": "question", "source_chunk": chunk})

Advantages: bridges the lexical gap between user questions and document phrasing. Limitations: extra LLM calls and storage overhead.
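At query time, hits on synthetic questions have to be mapped back to the chunk they were generated from. A minimal sketch, assuming the SimpleVectorStore above exposes a similarity_search that returns items together with their metadata:

# --- Sketch: resolving question hits to their source chunks (assumed store API) ---
def search_with_generated_questions(query, vector_store, k=5):
    query_emb = create_embeddings(query)
    results = vector_store.similarity_search(query_emb, k=k)
    contexts = []
    for r in results:
        if r["metadata"]["type"] == "question":
            contexts.append(r["metadata"]["source_chunk"])  # the answer lives in the original chunk
        else:
            contexts.append(r["text"])
    return list(dict.fromkeys(contexts))  # de-duplicate while preserving rank order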
Method 6: Query Transformation
Core idea: before retrieval, transform the original user query via three strategies—rewriting, step‑back (generating a broader background question), and sub‑query decomposition.
# ========== Method 6: Query Transformation ==========
def rewrite_query(original_query):
    """Rewrite a vague or conversational query into a clearer, more specific form."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": "You are a query rewriting expert. Output ONLY the rewritten query."},
                  {"role": "user", "content": original_query}]
    )
    return response.choices[0].message.content.strip()

def generate_step_back_query(original_query):
    """Create a broader background question to retrieve supporting context first."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": "You are a query generation expert. Generate a broader 'step-back' question. Output ONLY the step-back query."},
                  {"role": "user", "content": original_query}]
    )
    return response.choices[0].message.content.strip()

def decompose_query(original_query):
    """Split a complex query into 2‑4 simpler sub‑queries."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": "Decompose the following complex question into 2-4 simpler sub-questions. Output ONLY the sub-questions, one per line."},
                  {"role": "user", "content": original_query}]
    )
    sub_queries = response.choices[0].message.content.strip().split("\n")
    return [sq.strip() for sq in sub_queries if sq.strip()]

Advantages: improves recall for ambiguous or multi‑step queries without changing the index. Limitations: extra LLM calls increase latency; quality depends on the prompts.
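The transformed queries still need to be fed into retrieval. A minimal sketch of one option, running retrieval over each sub-query and merging the hits, assuming the same vector-store API used in the other methods:

# --- Sketch: retrieval over decomposed sub-queries with merged, de-duplicated hits (assumed store API) ---
def retrieve_with_decomposition(original_query, vector_store, k_per_subquery=3):
    sub_queries = decompose_query(original_query) or [original_query]
    merged, seen = [], set()
    for sq in sub_queries:
        sq_emb = create_embeddings(sq)
        for r in vector_store.similarity_search(sq_emb, k=k_per_subquery):
            if r["text"] not in seen:  # keep one copy per chunk
                seen.add(r["text"])
                merged.append(r)
    return merged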
Method 7: Re‑ranking (Reranking)
Core idea: after an initial vector or BM25 retrieval, use an LLM to score each candidate document on a 0‑10 relevance scale and reorder accordingly.
# ========== Method 7: Re‑ranking (Reranking) ==========
def rerank_with_llm(query, documents, model="gpt-3.5-turbo"):
    """Score each document with an LLM and sort by descending score."""
    reranked = []
    for doc in documents:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": "You are a relevance evaluation expert. Output ONLY a score from 0 to 10."},
                      {"role": "user", "content": f"Query: {query}\nDocument: {doc['text']}"}]
        )
        score = float(response.choices[0].message.content.strip())
        reranked.append({**doc, "rerank_score": score})
    reranked.sort(key=lambda x: x["rerank_score"], reverse=True)
    return reranked

Advantages: can dramatically improve top‑k precision. Limitations: requires a separate LLM call per candidate, so cost grows with the size of the candidate set.
Method 8: Fusion Retrieval
Core idea: run both dense vector search and sparse BM25 search, min‑max normalise each score, then combine with a weight α (default 0.5) to produce a final score.
# ========== Method 8: Fusion Retrieval ==========
def bm25_search(bm25, chunks, query, k=5):
    """BM25 keyword search returning the top‑k chunks with BM25 scores."""
    query_tokens = query.split()
    scores = bm25.get_scores(query_tokens)
    results = []
    for i, score in enumerate(scores):
        results.append({"text": chunks[i]["text"], "metadata": {"index": i}, "bm25_score": float(score)})
    results.sort(key=lambda x: x["bm25_score"], reverse=True)
    return results[:k]

def fusion_retrieval(query, chunks, vector_store, bm25_index, k=5, alpha=0.5):
    """Combine vector and BM25 scores, normalise, and return the top‑k fused results."""
    epsilon = 1e-8
    query_emb = create_embeddings(query)
    vector_results = vector_store.similarity_search_with_scores(query_emb, k=len(chunks))
    bm25_results = bm25_search(bm25_index, chunks, query, k=len(chunks))
    vector_scores = np.array([next((r["similarity"] for r in vector_results if r["metadata"]["index"] == i), 0.0)
                              for i in range(len(chunks))])
    bm25_scores = np.array([next((r["bm25_score"] for r in bm25_results if r["metadata"]["index"] == i), 0.0)
                            for i in range(len(chunks))])
    v_min, v_max = vector_scores.min(), vector_scores.max()
    b_min, b_max = bm25_scores.min(), bm25_scores.max()
    norm_vector = (vector_scores - v_min) / (v_max - v_min + epsilon)
    norm_bm25 = (bm25_scores - b_min) / (b_max - b_min + epsilon)
    final_scores = alpha * norm_vector + (1 - alpha) * norm_bm25
    top_indices = np.argsort(final_scores)[::-1][:k]
    return [{"text": chunks[i]["text"], "score": float(final_scores[i])} for i in top_indices]

Advantages: leverages the complementary strengths of semantic similarity and exact term matching. Limitations: needs both indexes plus normalisation logic.
Method 9: Context Compression
Core idea: after retrieval, compress each chunk to keep only the portion most relevant to the query, using one of three strategies—selective sentence filtering, summarisation, or extraction.
# ========== Method 9: Context Compression ==========
def compress_chunk(chunk, query, compression_type="selective", model="gpt-3.5-turbo"):
    """Compress a retrieved chunk according to the chosen strategy."""
    if compression_type == "selective":
        system_prompt = "You are an expert at information filtering. Return ONLY sentences that are directly relevant to the query. Preserve original wording and order."
    elif compression_type == "summary":
        system_prompt = "Summarise the chunk focusing ONLY on information relevant to the query."
    else:  # extraction
        system_prompt = "Extract key facts or structured information relevant to the query."
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": f"Query: {query}\nDocument chunk:\n{chunk}"}]
    )
    return response.choices[0].message.content

Advantages: reduces token usage and noise, improving generation quality. Limitations: an extra LLM call per chunk; aggressive compression may discard needed details.
Method 10: Feedback Loop RAG
Core idea: record query‑response‑rating triples, use an LLM to judge whether past feedback is relevant to the current query, adjust similarity scores accordingly, and continuously enrich the vector store with high‑quality Q&A pairs.
# ========== Method 10: Feedback Loop RAG ==========
def assess_feedback_relevance(query, doc_text, feedback):
    """Ask the LLM whether a past feedback item is relevant to the current query and document."""
    system_prompt = "You are an AI system that determines if past feedback is relevant to a current query and document. Answer ONLY 'yes' or 'no'."
    user_prompt = (f"Query: {query}\n"
                   f"Past query: {feedback['query']}\n"
                   f"Document: {doc_text[:500]}...\n"
                   f"Past response: {feedback['response'][:500]}...")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_prompt}],
        temperature=0)
    return "yes" in response.choices[0].message.content.lower()

def adjust_relevance_scores(query, results, feedback_data):
    """Boost or dampen similarity scores using relevant past feedback (ratings on a 0‑5 scale)."""
    if not feedback_data:
        return results
    for result in results:
        doc = result["text"]
        relevant = [fb for fb in feedback_data if assess_feedback_relevance(query, doc, fb)]
        if relevant:
            avg = sum(fb["relevance"] for fb in relevant) / len(relevant)
            modifier = 0.5 + avg / 5.0  # maps a 0‑5 rating to a 0.5‑1.5 multiplier
            result["similarity"] *= modifier
            result["feedback_applied"] = True
    results.sort(key=lambda x: x["similarity"], reverse=True)
    return results

Advantages: continuously leverages real user signals to improve ranking. Limitations: needs sufficient high‑quality feedback; noisy feedback can hurt performance.
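The core idea also mentions enriching the vector store with high-quality Q&A pairs, which the snippet above does not show. A minimal sketch, assuming the SimpleVectorStore API from Method 5 and a 0‑5 relevance rating on each feedback item:

# --- Sketch: folding highly rated Q&A pairs back into the index (assumed store API and rating scale) ---
def store_feedback_as_knowledge(vector_store, feedback, min_relevance=4):
    """Index a well-rated question/answer pair so future queries can retrieve it directly."""
    if feedback["relevance"] >= min_relevance:
        qa_text = f"Question: {feedback['query']}\nAnswer: {feedback['response']}"
        qa_emb = create_embeddings(qa_text)
        vector_store.add_item(qa_text, qa_emb,
                              metadata={"type": "qa_pair", "relevance": feedback["relevance"]})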
Method 11: Adaptive Retrieval
Core idea: classify the query into one of four types (Factual, Analytical, Opinion, Contextual) and dispatch it to a specialised retrieval strategy tailored for that type.
# ========== Method 11: Adaptive Retrieval ==========
def classify_query(query, model="gpt-3.5-turbo"):
    """Classify query into Factual / Analytical / Opinion / Contextual."""
    system_prompt = "Classify the given query into exactly one of these categories: Factual, Analytical, Opinion, Contextual. Return ONLY the category name."
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": f"Classify this query: {query}"}],
        temperature=0)
    cat = response.choices[0].message.content.strip()
    for valid in ["Factual", "Analytical", "Opinion", "Contextual"]:
        if valid.lower() in cat.lower():
            return valid
    return "Factual"

def adaptive_retrieval(query, vector_store, k=4, user_context=None):
    qtype = classify_query(query)
    if qtype == "Factual":
        return factual_retrieval_strategy(query, vector_store, k)
    elif qtype == "Analytical":
        return analytical_retrieval_strategy(query, vector_store, k)
    elif qtype == "Opinion":
        return opinion_retrieval_strategy(query, vector_store, k)
    else:
        return contextual_retrieval_strategy(query, vector_store, k, user_context)

Advantages: a single system can handle diverse query intents efficiently. Limitations: mis‑classification propagates errors; requires well‑designed per‑type strategies.
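The per-type strategies themselves are not defined above. As one illustration only, a factual strategy might sharpen the query and then rerank candidates, reusing rewrite_query (Method 6) and rerank_with_llm (Method 7); the name and behaviour here are assumptions, not the article's reference implementation:

# --- Sketch: one possible factual_retrieval_strategy (hypothetical, built from Method 6 and 7 helpers) ---
def factual_retrieval_strategy(query, vector_store, k=4):
    precise_query = rewrite_query(query)                    # sharpen the query for exact facts
    query_emb = create_embeddings(precise_query)
    candidates = vector_store.similarity_search(query_emb, k=k * 2)
    return rerank_with_llm(precise_query, candidates)[:k]   # keep the k most relevant after reranking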
Method 12: Self‑RAG
Core idea: a multi‑stage decision pipeline that (1) decides whether retrieval is needed, (2) retrieves documents if needed, (3) evaluates relevance, (4) generates a response, (5) assesses how well the response is supported by the context, (6) rates utility, and (7) selects the best answer based on combined scores.
# ========== Method 12: Self‑RAG ==========
import re

def determine_if_retrieval_needed(query):
    system_prompt = "Decide if retrieval is necessary for the query. Answer ONLY 'Yes' or 'No'."
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": f"Query: {query}\nIs retrieval necessary?"}],
        temperature=0)
    return "yes" in response.choices[0].message.content.lower()

def evaluate_relevance(query, context):
    system_prompt = "Rate how relevant the document is to the query on a scale from 0 to 1. Output ONLY the number."
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": f"Query: {query}\nDocument: {context}"}],
        temperature=0)
    return float(response.choices[0].message.content.strip())

def assess_support(response, context):
    system_prompt = "Determine if the response is Fully supported, Partially supported, or No support by the context. Output ONLY the category."
    response_llm = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": f"Context: {context}\nResponse: {response}"}],
        temperature=0)
    return response_llm.choices[0].message.content.strip().lower()

def rate_utility(query, response):
    system_prompt = "Rate the utility of the response on a scale from 1 to 5. Output ONLY the number."
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": f"Query: {query}\nResponse: {response}"}],
        temperature=0)
    return int(re.search(r"[1-5]", resp.choices[0].message.content).group())

def generate_response(query, context=None):
    system_prompt = "You are a helpful AI assistant. Answer the query using only the provided context. If the context lacks the answer, say you don't have enough information."
    user_prompt = f"Context:\n{context}\nQuery: {query}" if context else f"Query: {query}"
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_prompt}],
        temperature=0.2)
    return resp.choices[0].message.content.strip()

def self_rag(query, vector_store, top_k=3):
    retrieval_needed = determine_if_retrieval_needed(query)
    metrics = {"retrieval_needed": retrieval_needed, "documents_retrieved": 0, "relevant_documents": 0,
               "response_support_ratings": [], "utility_ratings": []}
    best_response = None
    best_score = -1
    if retrieval_needed:
        query_emb = create_embeddings(query)
        results = vector_store.similarity_search(query_emb, k=top_k)
        metrics["documents_retrieved"] = len(results)
        relevant_contexts = []
        for r in results:
            rel = evaluate_relevance(query, r["text"])
            if rel >= 0.5:  # treat relevance scores of 0.5 or higher as relevant
                relevant_contexts.append(r["text"])
        metrics["relevant_documents"] = len(relevant_contexts)
        if relevant_contexts:
            for ctx in relevant_contexts:
                resp = generate_response(query, ctx)
                support = assess_support(resp, ctx)
                metrics["response_support_ratings"].append(support)
                utility = rate_utility(query, resp)
                metrics["utility_ratings"].append(utility)
                support_score = {"fully supported": 3, "partially supported": 1, "no support": 0}.get(support, 0)
                overall = support_score * 5 + utility
                if overall > best_score:
                    best_score = overall
                    best_response = resp
        else:
            best_response = generate_response(query)
    else:
        best_response = generate_response(query)
    metrics["best_score"] = best_score
    metrics["used_retrieval"] = retrieval_needed and best_score > 0
    return {"query": query, "response": best_response, "metrics": metrics}

Advantages: reduces hallucination by skipping unnecessary retrieval, adds multiple evaluation checkpoints, and selects the highest‑scoring answer. Limitations: the many LLM calls increase latency and cost; each checkpoint needs careful prompt engineering.
Method 13: Proposition Chunking
Core idea: break each text block into atomic, self‑contained propositions (single facts) using an LLM, then index those propositions for fine‑grained retrieval.
# ========== Method 13: Proposition Chunking ==========
def generate_propositions(chunk):
    """Ask an LLM to split a chunk into independent factual propositions."""
    system_prompt = "Break down the following text into simple, self‑contained propositions. Each proposition should express a single fact, be understandable without context, and use full names. Output ONLY the list."
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": f"Text to convert:\n{chunk}"}],
        temperature=0)
    raw = response.choices[0].message.content.strip().split("\n")
    clean = []
    for prop in raw:
        cleaned = re.sub(r"^\s*(\d+\.|\-|\*)\s*", "", prop).strip()  # strip list markers
        if cleaned and len(cleaned) > 10:
            clean.append(cleaned)
    return clean

Advantages: enables ultra‑fine retrieval for fact‑oriented QA. Limitations: high preprocessing cost and a larger index.
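The core idea calls for indexing the propositions for fine-grained retrieval, which the snippet above stops short of. A minimal sketch, assuming the SimpleVectorStore API used in Method 5 and a list of chunks already in memory:

# --- Sketch: indexing propositions with a pointer back to their source chunk (assumed store API) ---
proposition_store = SimpleVectorStore()
for chunk_id, chunk in enumerate(chunks):
    for prop in generate_propositions(chunk):
        prop_emb = create_embeddings(prop)
        proposition_store.add_item(prop, prop_emb,
                                   metadata={"type": "proposition", "source_chunk_id": chunk_id})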
Method 14: Multimodal RAG
Core idea: extract both text and images from PDFs, generate captions for images with a vision‑language model, and index the combined text + captions for joint retrieval.
# ========== Method 14: Multimodal RAG ==========
import os
import fitz  # PyMuPDF

def extract_content_from_pdf(pdf_path, output_dir=None):
    """Extract per‑page text and images, saving images to output_dir."""
    text_data, image_paths = [], []
    os.makedirs(output_dir, exist_ok=True)
    with fitz.open(pdf_path) as pdf_file:
        for page_number in range(len(pdf_file)):
            page = pdf_file[page_number]
            txt = page.get_text().strip()
            if txt:
                text_data.append({"content": txt, "metadata": {"page": page_number + 1, "type": "text"}})
            for img_index, img in enumerate(page.get_images(full=True)):
                xref = img[0]
                base_image = pdf_file.extract_image(xref)
                if base_image:
                    img_path = os.path.join(output_dir, f"page_{page_number+1}_img_{img_index+1}.{base_image['ext']}")
                    with open(img_path, "wb") as f:
                        f.write(base_image["image"])
                    image_paths.append({"path": img_path, "metadata": {"page": page_number + 1, "type": "image"}})
    return text_data, image_paths

def generate_image_caption(image_path):
    """Use a VLM (e.g., LLaVA) to produce a descriptive caption for an image."""
    base64_image = encode_image(image_path)
    response = client.chat.completions.create(
        model="llava-hf/llava-1.5-7b-hf",
        messages=[{"role": "system", "content": "Describe images from academic papers in detail."},
                  {"role": "user", "content": [{"type": "text", "text": "Describe this image in detail:"},
                                               {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}]}]
    )
    return response.choices[0].message.content

Advantages: enables “text + image” retrieval, useful for figure‑heavy documents. Limitations: requires image extraction and a capable VLM.
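The core idea ends with indexing the combined text and captions, which the snippet above does not show. A minimal sketch, assuming the SimpleVectorStore and create_embeddings helpers from earlier methods; the function name and the "extracted_images" default directory are illustrative assumptions:

# --- Sketch: building one joint index over page text and image captions (assumed helpers) ---
def build_multimodal_index(pdf_path, output_dir="extracted_images"):
    text_data, image_paths = extract_content_from_pdf(pdf_path, output_dir)
    store = SimpleVectorStore()
    for item in text_data:
        store.add_item(item["content"], create_embeddings(item["content"]), metadata=item["metadata"])
    for img in image_paths:
        caption = generate_image_caption(img["path"])
        meta = {**img["metadata"], "image_path": img["path"]}
        store.add_item(caption, create_embeddings(caption), metadata=meta)
    return store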
Method 15: Fusion Retrieval (Vector + BM25)
Core idea: run dense vector search and BM25 in parallel, min‑max normalise each score, then blend with a weight α (default 0.5) to obtain a final ranking.
# ========== Method 15: Fusion Retrieval ==========
# (implementation identical to Method 8 – see above)

Advantages: combines semantic and lexical signals for more robust recall. Limitations: needs both indexes and careful normalisation.
Method 16: Graph RAG
Core idea: treat each chunk as a node, extract concepts/entities as node attributes, and connect nodes with edges weighted by shared concepts and embedding similarity. Retrieval can then traverse the graph to gather related chunks.
# ========== Method 16: Graph RAG ==========
import networkx as nx

def build_knowledge_graph(chunks):
    """Create a graph where nodes are chunks (with concept lists) and edges reflect shared concepts plus embedding similarity."""
    graph = nx.Graph()
    texts = [c["text"] for c in chunks]
    embeddings = create_embeddings(texts)
    for i, chunk in enumerate(chunks):
        concepts = extract_concepts(chunk["text"])  # LLM‑based concept extraction
        graph.add_node(i, text=chunk["text"], concepts=concepts, embedding=embeddings[i])
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            shared = set(graph.nodes[i]["concepts"]).intersection(set(graph.nodes[j]["concepts"]))
            if shared:
                sim = np.dot(embeddings[i], embeddings[j]) / (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[j]))
                concept_score = len(shared) / min(len(graph.nodes[i]["concepts"]), len(graph.nodes[j]["concepts"]))
                weight = 0.7 * sim + 0.3 * concept_score
                if weight > 0.6:
                    graph.add_edge(i, j, weight=weight, shared_concepts=list(shared))
    return graph, embeddings

Advantages: captures cross‑chunk relationships, useful for knowledge‑graph‑like queries. Limitations: graph construction and traversal can be expensive for large corpora.
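The core idea mentions traversing the graph at retrieval time, which is not shown above. One possible sketch: seed on the node most similar to the query, then expand along the strongest edges. The traversal policy here is an assumption; only the graph structure comes from the snippet above.

# --- Sketch: retrieval that seeds on the best-matching node and expands to strong neighbours ---
def graph_rag_search(query, graph, embeddings, max_nodes=5):
    query_emb = create_embeddings(query)
    sims = [cosine_similarity(query_emb, emb) for emb in embeddings]
    seed = int(np.argmax(sims))              # node most similar to the query
    visited, frontier = {seed}, [seed]
    while frontier and len(visited) < max_nodes:
        node = frontier.pop(0)
        # follow the strongest edges first
        neighbours = sorted(graph[node].items(), key=lambda x: x[1]["weight"], reverse=True)
        for nb, _ in neighbours:
            if nb not in visited and len(visited) < max_nodes:
                visited.add(nb)
                frontier.append(nb)
    return [graph.nodes[n]["text"] for n in visited]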
Method 17: Hierarchical Retrieval
Core idea: build a two‑level index—summaries (e.g., per‑page or per‑chapter) at the top level and fine‑grained chunks at the second level. First retrieve relevant summaries, then only search within the corresponding pages.
# ========== Method 17: Hierarchical Retrieval ==========
def process_document_hierarchically(pdf_path, chunk_size=1000, chunk_overlap=200):
    """Build a summary-level and a detail-level index for the document."""
    pages = extract_text_from_pdf(pdf_path)
    summaries = []
    for page in pages:
        summary = generate_page_summary(page["text"])
        summaries.append({"text": summary, "metadata": {**page["metadata"], "is_summary": True}})
    detailed_chunks = []
    for page in pages:
        detailed_chunks.extend(chunk_text(page["text"], page["metadata"], chunk_size, chunk_overlap))
    summary_store = SimpleVectorStore()
    detailed_store = SimpleVectorStore()
    # (Add embeddings to stores – omitted for brevity)
    return summary_store, detailed_store

def retrieve_hierarchically(query, summary_store, detailed_store, k_summaries=3, k_chunks=5):
    """First find relevant page summaries, then search detailed chunks only within those pages."""
    query_emb = create_embeddings(query)
    summary_results = summary_store.similarity_search(query_emb, k=k_summaries)
    relevant_pages = [r["metadata"]["page"] for r in summary_results]
    def page_filter(metadata):
        return metadata["page"] in relevant_pages
    detailed_results = detailed_store.similarity_search(query_emb, k=k_chunks * len(relevant_pages), filter_func=page_filter)
    return detailed_results

Advantages: scales to very large corpora by limiting fine‑grained search to a small subset of pages. Limitations: recall depends heavily on the quality of the first‑level summaries.
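generate_page_summary is referenced but not defined. A minimal LLM-based version, assuming the same client as the other methods; the prompt wording is an assumption:

# --- Sketch: a minimal generate_page_summary helper (hypothetical prompt) ---
def generate_page_summary(page_text, model="gpt-3.5-turbo"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": "Summarise the following page in 3-5 sentences, keeping key entities and facts."},
                  {"role": "user", "content": page_text}],
        temperature=0)
    return response.choices[0].message.content.strip()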
Method 18: Hypothetical Document Embedding (HyDE)
Core idea: generate a pseudo‑answer document from the query using an LLM, embed that document, and use its embedding for retrieval instead of the raw query embedding.
# ========== Method 18: Hypothetical Document Embedding (HyDE) ==========
def generate_hypothetical_document(query, desired_length=1000):
    """Generate a pseudo‑answer document whose embedding stands in for the query embedding."""
    system_prompt = f"You are an expert document creator. Given a question, generate a detailed document (~{desired_length} characters) that directly answers it. Write as if from an authoritative source."
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": f"Question: {query}\nGenerate a document:"}],
        temperature=0.1)
    return response.choices[0].message.content

def hyde_rag(query, vector_store, k=5):
    hypo_doc = generate_hypothetical_document(query)
    hypo_emb = create_embeddings([hypo_doc])[0]
    retrieved = vector_store.similarity_search(hypo_emb, k=k)
    context = "\n\n".join(chunk["text"] for chunk in retrieved)  # flatten retrieved chunks into one context string
    response = generate_response(query, context)
    return {"query": query, "hypothetical_document": hypo_doc, "retrieved_chunks": retrieved, "response": response}

Advantages: bridges the semantic gap for short queries; no index changes required. Limitations: the extra generation and embedding step adds latency; quality depends on the generated pseudo‑document.
Method 19: Dynamic Correction RAG (CRAG)
Core idea: after initial retrieval, score each document’s relevance (0‑1). If the highest score is high, use only local documents; if low, fall back to web search; if medium, combine both.
# ========== Method 19: Dynamic Correction RAG (CRAG) ==========
def evaluate_document_relevance(query, document):
    """Score the document's relevance to the query on a 0‑1 scale."""
    system_prompt = "Rate how relevant the given document is to the query on a scale from 0 to 1. Provide ONLY the score as a float."
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": f"Query: {query}\nDocument: {document}"}],
        temperature=0,
        max_tokens=5)
    match = re.search(r"(\d+(?:\.\d+)?)", response.choices[0].message.content)
    return float(match.group(1)) if match else 0.5

def crag_process(query, vector_store, k=3):
    """Use local documents, web search, or both, depending on the best relevance score."""
    query_emb = create_embeddings(query)
    retrieved = vector_store.similarity_search(query_emb, k=k)
    scores = [evaluate_document_relevance(query, doc["text"]) for doc in retrieved]
    max_score = max(scores) if scores else 0
    if max_score > 0.7:
        final_knowledge = retrieved[scores.index(max_score)]["text"]
    elif max_score < 0.3:
        web_results, _ = perform_web_search(query)
        final_knowledge = refine_knowledge(web_results)
    else:
        best_doc = retrieved[scores.index(max_score)]["text"]
        web_results, _ = perform_web_search(query)
        final_knowledge = refine_knowledge(best_doc) + "\n\n" + refine_knowledge(web_results)
    return generate_response(query, final_knowledge)

Advantages: avoids hallucinations when local knowledge is insufficient; adapts to the availability of up‑to‑date web information. Limitations: needs a reliable relevance evaluator and a web‑search backend.
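perform_web_search and refine_knowledge are external helpers that the snippet assumes. As an illustration only, refine_knowledge could be an LLM pass that distils raw text into concise factual bullet points; the prompt and behaviour here are assumptions, not the article's reference implementation:

# --- Sketch: a possible refine_knowledge helper that distils raw text into key facts (hypothetical) ---
def refine_knowledge(raw_text, model="gpt-3.5-turbo"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": "Extract the key factual points from the text as concise bullet points. Output ONLY the bullet points."},
                  {"role": "user", "content": raw_text}],
        temperature=0)
    return response.choices[0].message.content.strip()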
Method 20: Reinforcement‑Learning‑Enhanced RAG
Core idea: model the RAG pipeline as a reinforcement‑learning problem where the state includes the query, retrieved context and past feedback, actions are query rewriting, context expansion, filtering or generation, and the reward is similarity between the generated answer and a ground‑truth answer.
# ========== Method 20: Reinforcement‑Learning‑Enhanced RAG ==========
def calculate_reward(response, ground_truth):
    """Reward = embedding similarity between the generated answer and the ground‑truth answer."""
    resp_emb = generate_embeddings([response])[0]
    gt_emb = generate_embeddings([ground_truth])[0]
    return cosine_similarity(resp_emb, gt_emb)

def expand_context(query, current_chunks, top_k=3):
    """Action: add up to top_k new chunks that are not already in the context."""
    additional = retrieve_relevant_chunks(query, top_k=top_k + len(current_chunks))
    new = [c for c in additional if c not in current_chunks]
    return current_chunks + new[:top_k]

def filter_context(query, context_chunks):
    """Action: keep only the chunks most similar to the query (at most 5)."""
    q_emb = generate_embeddings([query])[0]
    chunk_embs = [generate_embeddings([c])[0] for c in context_chunks]
    scores = [cosine_similarity(q_emb, e) for e in chunk_embs]
    sorted_chunks = [c for _, c in sorted(zip(scores, context_chunks), key=lambda x: x[0], reverse=True)]
    return sorted_chunks[:min(5, len(sorted_chunks))]

def policy_network(state, action_space, epsilon=0.2):
    """A simple epsilon‑greedy, rule‑based policy over the RAG actions."""
    if np.random.random() < epsilon:
        return np.random.choice(action_space)
    if len(state["previous_responses"]) == 0:
        return "rewrite_query"
    if state["previous_rewards"] and max(state["previous_rewards"]) < 0.7:
        return "expand_context"
    if len(state["context"]) > 5:
        return "filter_context"
    return "generate_response"

Advantages: end‑to‑end optimisation using real feedback can discover sophisticated strategies. Limitations: requires a reward signal (ground‑truth answers or user feedback) and a stable training loop.
Overall Summary
The article provides a comprehensive toolbox of twenty RAG optimisation techniques, ranging from low‑level chunking strategies and query transformations to system‑level feedback loops, multimodal extensions, graph‑based retrieval, hierarchical indexing, and reinforcement‑learning‑driven end‑to‑end optimisation. Each method is accompanied by concise Python code, a discussion of its core idea, advantages, limitations and typical use‑cases, enabling practitioners to mix and match the approaches that best fit their domain and performance requirements.