Step‑by‑Step Guide to Implementing a Hybrid Retrieval Function with RRF Fusion
This article breaks down the end-to-end retrieval function used in a RAG system, walking through its five stages (request construction, hybrid vector + BM25 search, RRF fusion, cross-encoder reranking, and threshold filtering) with concrete Python code, parameter choices, and performance insights.
1. Retrieval Function Overview
The core retrieval function receives a user query and returns the top‑K most relevant document chunks, each with a similarity score and source metadata. The execution flow is divided into five distinct steps.
2. Step 1 – Build Retrieval Request
```python
def retrieval(self, question, embd_mdl, tenant_ids, kb_ids,
              page, page_size, similarity_threshold=0.2,
              vector_similarity_weight=0.3, top=1024,
              rerank_mdl=None, highlight=False):
    ranks = {"total": 0, "chunks": [], "doc_aggs": {}}
    RERANK_PAGE_LIMIT = 3
    req = {
        "kb_ids": kb_ids,
        "size": max(page_size * RERANK_PAGE_LIMIT, 128),
        "question": question,
        "vector": True,
        "topk": top,
        "similarity": similarity_threshold,
    }
```

Key design decisions:
size = max(page_size × 3, 128): fetches enough candidates to fill the first three pages, with a floor of 128 so small page sizes still yield a rich candidate pool for reranking.
RERANK_PAGE_LIMIT = 3: only the first three pages undergo expensive cross-encoder reranking.
top = 1024: maximum number of initial candidates fetched from the vector store.
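The candidate-pool sizing rule is easy to check in isolation. The snippet below is a standalone re-statement of the `size` computation from the request above (the helper name is ours, not the system's):

```python
RERANK_PAGE_LIMIT = 3  # only the first three pages are reranked

def candidate_pool_size(page_size: int, floor: int = 128) -> int:
    """Candidates to fetch: enough for three pages, never below the floor."""
    return max(page_size * RERANK_PAGE_LIMIT, floor)

print(candidate_pool_size(30))  # 128: 30 * 3 = 90 falls below the floor
print(candidate_pool_size(50))  # 150: 50 * 3 exceeds the floor
```

The floor matters in practice: with a typical page size of 10, fetching only 30 candidates would starve both the reranker and the fusion step.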
3. Step 2 – Hybrid Retrieval
The function simultaneously performs vector search and BM25 keyword search, then merges the results.
Vector Search
```python
def get_vector(self, txt, emb_mdl, topk=10, similarity=0.1):
    # Generate the embedding for the query text (via emb_mdl)
    qv = generate_embedding(txt)
    embedding_data = [float(v) for v in qv]
    # Column name encodes the embedding dimension, e.g. q_1024_vec
    vector_column_name = f"q_{len(embedding_data)}_vec"
    return MatchDenseExpr(vector_column_name, embedding_data,
                          'float', 'cosine', topk,
                          {"similarity": similarity})
```

BM25 Search
After tokenizing the query, a MatchTextExpr is built using Elasticsearch's query_string. The crucial parameter minimum_should_match controls the required proportion of query terms that must match.
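As a sketch of what that query body could look like on the Elasticsearch side (the field names, boost value, and 30% threshold here are illustrative assumptions, not the system's actual configuration):

```python
def build_text_query(tokens, min_should_match="30%", boost_title=2.0):
    """Build a query_string clause over the tokenized query terms.

    Assumptions: field names and the 30% minimum_should_match are
    illustrative; the real system derives them from its tokenizer output.
    """
    return {
        "query_string": {
            "query": " ".join(tokens),
            # Search body text and section titles, boosting title matches
            "fields": ["content_ltks", f"section_title^{boost_title}"],
            # Require this share of query terms to match
            "minimum_should_match": min_should_match,
        }
    }

q = build_text_query(["deductible", "waiting", "period"])
```

A lower `minimum_should_match` favors recall; raising it tightens precision for long queries where matching only one or two terms is rarely meaningful.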
Weight Allocation
```python
# Vector weight (default 0.3)
vector_weight = vector_similarity_weight
# BM25 weight = 1 - vector weight (default 0.7)
text_weight = 1.0 - vector_similarity_weight
```

In insurance scenarios, BM25 receives the higher weight because many queries contain precise domain terminology that exact keyword matching captures better than embeddings.
Fusion Algorithm – RRF (Reciprocal Rank Fusion)
Since vector scores (cosine similarity, bounded in [0, 1]) and BM25 scores (unbounded) are not directly comparable, RRF merges the two result lists by rank alone:
```python
def rrf_fusion(vector_results, bm25_results, k=60):
    doc_scores = {}
    # Vector contribution
    for rank, (doc_id, _) in enumerate(vector_results, start=1):
        doc_scores[doc_id] = 1 / (k + rank)
    # BM25 contribution
    for rank, (doc_id, _) in enumerate(bm25_results, start=1):
        if doc_id in doc_scores:
            doc_scores[doc_id] += 1 / (k + rank)
        else:
            doc_scores[doc_id] = 1 / (k + rank)
    # Sort by fused score, highest first
    return sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)
```

The parameter k=60 smooths the contribution of lower-ranked results; it is a widely used empirical value that also performed best in our benchmark tests.
4. Step 3 – Rerank with Pagination Strategy
Reranking is the most computationally expensive step because each candidate chunk is scored by a cross‑encoder (e.g., bge‑reranker‑large). Only the first three pages are reranked:
```python
if page <= RERANK_PAGE_LIMIT:
    if sres.total > 0:
        sim, tsim, vsim = self.rerank_by_model(
            rerank_mdl, sres, question,
            1 - vector_similarity_weight,  # token weight
            vector_similarity_weight)      # vector weight
        # Sort descending by combined score, then slice out the current page
        idx = np.argsort(sim * -1)[(page - 1) * page_size : page * page_size]
    else:
        sim = tsim = vsim = [1] * len(sres.ids)
        idx = list(range(len(sres.ids)))
```

rerank_by_model concatenates (query, document) pairs, feeds them to the cross-encoder, and then combines the cross-encoder score with the original token and vector similarities.
Score Fusion Inside Rerank
sim : final combined similarity used for sorting and threshold filtering.
tsim : pure token similarity (BM25 contribution).
vsim : pure vector similarity (embedding contribution).
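A plausible sketch of that fusion follows. The exact blend inside `rerank_by_model` may differ; the min-max normalization of the raw cross-encoder logits and the assumption that the normalized score takes the place of the vector similarity are ours:

```python
import numpy as np

def fuse_rerank_scores(ce_scores, tsim, tkweight, vtweight):
    """Blend cross-encoder scores with token similarity.

    Assumption: the cross-encoder output stands in for the vector
    similarity after min-max normalization into [0, 1].
    """
    ce = np.asarray(ce_scores, dtype=float)
    span = ce.max() - ce.min()
    # Normalize unbounded logits into [0, 1]; degenerate case -> all ones
    vsim = (ce - ce.min()) / span if span > 0 else np.ones_like(ce)
    sim = tkweight * np.asarray(tsim) + vtweight * vsim
    return sim, np.asarray(tsim), vsim

sim, tsim, vsim = fuse_rerank_scores(
    ce_scores=[2.1, -0.5, 0.8],  # raw cross-encoder logits
    tsim=[0.6, 0.2, 0.4],        # token (BM25-side) similarities
    tkweight=0.7, vtweight=0.3)
order = np.argsort(-sim)  # highest combined score first
```

Normalization is the important step: cross-encoder logits are unbounded, so blending them with [0, 1] token similarities without rescaling would let the reranker dominate regardless of the configured weights.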
5. Step 4 & 5 – Threshold Filtering and Result Assembly
```python
for i in idx:
    # idx is sorted by descending score, so once one result falls below
    # the threshold, all remaining results do too
    if sim[i] < similarity_threshold:
        break
    # Stop when the current page is full
    if len(ranks["chunks"]) >= page_size:
        break
    id = sres.ids[i]
    chunk = sres.field[id]
    dnm = chunk.get("docnm_kwd", "")
    did = chunk.get("doc_id", "")
    d = {
        "chunk_id": id,
        "content_ltks": chunk["content_ltks"],
        "doc_id": did,
        "docnm_kwd": dnm,
        "kb_id": chunk["kb_id"],
        "similarity": sim[i],
        "vector_similarity": vsim[i],
        "term_similarity": tsim[i],
        # Position of the chunk within its source document
        "positions": chunk.get("position_int", []),
    }
    ranks["chunks"].append(d)
    # Document aggregation statistics
    if dnm not in ranks["doc_aggs"]:
        ranks["doc_aggs"][dnm] = {"doc_id": did, "count": 0}
    ranks["doc_aggs"][dnm]["count"] += 1
```

The doc_aggs dictionary records how many chunks of each document were retrieved, which can be used to highlight the most relevant documents in the UI.
6. Elasticsearch Index Configuration
Custom tokenizer + domain dictionary : Adding insurance‑specific terms (e.g., “免赔额”, “犹豫期”) as whole tokens raised token‑level recall from 0.79 to 0.84.
Synonym filter : Maps equivalent terms such as “孩子,儿童,小孩,未成年人” to improve recall for varied user phrasing.
Field boosting : The section_title field weight is set to twice that of the body text, promoting chunks whose titles match the query.
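The index-side pieces above could be sketched as settings like the following. The analyzer name, tokenizer choice, and field layout are illustrative assumptions, not the production configuration:

```python
# Illustrative Elasticsearch index settings: a custom analyzer wired to a
# synonym filter, applied to both title and body fields. The "standard"
# tokenizer stands in for the actual domain tokenizer with its dictionary.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "domain_synonyms": {
                    "type": "synonym",
                    "synonyms": [
                        "孩子, 儿童, 小孩, 未成年人",  # child-related variants
                    ],
                }
            },
            "analyzer": {
                "insurance_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",  # stand-in for domain tokenizer
                    "filter": ["lowercase", "domain_synonyms"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "section_title": {"type": "text", "analyzer": "insurance_analyzer"},
            "content_ltks": {"type": "text", "analyzer": "insurance_analyzer"},
        }
    },
}
```

The 2x boost on `section_title` is then applied at query time (via per-field boosts in the text query) rather than in the mapping, which keeps the weighting tunable without reindexing.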
7. Interview Presentation Tips
When asked to explain the retrieval implementation in an interview, follow this concise narrative:
Draw the five‑step pipeline (30 s): request → hybrid search → RRF fusion → rerank → filter & return.
Describe hybrid search (40 s): vector search with BGE‑M3 on Milvus HNSW index, BM25 via Elasticsearch with custom tokenizer and synonyms, RRF fusion (k = 60) merges ranks.
Explain rerank strategy (30 s): only the first three pages undergo cross‑encoder reranking; later pages use the initial ranking.
Quantify impact (20 s): RRF improves MRR by ~3 %, custom tokenizer lifts BM25 recall, overall system MRR reaches 0.92.
Providing this structured answer demonstrates both strategic understanding and hands‑on implementation experience.
Wu Shixiong's Large Model Academy
We share practical large-model know-how (LLM, RAG, fine-tuning, deployment) to help you go from zero to job offer, whether you are switching careers, preparing for campus recruiting, or seeking a stable large-model position.