Step‑by‑Step Guide to Implementing a Hybrid Retrieval Function with RRF Fusion

This article breaks down the end-to-end retrieval function used in a RAG system, detailing each of its five stages (request construction, hybrid vector + BM25 search, RRF fusion, cross-encoder reranking, and threshold filtering) and providing concrete Python code, parameter choices, and performance insights.


1. Retrieval Function Overview

The core retrieval function receives a user query and returns the top‑K most relevant document chunks, each with a similarity score and source metadata. The execution flow is divided into five distinct steps.

2. Step 1 – Build Retrieval Request

def retrieval(self, question, embd_mdl, tenant_ids, kb_ids,
              page, page_size, similarity_threshold=0.2,
              vector_similarity_weight=0.3, top=1024,
              rerank_mdl=None, highlight=False):
    # Accumulators for the response: total hits, chunk list,
    # and per-document aggregation counts
    ranks = {"total": 0, "chunks": [], "doc_aggs": {}}
    # Only the first three pages of results are cross-encoder reranked
    RERANK_PAGE_LIMIT = 3
    req = {
        "kb_ids": kb_ids,
        # Fetch enough candidates to cover the reranked pages, floor of 128
        "size": max(page_size * RERANK_PAGE_LIMIT, 128),
        "question": question,
        "vector": True,                      # enable dense vector search
        "topk": top,                         # vector-store candidate cap
        "similarity": similarity_threshold,
    }

Key design decisions:

size = max(page_size × 3, 128): guarantees enough candidates to cover the first three pages, which will be reranked, with a floor of 128.

RERANK_PAGE_LIMIT = 3: only the first three pages undergo expensive cross-encoder reranking.

top = 1024: maximum number of initial candidates fetched from the vector store.

3. Step 2 – Hybrid Retrieval

The function simultaneously performs vector search and BM25 keyword search, then merges the results.

Vector Search

def get_vector(self, txt, emb_mdl, topk=10, similarity=0.1):
    # Embed the query text with the caller-supplied embedding model
    # (encode_queries returns the vector plus a token count)
    qv, _ = emb_mdl.encode_queries(txt)
    embedding_data = [float(v) for v in qv]
    # The column name encodes the embedding dimension, e.g. q_1024_vec
    vector_column_name = f"q_{len(embedding_data)}_vec"
    return MatchDenseExpr(vector_column_name, embedding_data,
                          'float', 'cosine', topk,
                          {"similarity": similarity})

BM25 Search

After tokenizing the query, a MatchTextExpr is built on top of Elasticsearch's query_string. The crucial parameter minimum_should_match controls the proportion of query terms that must match.
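A minimal sketch of how that expression might be assembled, by analogy with the get_vector helper above (the tokenize helper, the field list, and the 30% default are illustrative assumptions, not the project's actual code):

def get_text_expr(self, question, min_match="30%"):
    # Tokenize the query and join the tokens into a query_string clause
    tokens = tokenize(question)  # hypothetical tokenizer helper
    matching_text = " ".join(tokens)
    # Fields searched by BM25; the boost mirrors the index configuration
    fields = ["section_title^2", "content_ltks"]
    # minimum_should_match: proportion of query terms that must match
    return MatchTextExpr(fields, matching_text, 100,
                         {"minimum_should_match": min_match})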

Weight Allocation

# Vector weight (default 0.3)
vector_weight = vector_similarity_weight
# BM25 weight = 1 - vector weight (default 0.7)
text_weight = 1.0 - vector_similarity_weight

In insurance‑related scenarios, BM25 receives a higher weight because many queries contain precise terminology.
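For illustration, the weighted combination applied per chunk might look like this sketch (tsim and vsim are the token and vector similarity arrays described in the rerank section below):

import numpy as np

def weighted_similarity(tsim, vsim, vector_similarity_weight=0.3):
    # Complementary weights: lowering the vector weight favors exact
    # terminology matches, which suits insurance-domain queries
    text_weight = 1.0 - vector_similarity_weight
    return (text_weight * np.array(tsim)
            + vector_similarity_weight * np.array(vsim))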

Fusion Algorithm – RRF (Reciprocal Rank Fusion)

Since vector scores (0‑1 cosine) and BM25 scores (unbounded TF‑IDF) are not directly comparable, RRF merges them by rank only:

def rrf_fusion(vector_results, bm25_results, k=60):
    doc_scores = {}
    # Vector contribution
    for rank, (doc_id, _) in enumerate(vector_results, start=1):
        doc_scores[doc_id] = 1 / (k + rank)
    # BM25 contribution
    for rank, (doc_id, _) in enumerate(bm25_results, start=1):
        if doc_id in doc_scores:
            doc_scores[doc_id] += 1 / (k + rank)
        else:
            doc_scores[doc_id] = 1 / (k + rank)
    # Sort by fused score
    return sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)

The parameter k=60 smooths the contribution of lower-ranked results; it is the empirical value recommended in the original RRF paper and also performed best in the author's benchmark tests.
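A quick usage example with toy inputs (the scores are placeholders; RRF ignores them and uses only rank positions):

vector_results = [("doc_a", 0.92), ("doc_b", 0.88), ("doc_c", 0.75)]
bm25_results = [("doc_b", 14.2), ("doc_d", 11.9), ("doc_a", 10.1)]

fused = rrf_fusion(vector_results, bm25_results, k=60)
# doc_b: 1/62 + 1/61 ≈ 0.0325 edges out doc_a: 1/61 + 1/63 ≈ 0.0323,
# because doc_b places highly in both lists
print(fused)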

4. Step 3 – Rerank with Pagination Strategy

Reranking is the most computationally expensive step because each candidate chunk is scored by a cross‑encoder (e.g., bge‑reranker‑large). Only the first three pages are reranked:

# idx holds positions into sres.ids, ordered by descending similarity
if page <= RERANK_PAGE_LIMIT:
    if sres.total > 0:
        sim, tsim, vsim = self.rerank_by_model(
            rerank_mdl, sres, question,
            1 - vector_similarity_weight,  # token weight
            vector_similarity_weight)      # vector weight
        # Sort descending (argsort of negated scores), then slice the page
        idx = np.argsort(sim * -1)[(page-1)*page_size : page*page_size]
else:
    # Beyond the rerank limit: keep the initial ranking untouched
    sim = tsim = vsim = [1] * len(sres.ids)
    idx = list(range(len(sres.ids)))

The rerank_by_model method concatenates each (query, document) pair into a single input, feeds it to the cross-encoder, and then combines the cross-encoder score with the original token and vector similarities.
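A minimal sketch of what rerank_by_model could look like, assuming the cross-encoder wrapper exposes a similarity(query, texts) method and that a token_similarity helper exists (both are interface assumptions for illustration):

import numpy as np

def rerank_by_model(self, rerank_mdl, sres, question,
                    tkweight=0.7, vtweight=0.3):
    texts = [sres.field[cid]["content_ltks"] for cid in sres.ids]
    # Token-level similarity between the query and each chunk
    # (token_similarity is a hypothetical helper in this sketch)
    tsim = np.array([token_similarity(question, t) for t in texts])
    # Cross-encoder scores each (query, chunk) pair jointly
    vsim = np.array(rerank_mdl.similarity(question, texts))
    # Weighted combination drives the final ordering
    sim = tkweight * tsim + vtweight * vsim
    return sim, tsim, vsim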

Score Fusion Inside Rerank

sim: final combined similarity used for sorting and threshold filtering.

tsim: pure token similarity (BM25 contribution).

vsim: pure vector similarity (embedding contribution).

5. Step 4 & 5 – Threshold Filtering and Result Assembly

for i in idx:
    # idx is sorted by descending similarity, so the first score below
    # the threshold means every remaining score fails too
    if sim[i] < similarity_threshold:
        break
    # Stop when the current page is full
    if len(ranks["chunks"]) >= page_size:
        break
    id = sres.ids[i]
    chunk = sres.field[id]
    dnm = chunk.get("docnm_kwd", "")
    did = chunk.get("doc_id", "")
    d = {
        "chunk_id": id,
        "content_ltks": chunk["content_ltks"],
        "doc_id": did,
        "docnm_kwd": dnm,
        "kb_id": chunk["kb_id"],
        "similarity": sim[i],
        "vector_similarity": vsim[i],
        "term_similarity": tsim[i],
        "positions": chunk.get("position_int", []),
    }
    ranks["chunks"].append(d)
    # Per-document aggregation statistics
    if dnm not in ranks["doc_aggs"]:
        ranks["doc_aggs"][dnm] = {"doc_id": did, "count": 0}
    ranks["doc_aggs"][dnm]["count"] += 1

The doc_aggs dictionary records how many chunks of each document were retrieved, which can be used to highlight the most relevant documents in the UI.
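For example, before returning, the aggregation can be flattened into a list sorted by chunk count (one possible way to prepare it for display):

# Flatten doc_aggs into a list sorted by descending chunk count,
# so the UI can surface the most relevant documents first
ranks["doc_aggs"] = [
    {"doc_name": name, "doc_id": agg["doc_id"], "count": agg["count"]}
    for name, agg in sorted(ranks["doc_aggs"].items(),
                            key=lambda x: x[1]["count"], reverse=True)
]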

6. Elasticsearch Index Configuration

Custom tokenizer + domain dictionary: adding insurance-specific terms (e.g., 免赔额 "deductible", 犹豫期 "cooling-off period") as whole tokens raised token-level recall from 0.79 to 0.84.

Synonym filter: maps equivalent terms such as 孩子,儿童,小孩,未成年人 (all meaning "child/minor") to improve recall across varied user phrasing.

Field boosting: the section_title field weight is set to twice that of the body text, promoting chunks whose titles match the query.
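A hedged sketch of such an index configuration with the official Elasticsearch Python client (the index name, analyzer names, and the ik_max_word tokenizer plugin are assumptions for illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local endpoint
es.indices.create(index="insurance_chunks", body={
    "settings": {
        "analysis": {
            "filter": {
                "insurance_synonyms": {
                    "type": "synonym",
                    # Equivalent terms for "child/minor"
                    "synonyms": ["孩子,儿童,小孩,未成年人"],
                }
            },
            "analyzer": {
                "insurance_analyzer": {
                    "type": "custom",
                    "tokenizer": "ik_max_word",  # assumed Chinese tokenizer plugin
                    "filter": ["lowercase", "insurance_synonyms"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "section_title": {"type": "text", "analyzer": "insurance_analyzer"},
            "content_ltks": {"type": "text", "analyzer": "insurance_analyzer"},
        }
    },
})
# Field boosting is applied at query time,
# e.g. fields=["section_title^2", "content_ltks"]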

7. Interview Presentation Tips

When asked to explain the retrieval implementation in an interview, follow this concise narrative:

Draw the five‑step pipeline (30 s): request → hybrid search → RRF fusion → rerank → filter & return.

Describe hybrid search (40 s): vector search with BGE‑M3 on a Milvus HNSW index, BM25 via Elasticsearch with a custom tokenizer and synonyms, and RRF fusion (k = 60) to merge the ranked lists.

Explain rerank strategy (30 s): only the first three pages undergo cross‑encoder reranking; later pages use the initial ranking.

Quantify impact (20 s): RRF improves MRR by ~3 %, custom tokenizer lifts BM25 recall, overall system MRR reaches 0.92.

Providing this structured answer demonstrates both strategic understanding and hands‑on implementation experience.
