Why Hybrid Retrieval Beats Pure Vector Search: BM25, RRF, and Real‑World Gains
This article explains why combining BM25 with dense vector search via Reciprocal Rank Fusion (RRF) improves recall for both exact‑term and semantic queries in a financial‑insurance document corpus. It details the underlying algorithms and parameter choices (such as k = 60), provides Python implementations, and reports measurable performance gains in production.
Why Pure Vector Retrieval Is Insufficient
Before discussing hybrid retrieval, we must understand why vector search alone cannot handle all queries. In a financial‑insurance scenario with 5,000 contracts, users often ask precise questions containing product names and domain‑specific terms, e.g., "What is the waiting period for Ping An Health Insurance 2023?" Vector models encode the whole query into a single embedding and retrieve the most semantically similar passages, but they treat "waiting period" as similar to "observation period" or "exclusion period," potentially returning the wrong product.
This weakness stems from the fact that dense embeddings prioritize semantic similarity and ignore exact keyword matches, leading to low recall for queries that rely on rare or precise terms.
BM25, a classic term‑frequency based retrieval model, excels at exact matching. If a phrase like "Ping An Health Insurance 2023" appears verbatim in a document, BM25 assigns a high score, making it ideal for long‑tail, sparse queries.
In our test set, pure vector retrieval achieved a Recall@5 of 0.48 for exact‑term queries, while BM25 alone reached 0.71. Conversely, for semantic‑rich queries, BM25’s recall dropped to 0.54 whereas vectors achieved 0.79. The complementary strengths motivate a hybrid approach.
BM25 Principle: Term Frequency Saturation and Length Normalization
BM25 (Best Match 25) computes a score based on inverse document frequency (IDF), term frequency (tf), and two hyper‑parameters: k1 (tf saturation) and b (length normalization). The formula can be expressed as:
score = sum_{terms} IDF * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * |D| / avgdl))

IDF gives higher weight to rare terms; for example, "waiting period" has a high IDF in insurance contracts, while the generic word "insurance" has a low IDF.
tf counts term occurrences; BM25 applies a saturation function so that after a certain frequency the contribution levels off.
k1 (default 1.5) controls how quickly tf saturates. In our insurance documents the term distribution is fairly uniform, so the default works well.
b (default 0.75) normalizes for document length, preventing very long contracts from dominating the score.
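To make the saturation behaviour concrete, here is a small illustrative snippet (not part of the production pipeline) that evaluates only the tf component of the formula, assuming a document of exactly average length:

def bm25_tf_component(tf, k1=1.5, b=0.75, doc_len=100, avg_len=100):
    """Per-term contribution before multiplying by IDF."""
    length_norm = 1 - b + b * (doc_len / avg_len)
    return (tf * (k1 + 1)) / (tf + k1 * length_norm)

for tf in [1, 2, 3, 5, 10, 20]:
    print(tf, round(bm25_tf_component(tf), 3))
# 1 -> 1.0, 5 -> 1.923, 20 -> 2.326: growth flattens toward the ceiling k1 + 1 = 2.5,
# so repeating a keyword many times yields diminishing returns.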
Implementation in Python using the rank_bm25 library:
from rank_bm25 import BM25Okapi
import jieba
class BM25Retriever:
def __init__(self, documents, k1=1.5, b=0.75):
tokenized_docs = [list(jieba.cut(doc)) for doc in documents]
self.bm25 = BM25Okapi(tokenized_docs, k1=k1, b=b)
self.documents = documents
def retrieve(self, query, top_k=20):
tokenized_query = list(jieba.cut(query))
scores = self.bm25.get_scores(tokenized_query)
top_indices = scores.argsort()[-top_k:][::-1]
        return [(idx, scores[idx]) for idx in top_indices]

If Elasticsearch is already in use, its built‑in BM25 can be configured as:
{
"settings": {
"similarity": {
"custom_bm25": {
"type": "BM25",
"k1": 1.5,
"b": 0.75
}
}
},
"mappings": {
"properties": {
"content": {"type": "text", "similarity": "custom_bm25"}
}
}
}

Vector Retrieval: Embedding Model Selection and Index Structure
Choosing the right Chinese embedding model dramatically affects recall. We evaluated three candidates:
text2vec-large-chinese – 1024‑dim, large index size.
BGE‑large‑zh (BAAI/bge-large-zh-v1.5) – state‑of‑the‑art on MTEB Chinese leaderboard, 1024‑dim.
BGE‑m3 – multilingual, useful for mixed English‑Chinese terms.
Our experiments showed BGE‑large‑zh‑v1.5 outperformed text2vec by ~4 % on the business test set, so we adopted it.
Embedding generation code:
from sentence_transformers import SentenceTransformer
import numpy as np
class VectorRetriever:
def __init__(self, model_name="BAAI/bge-large-zh-v1.5"):
self.model = SentenceTransformer(model_name)
def encode(self, texts, batch_size=32):
if isinstance(texts, str):
texts = "为这个句子生成表示以用于检索相关文章:" + texts
embeddings = self.model.encode(texts, batch_size=batch_size, normalize_embeddings=True)
return embeddings
def retrieve(self, query, index, top_k=20):
query_emb = self.encode(query)
distances, indices = index.search(np.array([query_emb], dtype=np.float32), top_k)
        return list(zip(indices[0], distances[0]))

Faiss index options:
IVF+PQ – offline batch construction, low memory, suitable for large static corpora. Example configuration: IVF4096, PQ64, ~400 MB for 8 × 10⁴ vectors, QPS in the thousands (a construction sketch follows after this list).
HNSW – higher memory, longer build time, but sub‑5 ms P99 latency, ideal for online services.
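For the IVF+PQ option, a minimal construction sketch using the Faiss index factory; the factory string mirrors the configuration above, while build_ivfpq_index, the training strategy, and nprobe = 32 are illustrative assumptions rather than the original setup:

import faiss
import numpy as np

def build_ivfpq_index(embeddings, dim=1024, factory_string="IVF4096,PQ64"):
    """IVF+PQ must be trained before adding vectors (unlike HNSW)."""
    index = faiss.index_factory(dim, factory_string)
    data = np.array(embeddings, dtype=np.float32)
    index.train(data)   # learn coarse centroids and PQ codebooks
    index.add(data)     # store the compressed vectors
    index.nprobe = 32   # inverted lists probed per query: higher = better recall, slower
    return index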
HNSW index building code:
import faiss
import numpy as np
def build_hnsw_index(embeddings, dim=1024, M=32, ef_construction=200):
"""M: max connections per node; larger M → higher accuracy, more memory.
ef_construction: search scope during build; larger → higher accuracy, slower build.
"""
index = faiss.IndexHNSWFlat(dim, M)
index.hnsw.efConstruction = ef_construction
index.hnsw.efSearch = 64 # runtime search scope
index.add(np.array(embeddings, dtype=np.float32))
    return index

RRF Fusion Algorithm: Where Does k=60 Come From?
Reciprocal Rank Fusion (RRF) merges two ranked lists without needing to normalize their raw scores. The core formula is RRF_score(d) = sum_i 1 / (k + rank_i(d)), where rank_i(d) is the 1‑based rank of document d in list i; a document absent from a list simply contributes nothing for that list. A direct Python implementation:
def rrf_fusion(bm25_results, vector_results, k=60):
"""bm25_results / vector_results are ordered doc_id lists.
k is the smoothing parameter (default 60).
Returns a list of (doc_id, fused_score) sorted descending.
"""
scores = {}
for rank, doc_id in enumerate(bm25_results):
scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
for rank, doc_id in enumerate(vector_results):
scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

The original 2009 SIGIR paper (Cormack, Clarke, and Büttcher, "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods") experimented with many k values across several IR benchmarks and found that k = 60 yields the most stable performance, with a relatively flat curve between k = 40 and k = 80 (differences < 1 %). Smaller k values over‑emphasize top ranks, while very large k values flatten the ranking too much.
Illustrative score distribution for different k values:
def compare_k_values(top_n=10):
k_values = [1, 10, 60, 100, 1000]
    print(f"{'rank':<6}" + "".join([f"{'k=' + str(k):<12}" for k in k_values]))
for rank in range(top_n):
print(f"{rank+1:<6}" + "".join([f"{1/(k+rank+1):<12.5f}" for k in k_values]))
compare_k_values()

Output (excerpt):
rank  k=1         k=10        k=60        k=100       k=1000
1     0.50000     0.09091     0.01639     0.00990     0.00100
2     0.33333     0.08333     0.01613     0.00980     0.00100
...
10    0.09091     0.05000     0.01429     0.00909     0.00099

Why Use RRF Instead of Weighted Sum?
Weighted sum requires normalizing BM25 scores (often ranging 0.1–50) and vector similarities (0–1), introducing extra hyper‑parameters and instability across model updates. RRF sidesteps this by operating solely on ranks, making it robust to batch‑to‑batch score shifts.
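For contrast, a score‑level weighted sum would look roughly like the sketch below; the min‑max normalization and the weight w are precisely the extra knobs that RRF avoids (illustrative only, not the production code):

def weighted_sum_fusion(bm25_scores, vector_scores, w=0.5):
    """bm25_scores / vector_scores: dicts mapping doc_id -> raw score."""
    def min_max(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {d: (s - lo) / (hi - lo + 1e-9) for d, s in scores.items()}
    nb, nv = min_max(bm25_scores), min_max(vector_scores)
    fused = {d: w * nb.get(d, 0.0) + (1 - w) * nv.get(d, 0.0) for d in set(nb) | set(nv)}
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

Every time the embedding model or BM25 parameters change, the score ranges shift and w must be re‑tuned, which is exactly the maintenance burden that rank‑based fusion removes.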
Weight Tuning: How to Adjust Weights If Needed?
When business needs dictate a bias toward one modality, a weighted RRF can be used:
def weighted_rrf_fusion(bm25_results, vector_results, k=60, alpha=0.5):
"""alpha: weight for BM25 (0‑1). alpha=0.5 equals standard RRF.
"""
scores = {}
for rank, doc_id in enumerate(bm25_results):
scores[doc_id] = scores.get(doc_id, 0) + alpha * (1 / (k + rank + 1))
for rank, doc_id in enumerate(vector_results):
scores[doc_id] = scores.get(doc_id, 0) + (1 - alpha) * (1 / (k + rank + 1))
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

We performed a grid search on a 500‑query validation set (covering exact, semantic, and mixed queries) to find the optimal alpha. The best result was alpha = 0.4, giving a Recall@5 of 0.8921, indicating a slight preference for the vector side.
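The grid‑search loop below also calls an evaluate_recall_at_k helper that is not shown in the original snippet; a minimal sketch, assuming ground_truth maps each query_id to the set of relevant doc_ids:

def evaluate_recall_at_k(fusion_results, ground_truth, k=5):
    """Mean Recall@k: share of relevant docs that appear in the top-k fused results."""
    recalls = []
    for query_id, ranked in fusion_results.items():
        relevant = ground_truth[query_id]              # set of relevant doc_ids
        top_ids = {doc_id for doc_id, _ in ranked[:k]}
        recalls.append(len(top_ids & relevant) / len(relevant))
    return sum(recalls) / len(recalls)

The alpha sweep itself: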
import numpy as np
from tqdm import tqdm
best_alpha = 0.5
best_recall = 0.0
alpha_range = np.arange(0.1, 1.0, 0.1)
for alpha in tqdm(alpha_range):
fusion_results = {}
for query_id, query in validation_queries.items():
bm25_res = bm25_retriever.retrieve(query, top_k=20)
vector_res = vector_retriever.retrieve(query, faiss_index, top_k=20)
bm25_ids = [idx for idx, _ in bm25_res]
vector_ids = [idx for idx, _ in vector_res]
fusion_results[query_id] = weighted_rrf_fusion(bm25_ids, vector_ids, k=60, alpha=alpha)
recall = evaluate_recall_at_k(fusion_results, ground_truth, k=5)
if recall > best_recall:
best_recall = recall
best_alpha = alpha
print(f"Best alpha: {best_alpha:.1f}, Recall@5: {best_recall:.4f}")

Even for very small corpora (a few hundred documents), a quick grid search can confirm whether k = 60 remains optimal; in our 5,000‑doc test it was within 0.3 % of the best alternative values.
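The same sweep pattern works for k itself; a quick illustrative sketch reusing the helpers above (the candidate values are arbitrary):

for k in [20, 40, 60, 80, 100]:
    fusion_results = {
        query_id: weighted_rrf_fusion(
            [idx for idx, _ in bm25_retriever.retrieve(query, top_k=20)],
            [idx for idx, _ in vector_retriever.retrieve(query, faiss_index, top_k=20)],
            k=k, alpha=best_alpha)
        for query_id, query in validation_queries.items()
    }
    print(k, round(evaluate_recall_at_k(fusion_results, ground_truth, k=5), 4))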
Practical Impact: What Does Hybrid Retrieval Achieve?
Before deployment, an A/B test on the internal benchmark showed:
Overall Recall@5 increased from 0.72 to 0.89 (+17 percentage points).
Semantic queries (e.g., "how to report an accident") rose from 0.79 to 0.85 (+6 points).
Exact‑term queries (e.g., "exemption clause 3 of critical illness insurance") jumped from 0.51 to 0.78 (+27 points).
Domain‑specific term queries (e.g., "Ping An Health Insurance 2023 waiting period") improved from 0.48 to 0.73 (+25 points).
Latency impact was minimal: BM25 runs in milliseconds, HNSW adds ~5 ms P99, and RRF fusion adds < 1 ms, resulting in an end‑to‑end latency increase from ~7 ms to ~9 ms, well within user‑perception limits.
Full Hybrid Retrieval Implementation
class HybridRetriever:
def __init__(self, documents, embedding_model="BAAI/bge-large-zh-v1.5",
bm25_k1=1.5, bm25_b=0.75, rrf_k=60, alpha=0.5):
self.documents = documents
self.rrf_k = rrf_k
self.alpha = alpha
self.bm25_retriever = BM25Retriever(documents, k1=bm25_k1, b=bm25_b)
self.vector_retriever = VectorRetriever(embedding_model)
print("Building vector index…")
embeddings = self.vector_retriever.encode(documents)
self.faiss_index = build_hnsw_index(embeddings, dim=embeddings.shape[1])
print(f"Index built for {len(documents)} documents")
def retrieve(self, query, top_k=5, bm25_candidate=20, vector_candidate=20):
bm25_res = self.bm25_retriever.retrieve(query, top_k=bm25_candidate)
bm25_ids = [idx for idx, _ in bm25_res]
vector_res = self.vector_retriever.retrieve(query, self.faiss_index, top_k=vector_candidate)
vector_ids = [idx for idx, _ in vector_res]
fused = weighted_rrf_fusion(bm25_ids, vector_ids, k=self.rrf_k, alpha=self.alpha)
top = fused[:top_k]
        return [{"doc_id": doc_id, "content": self.documents[doc_id], "rrf_score": score} for doc_id, score in top]

How to Answer Hybrid Retrieval in an Interview?
Structure your response in four concise steps (≈30‑40 seconds total):
Background (≈5 s): State the problem – pure vector search missed exact‑term matches, yielding a 0.48 Recall@5 on insurance‑specific queries.
Solution Choice (≈10 s): Explain why you combined BM25 (exact match) with dense vectors (semantic) and used RRF (no score normalization, robust, k = 60 from the original paper).
Implementation Details (≈10 s): Mention top‑20 candidates per lane, RRF fusion, optional weighted RRF with α = 0.4 found via grid search on a 500‑query validation set.
Results (≈10 s): Quote the production numbers – Recall@5 rose from 0.72 to 0.89, domain‑specific queries improved by ~25 points, latency grew only ~2 ms.
This concise narrative demonstrates both theoretical understanding and practical impact.
Some Pitfalls
Chinese tokenization quality: Jieba's default dictionary misses many insurance terms. Adding a custom dictionary boosted BM25 recall by ~8 % (see the sketch after this list).
Candidate set size: Top‑20 per lane balances diversity and noise; too few candidates reduce RRF benefit, too many add irrelevant results.
Model query prefix: BGE models require a specific prefix for queries (e.g., "为这个句子生成表示以用于检索相关文章:"); omitting it drops recall by 3‑5 %.
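For the tokenization pitfall, a minimal sketch of loading a custom dictionary into jieba; the file name insurance_terms.txt and the example terms are assumptions, not the actual production dictionary:

import jieba

# One term per line, optionally followed by frequency and POS tag, e.g.:
#   平安健康险2023款 10 n
#   等待期 10 n
jieba.load_userdict("insurance_terms.txt")

# Terms can also be registered programmatically:
jieba.add_word("犹豫期", freq=10)

print(list(jieba.cut("平安健康险2023款的等待期是多久")))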
Summary
Hybrid retrieval addresses the complementary weaknesses of dense vector search (weak on exact terms) and BM25 (weak on semantics) by merging their ranked results with the simple yet effective RRF formula 1/(k + rank). The empirically chosen k = 60 offers a sweet spot between rank sensitivity and stability. In production, careful engineering of tokenization, candidate-set size, and model prompting is as crucial as the algorithm itself.