Why Real RAG Systems Need Both BM25 and Vector Search
The article analyzes how BM25 excels at exact token matching while vector embeddings capture semantic intent, explains their distinct failure modes, and shows that a hybrid retriever—combined with metadata filtering, proper chunking, and reciprocal rank fusion—delivers the most reliable results for RAG pipelines.
RAG Overview
Retrieval‑augmented generation (RAG) first selects relevant chunks and then generates answers; the retriever operates on chunks rather than whole documents.
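To make the loop concrete, here is a minimal retrieve-then-generate sketch (assuming a generic search callable like the retrievers below and the OpenAI chat API; the model name is only an example):

from openai import OpenAI

def rag_answer(question: str, search, k: int = 5) -> str:
    # Retrieve: `search` is any retriever returning [{"content": ...}, ...].
    chunks = search(question, k=k)
    context = "\n\n".join(c["content"] for c in chunks)
    # Generate: the model is instructed to answer only from retrieved chunks.
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model; any chat model works here
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content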
Query Types Shape Retrieval
Technical queries fall into two categories: identifier searches that need precise matching (e.g., API endpoints, error codes) and conceptual searches that require semantic understanding (e.g., “how to fix a database crash”). The retriever must handle both.
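One rough way to see the split (an illustrative heuristic, not from the source) is to flag identifier-style queries by their surface form and weight the retrievers accordingly:

import re

def looks_like_identifier_query(query: str) -> bool:
    # Heuristic: UPPER_SNAKE_CASE tokens, slashed API paths, or version
    # strings suggest an exact-match lookup rather than a conceptual question.
    patterns = [
        r"\b[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+\b",  # e.g. AUTH_JWT_ROTATION_ENABLED
        r"/[\w\-]+(?:/[\w\-{}]+)+",            # e.g. /api/v2/users/{id}
        r"\bv?\d+\.\d+(?:\.\d+)?\b",           # e.g. 3.11.4
    ]
    return any(re.search(p, query) for p in patterns)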
BM25 Strengths
BM25 is a lexical ranking function that scores documents based on term frequency and inverse document frequency. The parameters k1 (term‑frequency saturation) and b (length normalization) control how repeated terms and document length affect the score.
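For reference, the classic Okapi BM25 score of a document D for a query Q is

score(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

where f(q_i, D) is the frequency of term q_i in D, |D| is the document length, and avgdl is the average document length across the corpus. Raising k1 lets repeated terms keep contributing longer before saturating; b in [0, 1] controls how strongly long documents are penalized.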
import bm25s
import Stemmer
def build_bm25_retriever(corpus: list[str]) -> bm25s.BM25:
    # Stem and stop-word-filter the corpus before indexing.
    stemmer = Stemmer.Stemmer("english")
    tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)
    retriever = bm25s.BM25(corpus=corpus)
    retriever.index(tokens)
    return retriever

def bm25_search(retriever: bm25s.BM25, query: str, k: int = 5) -> list[dict]:
    # Tokenize the query exactly the way the corpus was tokenized.
    stemmer = Stemmer.Stemmer("english")
    q_tokens = bm25s.tokenize(query, stemmer=stemmer)
    docs, scores = retriever.retrieve(q_tokens, k=k)
    results = []
    for i in range(docs.shape[1]):
        results.append({"content": docs[0, i], "score": float(scores[0, i])})
    return results

BM25 shines on exact identifiers, environment variables, API routes, and version strings because it matches tokens literally: a rare identifier either appears in a chunk or it does not.
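A quick usage sketch (the two documents are hypothetical):

corpus = [
    "Set AUTH_JWT_ROTATION_ENABLED=true to rotate signing keys.",
    "Recovery steps for Postgres after an unclean shutdown.",
]
retriever = build_bm25_retriever(corpus)
print(bm25_search(retriever, "AUTH_JWT_ROTATION_ENABLED", k=1))
# BM25 keeps the identifier as one literal token, so the config document ranks first.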
Vector Search Strengths
Vector retrieval uses embeddings to map chunks into a high‑dimensional space, enabling semantic similarity matching. Approximate nearest‑neighbor (ANN) search finds the closest vectors to the query embedding.
from qdrant_client import QdrantClient, models
from openai import OpenAI
def build_vector_index(chunks: list[dict], collection_name: str = "docs") -> QdrantClient:
    # In-memory Qdrant instance; swap for a real deployment in production.
    client = QdrantClient(":memory:")
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    )
    oai = OpenAI()
    texts = [c["content"] for c in chunks]
    resp = oai.embeddings.create(input=texts, model="text-embedding-3-small")
    vectors = [e.embedding for e in resp.data]
    points = []
    for i, chunk in enumerate(chunks):
        # Store the chunk text plus any metadata in the point payload.
        points.append(
            models.PointStruct(
                id=i,
                vector=vectors[i],
                payload={"content": chunk["content"], **chunk.get("meta", {})},
            )
        )
    client.upsert(collection_name=collection_name, points=points)
    return client

def vector_search(client: QdrantClient, query: str, collection_name: str = "docs", k: int = 5) -> list[dict]:
    # Embed the query with the same model used for the corpus.
    oai = OpenAI()
    q_vec = oai.embeddings.create(input=[query], model="text-embedding-3-small").data[0].embedding
    hits = client.query_points(collection_name=collection_name, query=q_vec, limit=k).points
    results = []
    for hit in hits:
        results.append({"content": hit.payload["content"], "score": hit.score, "id": hit.id})
    return results

Vector search excels at semantic queries, handling paraphrases and concepts that lack exact keyword overlap.
Different Failure Modes
BM25 fails on abstract or paraphrased questions because the exact terms are missing; vector search fails on precise identifiers because sub‑word tokenization dilutes exact matches. For example, the token AUTH_JWT_ROTATION_ENABLED is split into sub‑tokens, losing the exact string, while BM25 would match it directly.
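The effect is easy to demonstrate with a BPE tokenizer (a sketch using tiktoken; the exact split depends on the embedding model's own tokenizer):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("AUTH_JWT_ROTATION_ENABLED")
print([enc.decode([i]) for i in ids])
# Several sub-word fragments come back, not one atomic token,
# so the embedding never sees the identifier as a single unit.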
Hybrid Search Mechanics
Hybrid retrieval runs BM25 and vector search independently, each producing a top‑K list, then merges them using Reciprocal Rank Fusion (RRF). RRF ignores raw scores and combines rankings, with a constant k (commonly 60) to smooth the contribution of each rank.
def reciprocal_rank_fusion(*result_lists: list[dict], k: int = 60, top_n: int = 5) -> list[dict]:
    scores = {}
    best_docs = {}
    for results in result_lists:
        for rank, result in enumerate(results):
            doc_id = result.get("id", result["content"][:100])
            if doc_id not in scores:
                scores[doc_id] = 0.0
            scores[doc_id] += 1.0 / (k + rank + 1)
            if doc_id not in best_docs:
                best_docs[doc_id] = result
    ranked_ids = sorted(scores, key=scores.get, reverse=True)[:top_n]
    final_results = []
    for doc_id in ranked_ids:
        doc = best_docs[doc_id].copy()
        doc["rrf_score"] = scores[doc_id]
        final_results.append(doc)
    return final_results

Hybrid search is not a simple concatenation of result lists; proper rank fusion is essential to avoid noisy results.
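Tying the helpers above together, a hybrid entry point (a sketch assuming the functions defined earlier) looks like this:

def hybrid_search(query: str, bm25_retriever, qdrant_client, k: int = 5) -> list[dict]:
    # Each retriever produces its own top-K list over the same corpus.
    lexical = bm25_search(bm25_retriever, query, k=k)
    semantic = vector_search(qdrant_client, query, k=k)
    # RRF merges the two rankings without comparing their raw scores.
    return reciprocal_rank_fusion(lexical, semantic, top_n=k)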
Metadata Filtering
Even with perfect lexical and semantic signals, a retriever that ignores metadata can surface documents from the wrong environment (e.g., production instead of staging). Adding tags such as service, environment, document type, version, and owner lets the retriever filter candidates before scoring, as in the sketch below.
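In Qdrant, for example, a payload filter restricts which points are scored at all (a sketch assuming the payload fields indexed above; the filter is applied before similarity ranking):

from qdrant_client import models

env_filter = models.Filter(
    must=[
        models.FieldCondition(key="environment", match=models.MatchValue(value="staging")),
        models.FieldCondition(key="service", match=models.MatchValue(value="auth")),
    ]
)
hits = client.query_points(
    collection_name="docs",
    query=q_vec,              # query embedding, as in vector_search above
    query_filter=env_filter,  # only staging auth-service chunks compete
    limit=5,
).points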
Chunking Strategy
Chunk size influences both retrievers: very small chunks reduce BM25 term frequency benefits, while very large chunks dilute vector semantics. Structure‑aware chunking (e.g., keeping headings, code blocks, warnings together) mitigates these issues.
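A rough structure-aware splitter (illustrative only; it assumes Markdown-style sources) starts a new chunk at each heading and refuses to split inside fenced code blocks:

def chunk_by_structure(markdown: str, max_chars: int = 1500) -> list[str]:
    chunks, current = [], []
    in_code = False
    for line in markdown.splitlines():
        if line.startswith("```"):
            in_code = not in_code  # track fenced code so it stays intact
        if line.startswith("#") and not in_code and current:
            chunks.append("\n".join(current))  # a new heading opens a new chunk
            current = []
        current.append(line)
        if sum(len(l) for l in current) > max_chars and not in_code:
            chunks.append("\n".join(current))  # size cap, but never mid-code
            current = []
    if current:
        chunks.append("\n".join(current))
    return chunks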
Evaluating Hybrid Retrieval
Evaluations must include identifier, conceptual, and mixed query buckets. The following benchmark script measures top‑K hit rates per retriever and per query type.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    query: str
    expected_substring: str
    query_type: str  # "identifier", "conceptual", or "mixed"

def evaluate_retrievers(cases: list[EvalCase], retrievers: dict[str, Callable], k: int = 5) -> dict:
    report = {name: {"total_hits": 0, "by_type": {}} for name in retrievers}
    for case in cases:
        for name, search_fn in retrievers.items():
            results = search_fn(case.query, k=k)
            top_contents = [r["content"] for r in results]
            # A hit means the expected substring appears in any top-K chunk.
            found = any(case.expected_substring in content for content in top_contents)
            report[name]["total_hits"] += int(found)
            q_type = case.query_type
            if q_type not in report[name]["by_type"]:
                report[name]["by_type"][q_type] = {"hits": 0, "total": 0}
            report[name]["by_type"][q_type]["total"] += 1
            report[name]["by_type"][q_type]["hits"] += int(found)
    total_cases = len(cases)
    for name in report:
        report[name]["overall_hit_rate"] = report[name]["total_hits"] / total_cases if total_cases else 0
        for q_type, stats in report[name]["by_type"].items():
            stats["hit_rate"] = stats["hits"] / stats["total"] if stats["total"] else 0
    return report

Real‑world corpora show BM25 dominates identifier queries, vectors dominate conceptual queries, and hybrid consistently covers both.
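A usage sketch with hypothetical cases and the retrievers built earlier:

cases = [
    EvalCase("AUTH_JWT_ROTATION_ENABLED default", "AUTH_JWT_ROTATION_ENABLED", "identifier"),
    EvalCase("how to recover after a database crash", "Recovery steps", "conceptual"),
]
retrievers = {
    "bm25": lambda q, k=5: bm25_search(retriever, q, k=k),
    "vector": lambda q, k=5: vector_search(client, q, k=k),
    "hybrid": lambda q, k=5: hybrid_search(q, retriever, client, k=k),
}
print(evaluate_retrievers(cases, retrievers, k=5))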
Reranking the Hybrid Output
After fusion, the candidate list may still contain noisy chunks. Because language models have token limits, a second‑stage reranker (cross‑encoders or LLM‑based) re‑scores the merged list to surface the most useful chunks before prompting.
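One common choice is a cross-encoder (a sketch using sentence-transformers; the model name is only an example), which reads query and chunk together and so captures interactions that independent embeddings miss:

from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    pairs = [(query, c["content"]) for c in candidates]
    scores = model.predict(pairs)  # one relevance score per (query, chunk) pair
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [dict(c, rerank_score=float(s)) for c, s in ranked[:top_n]]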