Why Real RAG Systems Need Both BM25 and Vector Search

The article analyzes how BM25 excels at exact token matching while vector embeddings capture semantic intent, explains their distinct failure modes, and shows that a hybrid retriever—combined with metadata filtering, proper chunking, and reciprocal rank fusion—delivers the most reliable results for RAG pipelines.


RAG Overview

Retrieval‑augmented generation (RAG) first selects relevant chunks and then generates answers; the retriever operates on chunks rather than whole documents.

Query Types Shape Retrieval

Technical queries fall into two categories: identifier searches that need precise matching (e.g., API endpoints, error codes) and conceptual searches that require semantic understanding (e.g., “how to fix a database crash”). The retriever must handle both.
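One practical consequence is that a pipeline can route or label queries before retrieval. A minimal heuristic sketch — the regex patterns and category names here are illustrative, not from the article:

```python
import re

# Illustrative heuristic: identifier-style queries tend to contain
# SCREAMING_SNAKE_CASE tokens, URL-style routes, or version numbers.
IDENTIFIER_PATTERN = re.compile(
    r"[A-Z0-9]{2,}_[A-Z0-9_]+"      # ENV_VAR style, e.g. AUTH_JWT_ROTATION_ENABLED
    r"|/[\w\-/{}]+"                 # API routes, e.g. /api/v2/users
    r"|\b\d+\.\d+\.\d+\b"           # semantic versions, e.g. 1.4.2
)

def classify_query(query: str) -> str:
    """Label a query 'identifier' or 'conceptual' with a crude regex test."""
    return "identifier" if IDENTIFIER_PATTERN.search(query) else "conceptual"
```

A real system might use such a signal to weight the lexical and semantic retrievers differently, rather than to pick only one.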

BM25 Strengths

BM25 is a lexical ranking function that scores documents based on term frequency and inverse document frequency. The parameters k1 (term‑frequency saturation) and b (length normalization) control how repeated terms and document length affect the score.
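The scoring rule can be written out directly. A minimal single-term sketch of one common Okapi BM25 variant (textbook form, not code from the article):

```python
import math

def bm25_term_score(tf: int, doc_len: int, avg_doc_len: float,
                    n_docs: int, doc_freq: int,
                    k1: float = 1.5, b: float = 0.75) -> float:
    """Contribution of one query term to one document's BM25 score.

    k1 caps how much repeated occurrences of a term can help (saturation);
    b controls how strongly long documents are penalized (length normalization).
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)
```

With b = 0 document length is ignored entirely; as k1 shrinks toward 0, repeated occurrences of a term stop adding score.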

import bm25s
import Stemmer

def build_bm25_retriever(corpus: list[str]) -> bm25s.BM25:
    # Stem and stop-word-filter the corpus, then build the BM25 index.
    stemmer = Stemmer.Stemmer("english")
    tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)
    retriever = bm25s.BM25(corpus=corpus)  # keep raw texts so retrieve() can return them
    retriever.index(tokens)
    return retriever

def bm25_search(retriever: bm25s.BM25, query: str, k: int = 5) -> list[dict]:
    # Tokenize the query with the same stemmer used at index time.
    stemmer = Stemmer.Stemmer("english")
    q_tokens = bm25s.tokenize(query, stemmer=stemmer)
    docs, scores = retriever.retrieve(q_tokens, k=k)  # arrays of shape (1, k)
    results = []
    for i in range(docs.shape[1]):
        results.append({"content": docs[0, i], "score": float(scores[0, i])})
    return results

BM25 shines on exact identifiers, environment variables, API routes, and version strings because it matches whole tokens exactly rather than decomposing them into sub-word pieces.

Vector Search Strengths

Vector retrieval uses embeddings to map chunks into a high‑dimensional space, enabling semantic similarity matching. Approximate nearest‑neighbor (ANN) search finds the closest vectors to the query embedding.

from qdrant_client import QdrantClient, models
from openai import OpenAI

def build_vector_index(chunks: list[dict], collection_name: str = "docs") -> QdrantClient:
    client = QdrantClient(":memory:")  # in-memory instance for demonstration
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    )
    # Embed all chunks in one batch; text-embedding-3-small yields 1536-dim vectors.
    oai = OpenAI()
    texts = [c["content"] for c in chunks]
    resp = oai.embeddings.create(input=texts, model="text-embedding-3-small")
    vectors = [e.embedding for e in resp.data]
    points = []
    for i, chunk in enumerate(chunks):
        # Store the raw text plus any metadata tags in the point payload.
        points.append(models.PointStruct(
            id=i,
            vector=vectors[i],
            payload={"content": chunk["content"], **chunk.get("meta", {})},
        ))
    client.upsert(collection_name=collection_name, points=points)
    return client

def vector_search(client: QdrantClient, query: str, collection_name: str = "docs", k: int = 5) -> list[dict]:
    # Embed the query with the same model used at index time.
    oai = OpenAI()
    q_vec = oai.embeddings.create(input=[query], model="text-embedding-3-small").data[0].embedding
    hits = client.query_points(collection_name=collection_name, query=q_vec, limit=k).points
    results = []
    for hit in hits:
        results.append({"content": hit.payload["content"], "score": hit.score, "id": hit.id})
    return results

Vector search excels at semantic queries, handling paraphrases and concepts that lack exact keyword overlap.

Different Failure Modes

BM25 fails on abstract or paraphrased questions because the exact terms are missing; vector search fails on precise identifiers because sub‑word tokenization dilutes exact matches. For example, the token AUTH_JWT_ROTATION_ENABLED is split into sub‑tokens, losing the exact string, while BM25 would match it directly.
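This asymmetry is easy to demonstrate with a toy tokenizer. Real embedding models use learned BPE or WordPiece vocabularies; the splitter below is only a stand-in that mimics their effect on identifiers:

```python
import re

def lexical_tokens(text: str) -> list[str]:
    # BM25-style tokenizer: split on whitespace/punctuation but keep
    # underscore-joined identifiers as single tokens.
    return re.findall(r"[A-Za-z0-9_]+", text)

def subword_tokens(text: str) -> list[str]:
    # Stand-in for a learned sub-word tokenizer (real models use BPE/WordPiece):
    # identifiers are broken apart at underscores and case boundaries.
    pieces = []
    for token in re.findall(r"[A-Za-z0-9_]+", text):
        pieces.extend(p.lower() for p in re.split(r"_|(?<=[a-z])(?=[A-Z])", token) if p)
    return pieces

doc = "Set AUTH_JWT_ROTATION_ENABLED to true."
# The exact string survives lexical tokenization...
print("AUTH_JWT_ROTATION_ENABLED" in lexical_tokens(doc))   # True
# ...but disappears after sub-word splitting.
print("AUTH_JWT_ROTATION_ENABLED" in subword_tokens(doc))   # False
```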

Hybrid Search Mechanics

Hybrid retrieval runs BM25 and vector search independently, each producing a top‑K list, then merges them using Reciprocal Rank Fusion (RRF). RRF ignores raw scores and combines rankings, with a constant k (commonly 60) to smooth the contribution of each rank.

def reciprocal_rank_fusion(*result_lists: list[dict], k: int = 60, top_n: int = 5) -> list[dict]:
    scores = {}
    best_docs = {}
    for results in result_lists:
        for rank, result in enumerate(results):
            doc_id = result.get("id", result["content"][:100])
            if doc_id not in scores:
                scores[doc_id] = 0.0
            scores[doc_id] += 1.0 / (k + rank + 1)
            if doc_id not in best_docs:
                best_docs[doc_id] = result
    ranked_ids = sorted(scores, key=scores.get, reverse=True)[:top_n]
    final_results = []
    for doc_id in ranked_ids:
        doc = best_docs[doc_id].copy()
        doc["rrf_score"] = scores[doc_id]
        final_results.append(doc)
    return final_results

Hybrid search is not a simple concatenation; proper fusion is essential to avoid noisy results.
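A small self-contained demo makes the point concrete. This restates the fusion logic in compact form and uses made-up document IDs; note that raw BM25 scores (often ~10+) and cosine similarities (≤ 1) live on incompatible scales, so sorting a concatenated list by raw score would put every BM25 hit first:

```python
def rrf(*result_lists: list[dict], k: int = 60, top_n: int = 5) -> list[dict]:
    # Compact restatement of reciprocal_rank_fusion above: only ranks matter.
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, result in enumerate(results):
            scores[result["id"]] = scores.get(result["id"], 0.0) + 1.0 / (k + rank + 1)
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [{"id": doc_id, "rrf_score": scores[doc_id]} for doc_id in ranked]

# doc_a is ranked well by both retrievers, so it wins even though
# neither retriever put it in first place by itself.
bm25_hits   = [{"id": "doc_a"}, {"id": "doc_b"}, {"id": "doc_c"}]
vector_hits = [{"id": "doc_c"}, {"id": "doc_a"}, {"id": "doc_d"}]
fused = rrf(bm25_hits, vector_hits, top_n=3)
print([d["id"] for d in fused])  # → ['doc_a', 'doc_c', 'doc_b']
```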

Metadata Filtering

Even with perfect lexical and semantic signals, missing metadata can return wrong environment documents (e.g., production instead of staging). Adding tags such as service, environment, document type, version, and owner lets the retriever filter candidates before scoring.
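Vector stores like Qdrant expose native filter objects for this; as an engine-agnostic sketch, a pre-scoring filter can be written in a few lines (the field names here are illustrative):

```python
def filter_chunks(chunks: list[dict], **required: str) -> list[dict]:
    """Keep only chunks whose metadata matches every required tag.

    Field names (service, environment, ...) are illustrative; production
    systems push this filter into the vector store / index itself rather
    than filtering in application code.
    """
    return [c for c in chunks
            if all(c.get("meta", {}).get(field) == value
                   for field, value in required.items())]

chunks = [
    {"content": "Staging rollout steps", "meta": {"service": "auth", "environment": "staging"}},
    {"content": "Production runbook",    "meta": {"service": "auth", "environment": "production"}},
]
candidates = filter_chunks(chunks, service="auth", environment="staging")
```

Filtering before scoring means the production runbook can never outrank the staging document for a staging query, no matter how similar the text is.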

Chunking Strategy

Chunk size influences both retrievers: very small chunks reduce BM25 term frequency benefits, while very large chunks dilute vector semantics. Structure‑aware chunking (e.g., keeping headings, code blocks, warnings together) mitigates these issues.
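One way to implement structure-aware chunking for markdown sources is to split at headings while refusing to split inside fenced code blocks, so code stays attached to the prose that explains it. A minimal sketch (assuming markdown input; real chunkers also handle size caps and overlap):

```python
FENCE = "`" * 3  # the ``` fence marker, written indirectly for readability

def chunk_by_heading(markdown: str) -> list[str]:
    """Split markdown at '#' headings, but never inside a fenced code block."""
    chunks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if line.startswith("#") and not in_fence and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```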

Evaluating Hybrid Retrieval

Evaluations must include identifier, conceptual, and mixed query buckets. The following benchmark script measures top‑K hit rates per retriever and per query type.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    query: str
    expected_substring: str
    query_type: str  # "identifier", "conceptual", or "mixed"

def evaluate_retrievers(cases: list[EvalCase], retrievers: dict[str, Callable], k: int = 5) -> dict:
    report = {name: {"total_hits": 0, "by_type": {}} for name in retrievers}
    for case in cases:
        for name, search_fn in retrievers.items():
            results = search_fn(case.query, k=k)
            top_contents = [r["content"] for r in results]
            found = any(case.expected_substring in content for content in top_contents)
            report[name]["total_hits"] += int(found)
            q_type = case.query_type
            if q_type not in report[name]["by_type"]:
                report[name]["by_type"][q_type] = {"hits": 0, "total": 0}
            report[name]["by_type"][q_type]["total"] += 1
            report[name]["by_type"][q_type]["hits"] += int(found)
    total_cases = len(cases)
    for name in report:
        report[name]["overall_hit_rate"] = report[name]["total_hits"] / total_cases if total_cases else 0
        for q_type, stats in report[name]["by_type"].items():
            stats["hit_rate"] = stats["hits"] / stats["total"] if stats["total"] else 0
    return report

Real‑world corpora show BM25 dominates identifier queries, vectors dominate conceptual queries, and hybrid consistently covers both.
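A quick way to sanity-check such a harness is to run it against stub retrievers with known behavior. The sketch below restates the harness compactly; the two-document corpus and stub scoring rules are made up for the demo:

```python
from dataclasses import dataclass

@dataclass
class Case:
    query: str
    expected: str
    qtype: str  # "identifier" or "conceptual"

def hit_rates(cases, retrievers, k=5):
    # Per-retriever, per-query-type top-k hit counts (compact version of
    # the evaluate_retrievers harness above).
    report = {name: {} for name in retrievers}
    for case in cases:
        for name, search in retrievers.items():
            hit = any(case.expected in r["content"] for r in search(case.query, k=k))
            bucket = report[name].setdefault(case.qtype, {"hits": 0, "total": 0})
            bucket["total"] += 1
            bucket["hits"] += int(hit)
    return report

# Stub retrievers: one only matches literal substrings, the other only
# "understands" a paraphrase. Real retrievers would replace both.
corpus = ["Set AUTH_JWT_ROTATION_ENABLED=true", "Restart the database after a crash"]
def lexical_stub(query, k=5):
    return [{"content": d} for d in corpus if any(t in d for t in query.split())]
def semantic_stub(query, k=5):
    return [{"content": corpus[1]}] if "recover" in query else []

cases = [
    Case("AUTH_JWT_ROTATION_ENABLED", "AUTH_JWT_ROTATION_ENABLED", "identifier"),
    Case("how to recover from an outage", "Restart the database", "conceptual"),
]
report = hit_rates(cases, {"lexical": lexical_stub, "semantic": semantic_stub})
```

As expected, the lexical stub hits only the identifier case and the semantic stub only the conceptual one — exactly the complementary pattern a hybrid retriever exploits.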

Reranking the Hybrid Output

After fusion, the candidate list may still contain noisy chunks. Because language models have token limits, a second‑stage reranker (cross‑encoders or LLM‑based) re‑scores the merged list to surface the most useful chunks before prompting.
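In production this second stage is typically a cross-encoder model or an LLM judge; as a structural sketch only, the placeholder scorer below uses query-term coverage, and the token budget is a crude word count:

```python
def rerank_and_truncate(query: str, candidates: list[dict],
                        score_fn=None, token_budget: int = 512) -> list[dict]:
    """Second-stage rerank, then keep only what fits the prompt budget.

    score_fn stands in for a cross-encoder or LLM judge; the default
    query-term-coverage scorer is only a placeholder for this sketch.
    """
    if score_fn is None:
        def score_fn(q: str, text: str) -> float:
            terms = q.lower().split()
            return sum(t in text.lower() for t in terms) / max(len(terms), 1)
    ranked = sorted(candidates, key=lambda c: score_fn(query, c["content"]), reverse=True)
    kept, used = [], 0
    for cand in ranked:
        cost = len(cand["content"].split())  # crude word-count token estimate
        if used + cost > token_budget:
            break
        kept.append(cand)
        used += cost
    return kept
```

Swapping in a real scorer (e.g. a cross-encoder's relevance logit) changes only `score_fn`; the rank-then-truncate shape stays the same.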

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by DeepHub IMBA