Boost RAG Retrieval Accuracy with Contextual Embeddings and BM25

This article presents contextual retrieval, a technique that combines contextual embeddings with contextual BM25 to cut RAG retrieval miss rates by up to 67%. It explains the underlying methods, implementation steps, cost considerations, experimental results, and practical deployment guidance.

JavaEdge

Introduction

To make AI models useful in specific domains, they need access to background knowledge—for example, a customer‑service chatbot must understand the business it serves, and a legal‑analysis bot must know past cases.

Developers typically extend model knowledge with Retrieval‑Augmented Generation (RAG), which retrieves relevant information from a knowledge base and appends it to the prompt. Traditional RAG loses context when encoding documents, leading to missed retrievals.

This article introduces contextual retrieval, which reduces miss rates by 49% (and by 67% when combined with re‑ranking) through two sub‑techniques:

Contextual Embeddings

Contextual BM25

Both sub‑techniques are straightforward to deploy by following the steps described below.

When Long Prompts Suffice

If a knowledge base is smaller than about 200k tokens (roughly 500 pages), the entire corpus can be placed directly in the prompt, avoiding RAG altogether. Claude’s prompt‑caching feature further reduces the latency and cost of resending that large prefix on every call.
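As a rough sketch of this long‑context alternative, assuming the Anthropic Python SDK (the model name, file name, and question are illustrative placeholders):

```python
# Minimal sketch: place the whole corpus in the prompt and cache it with
# prompt caching, so the large prefix is not re-billed at full price on
# every call. Assumes ANTHROPIC_API_KEY is set in the environment.
import anthropic

client = anthropic.Anthropic()
corpus = open("knowledge_base.txt").read()  # keep under ~200k tokens

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative model name
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": f"Answer questions using this knowledge base:\n\n{corpus}",
        "cache_control": {"type": "ephemeral"},  # cache the large prefix
    }],
    messages=[{"role": "user", "content": "How do I reset an ACME device?"}],
)
print(response.content[0].text)
```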

1. RAG Overview for Large Knowledge Bases

RAG processes large corpora by:

Splitting the corpus into small text chunks (a few hundred tokens each).

Encoding each chunk with an embedding model.

Storing embeddings in a vector database for semantic similarity search.

At query time, the vector store returns the most semantically similar chunks, which are added to the prompt.

Pure embeddings may miss exact term matches. BM25, a classic TF‑IDF‑based ranking function, excels at exact lexical matches (e.g., error code "TS‑999"). Combining embeddings with BM25 improves recall.
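For reference, BM25 scores a document D against a query Q = (q_1, …, q_n) with the standard Okapi formula:

\[
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
\]

where f(q_i, D) is the frequency of term q_i in D, |D| is the document length, avgdl is the average document length in the corpus, and k_1 and b are free parameters (commonly k_1 ≈ 1.2–2.0 and b = 0.75). Because the score only rewards literal occurrences of the query terms, an identifier like "TS‑999" is matched exactly rather than approximately.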

Standard RAG pipeline with both techniques:

Chunk the corpus.

Create TF‑IDF encodings and semantic embeddings for each chunk.

Use BM25 for exact‑match retrieval.

Use embeddings for semantic retrieval.

Fuse results with a ranking‑fusion step and deduplicate.

Add the top‑K chunks to the prompt.
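A minimal sketch of this hybrid pipeline, assuming the rank_bm25 and sentence-transformers packages and using reciprocal rank fusion (one common choice for the fusion step); the chunks, model name, and RRF constant are illustrative, not the article's exact setup:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

chunks = [
    "Revenue grew 3% over the previous quarter.",
    "Error code TS-999 indicates a failed handshake.",
]
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

bm25 = BM25Okapi([c.lower().split() for c in chunks])            # lexical index
chunk_vecs = embed_model.encode(chunks, convert_to_tensor=True)  # semantic index

def retrieve(query: str, top_k: int = 20, k_rrf: int = 60) -> list[str]:
    # Rank every chunk lexically (BM25) and semantically (cosine similarity).
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_rank = sorted(range(len(chunks)), key=lambda i: -bm25_scores[i])
    sims = util.cos_sim(embed_model.encode(query, convert_to_tensor=True), chunk_vecs)[0]
    sem_rank = sorted(range(len(chunks)), key=lambda i: -float(sims[i]))

    # Reciprocal rank fusion: sum 1/(k + rank) over both rankings;
    # accumulating into one dict also deduplicates chunks found by both.
    fused: dict[int, float] = {}
    for ranking in (bm25_rank, sem_rank):
        for pos, idx in enumerate(ranking):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (k_rrf + pos + 1)
    best = sorted(fused, key=fused.__getitem__, reverse=True)[:top_k]
    return [chunks[i] for i in best]
```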

The Context Problem in Traditional RAG

When chunks are too small, they may lack sufficient context. For example, a financial document might contain the sentence "Revenue grew 3%" without specifying the company or quarter, making accurate retrieval difficult.

2. Introducing Contextual Retrieval

Contextual retrieval adds explanatory context to each chunk before embedding ("contextual embeddings") and builds a BM25 index on the enriched text ("contextual BM25").

Example transformation:

Original chunk: "Revenue grew 3%."

Contextualized chunk: "This chunk comes from ACME Corp's 2023 Q2 SEC filing; the prior quarter's revenue was $314M. Revenue grew 3%."

Claude 3 Haiku is used to generate the contextual text automatically via the following prompt:

```
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
```

The generated context (≈50‑100 tokens) is prepended to each chunk before embedding and BM25 indexing.
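A minimal sketch of this generation step, assuming the Anthropic Python SDK; the cache_control block follows Anthropic's published prompt‑caching pattern so the full document is reused cheaply across all of its chunks:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY in the environment

def contextualize(document: str, chunk: str) -> str:
    """Return the chunk with a short LLM-generated situating context prepended."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,  # the generated context is typically 50-100 tokens
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"<document>{document}</document>",
                    "cache_control": {"type": "ephemeral"},  # cache the document
                },
                {
                    "type": "text",
                    "text": (
                        "Here is the chunk we want to situate within the whole "
                        f"document\n<chunk>{chunk}</chunk>\n"
                        "Please give a short succinct context to situate this "
                        "chunk within the overall document for the purposes of "
                        "improving search retrieval of the chunk. Answer only "
                        "with the succinct context and nothing else."
                    ),
                },
            ],
        }],
    )
    return response.content[0].text + " " + chunk  # prepend context to the chunk
```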

Cost: With Claude’s prompt‑caching, generating contextual chunks costs about $1.02 per million document tokens; at that rate, enriching a 10‑million‑token knowledge base costs roughly $10.

Methodology & Results

Experiments across codebases, novels, ArXiv papers, and scientific articles evaluated various embedding models and retrieval strategies, measuring recall@20 (reported here as its complement, the top‑20 miss rate). Using Gemini Text‑004 as the embedding model, contextual embeddings alone reduced the top‑20 miss rate by 35% (5.7% → 3.7%). Combining contextual embeddings with contextual BM25 lowered it by 49% (5.7% → 2.9%). Adding a re‑ranking step (Cohere’s reranker) brought the miss rate down to 1.9%, a 67% reduction.

Performance gains are consistent across domains and embedding models, with Voyage and Gemini performing best.

Implementation Considerations

Chunk boundaries: Size, overlap, and splitting strategy affect retrieval quality; a minimal chunking sketch follows this list.

Embedding model: While all models benefit, Gemini and Voyage show the strongest improvements.

Custom context prompts: Tailoring prompts to specific domains (e.g., adding a glossary) can further boost results.

Number of chunks: Experiments found 20 chunks per prompt optimal, but this should be tuned per use case.
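A minimal sketch of fixed‑size chunking with overlap, as mentioned under "Chunk boundaries" above; token counts are approximated by whitespace‑separated words here, and the sizes are illustrative assumptions:

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size words."""
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # last window reached the end
            break
    return chunks
```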

Always evaluate the impact of contextual retrieval on your downstream task.

3. Re‑ranking for Additional Gains

After the initial retrieval (e.g., the top 150 chunks), a re‑ranking model scores each chunk against the user query, and only the top K (e.g., 20) are passed to the generation model, reducing latency and cost. The steps:

Perform initial retrieval (top‑N).

Feed the N chunks and the query to a re‑ranking model.

Select the top‑K based on re‑ranking scores.

Provide the K chunks as context to the LLM.
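A minimal sketch of steps 2–4, assuming the Cohere Python SDK; the model name is an assumption and should be checked against Cohere's current documentation:

```python
import cohere

co = cohere.Client()  # assumes CO_API_KEY in the environment

def rerank(query: str, candidates: list[str], top_k: int = 20) -> list[str]:
    """Score the candidate chunks against the query and keep the top_k."""
    result = co.rerank(
        model="rerank-english-v3.0",  # illustrative model name
        query=query,
        documents=candidates,  # e.g., the top-150 chunks from initial retrieval
        top_n=top_k,
    )
    return [candidates[hit.index] for hit in result.results]
```

Scoring the candidates can be parallelized by the reranking service, which keeps the added latency modest.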

Using Cohere’s reranker, the combined contextual embeddings + contextual BM25 + re‑ranking pipeline achieved a 67% reduction in top‑20 miss rate.

Cost & Latency Trade‑offs

Re‑ranking adds a runtime step, which increases latency modestly; scoring the candidates in parallel mitigates the impact. Balancing the number of re‑ranked chunks against latency and cost is essential; experiments suggest 20 chunks as a sweet spot.

4. Conclusion

Comprehensive testing shows that:

Embedding + BM25 outperforms embeddings alone.

Voyage and Gemini are the strongest embedding models.

Using 20 chunks per prompt yields better results than 5 or 10.

Adding contextual information dramatically improves retrieval accuracy.

Re‑ranking further boosts performance.

All improvements are additive—stacking contextual embeddings, contextual BM25, and re‑ranking yields the highest gains.

Developers are encouraged to follow the steps above to experiment with these techniques and unlock higher retrieval performance.

[Figure: Standard RAG system using embeddings and BM25]
[Figure: Contextual retrieval workflow]
[Figure: Contextual retrieval + re‑ranking]
[Figure: Re‑ranking improves miss rate]
Tags: AI · RAG · Embedding · BM25 · Re‑ranking · Contextual Retrieval
Written by JavaEdge

First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.
