How Contextual Retrieval Slashes RAG Failures by Up to 67% and Cuts Costs

Anthropic’s Contextual Retrieval augments traditional RAG with contextual embeddings and BM25, reducing retrieval failure rates by 49% (up to 67% with reranking), improving accuracy across domains, and lowering latency and cost through Claude’s prompt‑caching feature.


What Is Contextual Retrieval?

Anthropic introduced a method called Contextual Retrieval that enhances the retrieval step in Retrieval‑Augmented Generation (RAG). It combines two sub‑techniques, Contextual Embeddings and Contextual BM25, to provide richer, chunk‑specific context before indexing.

Why Traditional RAG Falls Short

Standard RAG splits documents into small chunks, embeds them, and stores the vectors in a similarity‑search database. When a chunk lacks sufficient surrounding information, the system may retrieve irrelevant or ambiguous results, especially for queries that require precise identifiers (e.g., financial filings). This loss of context often leads to retrieval failures.
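For orientation, here is a minimal sketch of that standard pipeline. The embed() function is a stand-in for a real embedding model, the chunking is deliberately naive, and the file path is a placeholder; this is an illustration of the flow, not a production implementation.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model (e.g. Voyage or Gemini); it hashes
    # words into a fixed-size bag-of-words vector so the sketch runs end to end.
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    return vec

def chunk_document(doc: str, chunk_size: int = 800) -> list[str]:
    # Naive fixed-width splitting; real pipelines split on sentence or token boundaries.
    return [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 20) -> list[str]:
    # Rank stored chunks by cosine similarity to the query embedding.
    q = embed(query)
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

doc = open("acme_q2_filing.txt").read()  # placeholder path for any long source document
chunks = chunk_document(doc)
vectors = np.stack([embed(c) for c in chunks])
top_chunks = retrieve("What was ACME's quarterly revenue growth?", chunks, vectors)

Because each chunk is embedded in isolation, nothing in the vector for "revenue grew by 3%" says which company or quarter it refers to; that is the gap Contextual Retrieval closes.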

How Contextual Retrieval Works

Before creating embeddings or BM25 indexes, each chunk is prefixed with a concise, automatically generated description that situates the chunk within the whole document. This description is produced by prompting Claude (or a similar LLM) with the full document and the target chunk.

original_chunk = "The company's revenue grew by 3% over the previous quarter."
contextualized_chunk = "This chunk is from an SEC filing on ACME Corp's performance in Q2 2023; the previous quarter's revenue was $314 million. The company's revenue grew by 3% over the previous quarter."

The generated context typically contains 50‑100 tokens and is attached to the chunk before both the embedding and BM25 indexing stages.

Implementation Details

The prompt used with Claude 3 Haiku follows this template:

<document> {{WHOLE_DOCUMENT}} </document>
Here is the chunk we want to situate within the whole document:
<chunk> {{CHUNK_CONTENT}} </chunk>
Please give a short, succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.

The resulting context is stored alongside the chunk in both the vector store and the BM25 index.
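A sketch of this step, assuming the Anthropic Python SDK for the Claude call and the rank_bm25 package for the keyword index. It reuses the embed() stand-in from the earlier sketch, and the helper names are illustrative.

import anthropic
import numpy as np
from rank_bm25 import BM25Okapi

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CONTEXT_PROMPT = (
    "<document>\n{doc}\n</document>\n"
    "Here is the chunk we want to situate within the whole document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Please give a short succinct context to situate this chunk within the overall "
    "document for the purposes of improving search retrieval of the chunk. "
    "Answer only with the succinct context and nothing else."
)

def contextualize(doc: str, chunk: str) -> str:
    # Ask Claude 3 Haiku for a short description that situates the chunk in the document.
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user", "content": CONTEXT_PROMPT.format(doc=doc, chunk=chunk)}],
    )
    return response.content[0].text

def build_contextual_indexes(doc: str, chunks: list[str]):
    # Prefix every chunk with its generated context, then index the result twice:
    # once for BM25 keyword search and once for embedding similarity search.
    contextualized = [f"{contextualize(doc, c)} {c}" for c in chunks]
    bm25 = BM25Okapi([c.lower().split() for c in contextualized])
    vectors = np.stack([embed(c) for c in contextualized])
    return contextualized, bm25, vectors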

Performance Gains

Contextual Embeddings alone reduce the top‑20 retrieval‑failure rate by 35% (5.7% → 3.7%).

Combining Contextual Embeddings with Contextual BM25 cuts the failure rate by 49% (5.7% → 2.9%).

Adding a reranking step on top of the combined approach lowers the failure rate by 67% (5.7% → 1.9%).

Experiments across codebases, novels, arXiv papers, and scientific articles confirm these improvements. The best-performing embeddings came from Voyage and Gemini (text-embedding-004), and passing the top 20 chunks to the model (rather than the top 10 or 5) yielded the strongest results.
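The post does not spell out how the embedding and BM25 result lists are merged. One common choice, shown here purely as an assumption rather than as Anthropic's exact recipe, is reciprocal rank fusion, which rewards chunks that rank highly in both lists.

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    # Each chunk earns 1 / (k + rank) from every list that contains it, so chunks
    # ranked highly by both the BM25 search and the embedding search rise to the top.
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Example: fuse the two retrieval channels and keep the top 20 chunks.
# fused = reciprocal_rank_fusion([embedding_results, bm25_results], top_n=20)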

Reranking for Further Improvement

Reranking filters the initial set of retrieved chunks (e.g., top 150) by scoring each chunk together with the user query using a dedicated reranker model. The top K (commonly 20) chunks are then passed to the LLM, reducing latency and cost because the model processes fewer, more relevant pieces of text.
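A sketch of that flow; rerank_score() is a hypothetical stand-in for whatever dedicated reranker model (for example a Cohere or Voyage reranker) is actually called.

def rerank_score(query: str, chunk: str) -> float:
    # Hypothetical stand-in for a dedicated reranker scoring (query, chunk) pairs;
    # a crude lexical-overlap score keeps the sketch runnable.
    q_terms, c_terms = set(query.lower().split()), set(chunk.lower().split())
    return len(q_terms & c_terms) / (len(q_terms) or 1)

def rerank(query: str, candidates: list[str], top_k: int = 20) -> list[str]:
    # Score each of the ~150 initially retrieved chunks against the query,
    # then keep only the top_k most relevant ones for the final LLM prompt.
    ranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return ranked[:top_k]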

Cost‑Effective Prompt Caching

Claude's prompt-caching feature lets the full document be loaded into the cache once and reused across every chunk's context-generation call, rather than being re-sent and re-processed each time. With 800-token chunks, an 8k-token document, and a 50-token context instruction, the one-time cost of generating the contextualized chunks works out to roughly $1.02 per million document tokens.
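A sketch of the cached variant of the context-generation call, using the cache_control field of the Anthropic Messages API; parameter shapes reflect the SDK as of this writing, so consult the current documentation before relying on them.

import anthropic

client = anthropic.Anthropic()

def contextualize_cached(doc: str, chunk: str) -> str:
    # The full document is marked with cache_control so subsequent calls for other
    # chunks of the same document reuse the cached prefix instead of re-processing it.
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        system=[
            {
                "type": "text",
                "text": f"<document>\n{doc}\n</document>",
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {
                "role": "user",
                "content": (
                    "Here is the chunk we want to situate within the whole document:\n"
                    f"<chunk>\n{chunk}\n</chunk>\n"
                    "Please give a short succinct context to situate this chunk within "
                    "the overall document for the purposes of improving search retrieval "
                    "of the chunk. Answer only with the succinct context and nothing else."
                ),
            }
        ],
    )
    return response.content[0].text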

Key Takeaways

Embedding + BM25 outperforms embeddings alone.

Voyage and Gemini provide the strongest embeddings.

Passing the top 20 chunks to the model yields better answers than fewer chunks.

Adding contextual information to each chunk dramatically improves retrieval accuracy.

Reranking consistently boosts performance.

All benefits are additive; the best pipeline combines contextual embeddings, contextual BM25, reranking, and a top‑20 chunk window.

Tags: AI, RAG, Embedding, BM25, prompt caching, Contextual Retrieval
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.