How Contextual Retrieval Slashes RAG Failures by Up to 67% and Cuts Costs

Anthropic’s Contextual Retrieval augments traditional RAG with contextual embeddings and BM25, reducing retrieval failure rates by 49% (up to 67% with reranking), improving accuracy across domains, and lowering latency and cost through Claude’s prompt‑caching feature.


What Is Contextual Retrieval?

Anthropic introduced a method called Contextual Retrieval that enhances the retrieval step in Retrieval‑Augmented Generation (RAG). It combines two sub‑techniques, Contextual Embeddings and Contextual BM25, to provide richer, chunk‑specific context before indexing.

Why Traditional RAG Falls Short

Standard RAG splits documents into small chunks, embeds them, and stores the vectors in a similarity‑search database. When a chunk lacks sufficient surrounding information, the system may retrieve irrelevant or ambiguous results, especially for queries that require precise identifiers (e.g., financial filings). This loss of context often leads to retrieval failures.
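For orientation, here is a minimal sketch of that standard pipeline. The embed() function is a stand-in for a real embedding model, the chunking is deliberately naive, and the file path is a placeholder; this is an illustration of the flow, not a production implementation.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model (e.g. Voyage or Gemini); it hashes
    # words into a fixed-size bag-of-words vector so the sketch runs end to end.
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    return vec

def chunk_document(doc: str, chunk_size: int = 800) -> list[str]:
    # Naive fixed-width splitting; real pipelines split on sentence or token boundaries.
    return [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 20) -> list[str]:
    # Rank stored chunks by cosine similarity to the query embedding.
    q = embed(query)
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

doc = open("acme_q2_filing.txt").read()  # placeholder path for any long source document
chunks = chunk_document(doc)
vectors = np.stack([embed(c) for c in chunks])
top_chunks = retrieve("What was ACME's quarterly revenue growth?", chunks, vectors)

Because each chunk is embedded in isolation, nothing in the vector for "revenue grew by 3%" says which company or quarter it refers to; that is the gap Contextual Retrieval closes.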

How Contextual Retrieval Works

Before creating embeddings or BM25 indexes, each chunk is prefixed with a concise, automatically generated description that situates the chunk within the whole document. This description is produced by prompting Claude (or a similar LLM) with the full document and the target chunk.

original_chunk = "The company's revenue grew by 3% over the previous quarter."
contextualized_chunk = "This chunk is from an SEC filing on ACME Corp's performance in Q2 2023; the previous quarter's revenue was $314 million. The company's revenue grew by 3% over the previous quarter."

The generated context typically contains 50‑100 tokens and is attached to the chunk before both the embedding and BM25 indexing stages.

Implementation Details

The prompt used with Claude 3 Haiku follows this template:

<document> {{WHOLE_DOCUMENT}} </document>
Here is the chunk we want to situate within the whole document:
<chunk> {{CHUNK_CONTENT}} </chunk>
Please give a short, succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.

The resulting context is stored alongside the chunk in both the vector store and the BM25 index.
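A sketch of this step, assuming the Anthropic Python SDK for the Claude call and the rank_bm25 package for the keyword index. It reuses the embed() stand-in from the earlier sketch, and the helper names are illustrative.

import anthropic
import numpy as np
from rank_bm25 import BM25Okapi

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CONTEXT_PROMPT = (
    "<document>\n{doc}\n</document>\n"
    "Here is the chunk we want to situate within the whole document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Please give a short succinct context to situate this chunk within the overall "
    "document for the purposes of improving search retrieval of the chunk. "
    "Answer only with the succinct context and nothing else."
)

def contextualize(doc: str, chunk: str) -> str:
    # Ask Claude 3 Haiku for a short description that situates the chunk in the document.
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user", "content": CONTEXT_PROMPT.format(doc=doc, chunk=chunk)}],
    )
    return response.content[0].text

def build_contextual_indexes(doc: str, chunks: list[str]):
    # Prefix every chunk with its generated context, then index the result twice:
    # once for BM25 keyword search and once for embedding similarity search.
    contextualized = [f"{contextualize(doc, c)} {c}" for c in chunks]
    bm25 = BM25Okapi([c.lower().split() for c in contextualized])
    vectors = np.stack([embed(c) for c in contextualized])
    return contextualized, bm25, vectors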

Performance Gains

Contextual Embeddings alone reduce the top‑20 retrieval‑failure rate by 35% (5.7% → 3.7%).

Combining Contextual Embeddings with Contextual BM25 cuts the failure rate by 49% (5.7% → 2.9%).

Adding a reranking step on top of the combined approach lowers the failure rate by 67% (5.7% → 1.9%).

Experiments across codebases, novels, arXiv papers, and scientific articles confirm these improvements. The best-performing embeddings came from Voyage and Gemini (text-embedding-004), and passing the top 20 chunks to the model (rather than the top 10 or 5) yielded the strongest results.
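The post does not spell out how the embedding and BM25 result lists are merged. One common choice, shown here purely as an assumption rather than as Anthropic's exact recipe, is reciprocal rank fusion, which rewards chunks that rank highly in both lists.

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    # Each chunk earns 1 / (k + rank) from every list that contains it, so chunks
    # ranked highly by both the BM25 search and the embedding search rise to the top.
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Example: fuse the two retrieval channels and keep the top 20 chunks.
# fused = reciprocal_rank_fusion([embedding_results, bm25_results], top_n=20)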

Reranking for Further Improvement

Reranking filters the initial set of retrieved chunks (e.g., top 150) by scoring each chunk together with the user query using a dedicated reranker model. The top K (commonly 20) chunks are then passed to the LLM, reducing latency and cost because the model processes fewer, more relevant pieces of text.
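A sketch of that flow; rerank_score() is a hypothetical stand-in for whatever dedicated reranker model (for example a Cohere or Voyage reranker) is actually called.

def rerank_score(query: str, chunk: str) -> float:
    # Hypothetical stand-in for a dedicated reranker scoring (query, chunk) pairs;
    # a crude lexical-overlap score keeps the sketch runnable.
    q_terms, c_terms = set(query.lower().split()), set(chunk.lower().split())
    return len(q_terms & c_terms) / (len(q_terms) or 1)

def rerank(query: str, candidates: list[str], top_k: int = 20) -> list[str]:
    # Score each of the ~150 initially retrieved chunks against the query,
    # then keep only the top_k most relevant ones for the final LLM prompt.
    ranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return ranked[:top_k]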

Cost‑Effective Prompt Caching

Claude's prompt-caching feature lets the full document be loaded into the cache once and reused across every chunk's context-generation call, rather than being re-sent and re-processed each time. With 800-token chunks, an 8k-token document, and a 50-token context instruction, the one-time cost of generating the contextualized chunks works out to roughly $1.02 per million document tokens.
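A sketch of the cached variant of the context-generation call, using the cache_control field of the Anthropic Messages API; parameter shapes reflect the SDK as of this writing, so consult the current documentation before relying on them.

import anthropic

client = anthropic.Anthropic()

def contextualize_cached(doc: str, chunk: str) -> str:
    # The full document is marked with cache_control so subsequent calls for other
    # chunks of the same document reuse the cached prefix instead of re-processing it.
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        system=[
            {
                "type": "text",
                "text": f"<document>\n{doc}\n</document>",
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {
                "role": "user",
                "content": (
                    "Here is the chunk we want to situate within the whole document:\n"
                    f"<chunk>\n{chunk}\n</chunk>\n"
                    "Please give a short succinct context to situate this chunk within "
                    "the overall document for the purposes of improving search retrieval "
                    "of the chunk. Answer only with the succinct context and nothing else."
                ),
            }
        ],
    )
    return response.content[0].text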

Key Takeaways

Embedding + BM25 outperforms embeddings alone.

Voyage and Gemini provide the strongest embeddings.

Passing the top 20 chunks to the model yields better answers than fewer chunks.

Adding contextual information to each chunk dramatically improves retrieval accuracy.

Reranking consistently boosts performance.

All benefits are additive; the best pipeline combines contextual embeddings, contextual BM25, reranking, and a top‑20 chunk window.

Tags: AI, RAG, Embedding, BM25, prompt caching, Contextual Retrieval
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.