Why RAG Is Dead: Jeff Huber’s 5 Retrieval Secrets and Context Engineering

Jeff Huber, founder of Chroma, argues that traditional RAG is obsolete, introduces context engineering as the new paradigm, and shares five practical retrieval strategies, a complete pipeline, and insights on handling context rot, memory, and generative benchmarking to build production‑grade AI applications.

By the Instant Consumer Technology Team

Introduction

In the AI era, people are either passive consumers of the information flood, overwhelmed by noise, or pioneers who wield AI with thoughtful intent. Jeff Huber, founder and CEO of the open‑source vector database Chroma, shares his perspective on why traditional Retrieval‑Augmented Generation (RAG) is dead and why Context Engineering is now king.

Chroma Overview

Chroma is a popular open‑source vector database used in many AI projects, especially RAG pipelines. It can be installed with a single pip install chromadb, sees over 5 million monthly installs, has more than 22k GitHub stars, and was featured in the Voyager paper.
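A minimal quickstart, assuming the chromadb Python package: create a client, add a few documents, and run a similarity query. The collection name, documents, and metadata here are illustrative.

```python
# Install first: pip install chromadb
import chromadb

# In-memory client; use chromadb.PersistentClient(path="./db") to persist.
client = chromadb.Client()
collection = client.create_collection(name="docs")

# Chroma embeds documents with a default embedding function if none is given.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Chroma is an open-source vector database.",
        "Context engineering assembles the right tokens for the model.",
    ],
    metadatas=[{"source": "readme"}, {"source": "blog"}],
)

results = collection.query(query_texts=["what is chroma?"], n_results=2)
print(results["documents"])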

Key Insight: RAG Is Dead, Context Engineering Is King

Huber claims that RAG is no longer sufficient; instead, developers should focus on “Context Engineering”. As large language model (LLM) context windows grow, the workload shifts from simple chatbots to powerful agents, making the context window a critical resource.

Five Retrieval Secrets

1. Stop calling it “RAG”. Name the fundamentals instead: dense retrieval, lexical retrieval, filters, re‑ranking, assembly, and evaluation loops.

2. First‑stage hybrid recall. Retrieve 200‑300 candidates using a mix of vectors, lexical/regex matches, and metadata filters.

3. Re‑rank before assembly. Use an LLM or cross‑encoder to reorder candidates; this dramatically improves generation quality.

4. Respect “context rot”. Keep context concise and structured rather than filling the window with irrelevant text.

5. Build a gold dataset. Integrate it into CI and dashboards for continuous, data‑driven evaluation (a minimal CI‑style check is sketched after this list).
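To make the fifth point concrete, here is a sketch of a CI‑style retrieval check. The gold_set.json file, the my_pipeline_retrieve entry point, and the 0.85 threshold are all hypothetical placeholders, not anything from Huber's talk.

```python
import json

def recall_at_k(gold_items, retrieve, k=10):
    """Fraction of gold queries whose expected chunk appears in the top-k results."""
    hits = 0
    for item in gold_items:
        retrieved_ids = [r["id"] for r in retrieve(item["query"], k=k)]
        hits += item["expected_chunk_id"] in retrieved_ids
    return hits / len(gold_items)

def test_retrieval_quality():
    # Run in CI (e.g. via pytest) so the build fails when retrieval regresses.
    with open("gold_set.json") as f:
        gold = json.load(f)  # [{"query": ..., "expected_chunk_id": ...}, ...]
    # my_pipeline_retrieve is your retrieval entry point (hypothetical here).
    assert recall_at_k(gold, retrieve=my_pipeline_retrieve, k=10) >= 0.85
```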

Complete Context‑Engineering Pipeline

Ingest

Parse + chunk: Split documents by headings, code blocks, tables, etc.

Enrich: Add titles, anchors, symbols, and metadata.

Optional LLM chunk summaries: Generate natural‑language summaries for code or API sections.

Embeddings: Produce dense vectors, optionally combined with sparse signals.

Write to DB: Store text, vectors, and metadata in a vector database (the full ingest flow is sketched below).
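A sketch of the ingest flow, assuming Chroma's Python client and markdown‑style documents. The heading‑based splitter and file names are illustrative, and the optional LLM summary step appears only as a placeholder comment.

```python
import chromadb

client = chromadb.PersistentClient(path="./index")
collection = client.get_or_create_collection(name="handbook")

def chunk_by_headings(markdown_text):
    """Naive splitter: start a new chunk at every markdown heading."""
    chunks, current, title = [], [], "intro"
    for line in markdown_text.splitlines():
        if line.startswith("#"):
            if current:
                chunks.append({"title": title, "text": "\n".join(current)})
            title, current = line.lstrip("# ").strip(), []
        else:
            current.append(line)
    if current:
        chunks.append({"title": title, "text": "\n".join(current)})
    return chunks

with open("handbook.md") as f:
    doc = f.read()

for i, chunk in enumerate(chunk_by_headings(doc)):
    # Optional: for code/API sections, embed an LLM summary of the chunk
    # instead of the raw text (summarize_with_llm would be your own call).
    collection.add(
        ids=[f"handbook-{i}"],
        documents=[chunk["text"]],  # embedded by Chroma's default model
        metadatas=[{"title": chunk["title"], "source": "handbook.md"}],
    )
```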

Query

First‑stage hybrid: Combine vector similarity, lexical/regex matches, and metadata filters.

Candidate pool: Retrieve roughly 100‑300 candidates.

Re‑rank: Use an LLM or cross‑encoder to select the top 20‑40 results.

Context assembly: Place system prompts first, deduplicate, merge similar items, diversify sources, and enforce token limits (a sketch of this query flow follows the list).
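A sketch of the query side under the same assumptions. Chroma's where metadata filters and where_document substring matching stand in for richer lexical search, cross_encoder_score is a hypothetical scorer (a real cross‑encoder or an LLM judge would go there), and the 4‑chars‑per‑token budget estimate is a rough convention.

```python
import chromadb

client = chromadb.PersistentClient(path="./index")
collection = client.get_or_create_collection(name="handbook")

def first_stage(query, n=200):
    """Hybrid recall: a dense pass with a metadata filter plus a crude lexical pass."""
    dense = collection.query(
        query_texts=[query],
        n_results=n,
        where={"source": "handbook.md"},  # metadata filter
    )
    lexical = collection.query(  # substring match as a simple lexical stand-in
        query_texts=[query],
        n_results=n,
        where_document={"$contains": query.split()[0]},
    )
    merged = {}  # merge and deduplicate candidates, keeping the first hit
    for i, d in zip(dense["ids"][0] + lexical["ids"][0],
                    dense["documents"][0] + lexical["documents"][0]):
        merged.setdefault(i, d)
    return [{"id": i, "text": d} for i, d in merged.items()]

def rerank(query, candidates, top_k=30):
    """Reorder candidates; cross_encoder_score is a hypothetical scorer."""
    return sorted(candidates,
                  key=lambda c: cross_encoder_score(query, c["text"]),
                  reverse=True)[:top_k]

def assemble(system_prompt, ranked, token_budget=6000):
    """Greedy assembly: system prompt first, then chunks until the budget is spent."""
    parts, used = [system_prompt], len(system_prompt) // 4
    for c in ranked:
        cost = len(c["text"]) // 4  # rough 4-chars-per-token estimate
        if used + cost > token_budget:
            break
        parts.append(c["text"])
        used += cost
    return "\n\n".join(parts)
```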

Outer‑Loop Optimization

Cache / cost guardrails: Prevent runaway expenses (a minimal guardrail sketch follows this list).

Generative benchmarking: Use a gold set to quantitatively evaluate retrieval strategies.

Error analysis: Refine chunking, filters, or re‑ranking prompts based on failures.

Memory / compaction: Summarize interaction history into searchable facts.
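For the cache/cost guardrail item above, a minimal sketch: a hash‑keyed cache in front of a hard budget check. The answer_with_llm call and the flat per‑call cost are assumptions for illustration, not real pricing.

```python
import hashlib

_cache = {}
_spent_usd = 0.0
BUDGET_USD = 25.0
COST_PER_CALL = 0.01  # assumed flat cost per model call, for illustration

def cached_answer(prompt):
    global _spent_usd
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:
        return _cache[key]  # cache hit: no new spend
    if _spent_usd + COST_PER_CALL > BUDGET_USD:
        raise RuntimeError("LLM budget exhausted; refusing the call")
    _spent_usd += COST_PER_CALL
    _cache[key] = answer_with_llm(prompt)  # hypothetical model call
    return _cache[key]
```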

Why Chroma?

Building production‑grade AI systems is more engineering than alchemy. Chroma aims to turn the “alchemy” of demo‑level prototypes into reliable, scalable infrastructure.

Information Retrieval vs. Search

Modern AI search differs from traditional web search in tools, workloads, developer roles, and consumers: an LLM can consume thousands of links per query, whereas a human can handle only a handful.

Context Rot

As the token count in the context window increases, model performance degrades, a phenomenon called “context rot”. Chroma’s “Context Rot” report shows that even top‑tier models suffer noticeable decay with longer contexts.

Memory as the Ultimate Benefit

Memory is the goal; Context Engineering is the means to place the right information in the window at the right time, whether short‑term, long‑term, or offline compaction.
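A sketch of offline compaction with Chroma as the fact store. The extract_facts_with_llm call is a hypothetical summarization step that would return short, self‑contained statements.

```python
import chromadb

client = chromadb.PersistentClient(path="./memory")
memory = client.get_or_create_collection(name="facts")

def compact(session_id, transcript):
    """Offline compaction: distill a transcript into short facts, then index them."""
    # extract_facts_with_llm is a placeholder for an LLM call that returns
    # statements like "User prefers Python" or "Project targets iOS 17".
    facts = extract_facts_with_llm(transcript)
    memory.add(
        ids=[f"{session_id}-{i}" for i in range(len(facts))],
        documents=facts,
        metadatas=[{"session": session_id} for _ in facts],
    )

def recall(query, n=5):
    """At prompt-assembly time, pull the most relevant remembered facts."""
    return memory.query(query_texts=[query], n_results=n)["documents"][0]
```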

Generative Benchmarking

Instead of guessing, generate high‑quality queries from your document chunks using an LLM, creating a “gold” QA set for systematic evaluation of retrieval pipelines.
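A sketch of that generation step, assuming a generate_query_with_llm helper (any chat‑completion call would do). The resulting pairs feed directly into a recall check like the CI sketch shown earlier.

```python
import json

def build_gold_set(chunks, out_path="gold_set.json"):
    """Ask an LLM to write one realistic query per chunk; save query/chunk pairs."""
    gold = []
    for chunk in chunks:
        prompt = (
            "Write one realistic user question that this passage answers. "
            f"Passage:\n{chunk['text']}"
        )
        gold.append({
            "query": generate_query_with_llm(prompt),  # hypothetical LLM call
            "expected_chunk_id": chunk["id"],
        })
    with open(out_path, "w") as f:
        json.dump(gold, f, indent=2)
    return gold
```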

Code‑Specific Retrieval

Regex remains essential for known patterns (e.g., searching for “cap table”).

Embeddings complement lexical search when the exact filename is unknown.

Embedding code can be improved by first generating natural‑language descriptions with an LLM and then embedding those, as sketched below.
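A sketch combining both signals for code: a regex pass for known patterns and a Chroma index of LLM‑written descriptions for semantic lookup. The describe_code_with_llm call is a placeholder.

```python
import re
import chromadb

client = chromadb.PersistentClient(path="./code-index")
code = client.get_or_create_collection(name="code")

def index_function(func_id, source):
    """Embed an LLM-written description of the code instead of the raw code."""
    # describe_code_with_llm is a placeholder returning e.g.
    # "Parses a cap table CSV and returns ownership percentages."
    description = describe_code_with_llm(source)
    code.add(ids=[func_id], documents=[description])

def regex_search(pattern, files):
    """Lexical pass for known patterns, e.g. r'cap[_ ]table'.

    files maps path -> file contents.
    """
    rx = re.compile(pattern)
    return [(path, line_no, line)
            for path, text in files.items()
            for line_no, line in enumerate(text.splitlines(), 1)
            if rx.search(line)]
```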

Conclusion

Transitioning from “RAG” to “Context Engineering” is a mindset shift that turns AI from a noisy black box into a disciplined engineering practice, enabling developers to become “context architects” rather than mere data feeders.
