CRAG Architecture Explained: Fixing Erroneous Retrieval Results Before the Generator
The article analyzes how most RAG pipelines blindly feed retrieved documents to LLMs, introduces CRAG's lightweight evaluator with confidence thresholds, describes its sentence‑level decomposition, filtering, and dual‑knowledge routing, and provides a full implementation walkthrough with a real insurance query example.
CRAG Overview
Most RAG systems treat retrieval as error‑free and feed all documents directly to the generator.
When retrieved documents are passed through unfiltered, the irrelevant ones actively mislead the generator and can make RAG perform worse than no retrieval at all.
CRAG Details
CRAG adds a lightweight retrieval evaluator that assigns one of three confidence levels to each document set for a given query.
Score ≥ UPPER_TH (0.7) → "correct": at least one document is sufficiently relevant; the documents are split into sentence‑level fragments, irrelevant fragments are filtered out, and the remaining pieces are recombined into clean internal knowledge (k_in).
Score ≤ LOWER_TH (0.3) → "incorrect": all local documents are irrelevant; the query is rewritten and sent to a web search, producing external knowledge (k_ex).
Scores between the thresholds → "fuzzy": both internal and external knowledge are used to hedge risk.
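The three‑way routing reduces to a small pure function over the per‑document confidence scores; `UPPER_TH` and `LOWER_TH` mirror the constants used in the implementation section, and `route` here is an illustrative name, not part of the original code:

```python
UPPER_TH, LOWER_TH = 0.7, 0.3

def route(scores: list[float]) -> str:
    """Map per-document confidence scores to a CRAG action."""
    best = max(scores)
    if best >= UPPER_TH:
        return 'correct'    # at least one relevant doc: refine into k_in
    if best <= LOWER_TH:
        return 'incorrect'  # all docs irrelevant: rewrite query, fetch k_ex
    return 'fuzzy'          # ambiguous: hedge with both k_in and k_ex
```

Note that "incorrect" requires every document to fall below the lower threshold, which is why the branch tests the best score, not each score individually.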
Decompose‑and‑Recompose Algorithm
Decompose: split each retrieved document into independent sentences (strips).
Filter: evaluate each strip’s relevance to the query and discard those below the threshold.
Recompose: concatenate the retained strips in original order to form a coherent context string.
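The three steps can be sketched with the strip scorer left pluggable; `score_fn` stands in for the paper's fine‑tuned T5 evaluator (or a shared LLM), and the naive regex splitter is an assumption for illustration:

```python
import re

def decompose(doc: str) -> list[str]:
    # Naive sentence splitter; the paper's "strips" can be finer-grained.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', doc) if s.strip()]

def recompose(docs: list[str], score_fn, threshold: float = 0.5) -> str:
    kept = []
    for doc in docs:
        for strip in decompose(doc):
            if score_fn(strip) >= threshold:  # filter irrelevant strips
                kept.append(strip)            # preserve original order
    return ' '.join(kept)                     # clean internal knowledge k_in
```

With a stub scorer that keys on a query term, only the matching sentences survive into the recomposed context.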
In the original paper the fragment‑level scores come from a fine‑tuned T5 evaluator. This implementation reuses a single LLM for all scoring tasks, trading a small amount of precision for a simpler deployment.
Architecture Diagram
The figure below is taken directly from the CRAG paper.
Step‑by‑Step Walkthrough of the Diagram
Top row (retrieval): the query x is sent to the retriever, which returns raw documents d₁ and d₂.
Left middle (retrieval evaluator): the evaluator scores the documents and routes the flow to the correct (green), fuzzy (orange), or incorrect (red) path.
Right middle, knowledge refinement (green box): on the correct path, d₁ and d₂ are broken into strips₁, strips₂ …; the filter removes irrelevant strips (marked ✗) and the kept strips are recombined into k_in.
Right middle, knowledge search (red box): on the incorrect path, x is rewritten into a keyword‑rich web query q (e.g., “Death of a Batman; screenwriter; Wikipedia”), a web search returns k₁ … kₙ, which are filtered into k_ex.
Bottom row (generation): the generator receives x together with the appropriate knowledge—x + k_in for the correct path, x + k_in + k_ex for the fuzzy path, or x + k_ex for the incorrect path. Unfiltered raw documents never reach the generator.
The generator therefore receives only knowledge that has passed evaluation, filtering, or external verification, eliminating context contamination at the architectural level.
Implementation
State Schema
class State(TypedDict):
    question: str              # original user question
    docs: List[Document]       # raw top-k retrieved blocks
    good_docs: List[Document]  # documents that passed the evaluator
    verdict: str               # CORRECT | INCORRECT
    reason: str                # evaluator's reasoning (for logging)
    strips: List[str]          # sentence-level fragments
    kept_strips: List[str]     # fragments that survive the filter
    refined_context: str       # final context sent to the generator
    web_query: str             # rewritten query for Google search
    web_docs: List[Document]   # results from Google Custom Search
    answer: str                # final generated answer
Node: Knowledge Retrieval
# Data ingestion (run once at startup)
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunks = RecursiveCharacterTextSplitter(
    chunk_size=900, chunk_overlap=150
).split_documents(docs)
retriever = FAISS.from_documents(chunks, embeddings).as_retriever(search_kwargs={'k': 4})
def retrieve_node(state: State):
    docs = retriever.invoke(state['question'])
    return {'docs': docs}
Node: Document Relevance Check
The core node assigns a continuous confidence score (0.0–1.0) to each document. Documents with score ≥ UPPER_TH (0.7) are added to good_docs. If good_docs is empty, the verdict becomes INCORRECT and the workflow proceeds to web search.
UPPER_TH = 0.7  # recommended "correct" threshold
LOWER_TH = 0.3  # below this is "incorrect"

class DocEvalScore(BaseModel):
    score: float  # 0.0-1.0 confidence
    reason: str   # short explanation for debugging

def eval_each_doc_node(state: State):
    good_docs = []
    for doc in state['docs']:
        decision = doc_eval_llm.invoke(
            doc_eval_prompt.format_messages(question=state['question'], doc=doc.page_content)
        )
        if decision.score >= UPPER_TH:
            good_docs.append(doc)
    verdict = 'CORRECT' if good_docs else 'INCORRECT'
    return {'good_docs': good_docs, 'verdict': verdict}
Using a continuous score instead of a hard classification makes the threshold a configurable parameter; for high‑precision domains such as insurance, raising UPPER_TH from 0.7 to 0.85 only requires changing a constant, keeping strategy separate from mechanism.
Node: Query Rewriting
Activated only on the INCORRECT path, this node rewrites the natural‑language question into a keyword‑dense search query, expanding abbreviations, adding domain‑specific terms, and removing colloquial phrasing.
class WebQuery(BaseModel):
    query: str

def rewrite_query_node(state: State):
    decision = rewrite_llm.invoke(
        rewrite_prompt.format_messages(question=state['question'])
    )
    return {'web_query': decision.query}
Node: External Knowledge Search
def web_search_node(state: State):
    params = {
        'key': os.getenv('GOOGLE_API_KEY'),
        'cx': os.getenv('GOOGLE_CSE_ID'),
        'q': state['web_query']
    }
    r = requests.get('https://www.googleapis.com/customsearch/v1', params=params)
    web_docs = [
        Document(page_content=item['snippet'])
        for item in r.json().get('items', [])
    ]
    return {'web_docs': web_docs}
Node: Context Assembly
def refine(state: State):
    # CORRECT path contributes good_docs; INCORRECT path contributes
    # web_docs (good_docs is empty); the fuzzy path combines both.
    docs = state.get('good_docs', []) + state.get('web_docs', [])
    context = '\n\n'.join(d.page_content for d in docs)
    return {'refined_context': context}
Real‑World Query Example
Query: “What is the claim settlement process for SecureLife critical‑illness insurance?”
FAISS retrieves four blocks from an insurance corpus.
Document relevance scores: 0.92 (directly relevant), 0.78 (relevant), 0.41 (weakly relevant), 0.18 (irrelevant car‑insurance clause).
Verdict: CORRECT – two documents pass the threshold and are fed to the context assembler.
The final context contains only the two relevant blocks; the irrelevant car‑insurance text is discarded, preventing the generator from mixing domains.
Without filtering, the generator could mistakenly combine the car‑insurance clause with the critical‑illness answer, producing an incorrect response. CRAG’s architecture blocks this failure mode.
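The control flow the nodes above implement can be sketched as a plain driver, with the single branch on the evaluator's verdict (a LangGraph StateGraph with a conditional edge expresses the same thing; the node names and `run_crag` are assumptions for illustration):

```python
def run_crag(state: dict, nodes: dict) -> dict:
    """Drive the CRAG flow; the only branch is the evaluator's verdict."""
    state.update(nodes['retrieve'](state))
    state.update(nodes['eval'](state))
    if state['verdict'] == 'INCORRECT':        # conditional edge
        state.update(nodes['rewrite'](state))
        state.update(nodes['web_search'](state))
    state.update(nodes['refine'](state))
    state.update(nodes['generate'](state))
    return state
```

On the CORRECT path the rewrite and web-search nodes are never invoked, which is exactly the behavior the diagram's green route describes.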
Code repository: https://github.com/bhavyameghnani/Corrective-RAG-Self-Reflective-RAG