CRAG Architecture Explained: Fixing Erroneous Retrieval Results Before the Generator
The article analyzes how most RAG pipelines blindly feed retrieved documents to LLMs, introduces CRAG's lightweight evaluator with confidence thresholds, describes its sentence‑level decomposition, filtering, and dual‑knowledge routing, and provides a full implementation walkthrough with a real insurance query example.
CRAG Overview
Most RAG systems treat retrieval as error‑free and feed all documents directly to the generator.
When retrieved documents are passed through unfiltered, the irrelevant ones actively mislead the generator and can make RAG perform worse than no retrieval at all.
CRAG Details
CRAG adds a lightweight retrieval evaluator that assigns one of three confidence levels to each document set for a given query.
Score ≥ UPPER_TH (0.7) → "correct": at least one document is sufficiently relevant; the documents are split into sentence‑level fragments, irrelevant fragments are filtered out, and the remaining pieces are recombined into clean internal knowledge (k_in).
Score ≤ LOWER_TH (0.3) → "incorrect": all local documents are irrelevant; the query is rewritten and sent to a web search, producing external knowledge (k_ex).
Scores between the thresholds → "fuzzy": both internal and external knowledge are used to hedge risk.
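The three‑way routing reduces to a small pure function over the per‑document confidence scores; `UPPER_TH` and `LOWER_TH` mirror the constants used in the implementation section, and `route` here is an illustrative name, not part of the original code:

```python
UPPER_TH, LOWER_TH = 0.7, 0.3

def route(scores: list[float]) -> str:
    """Map per-document confidence scores to a CRAG action."""
    best = max(scores)
    if best >= UPPER_TH:
        return 'correct'    # at least one relevant doc: refine into k_in
    if best <= LOWER_TH:
        return 'incorrect'  # all docs irrelevant: rewrite query, fetch k_ex
    return 'fuzzy'          # ambiguous: hedge with both k_in and k_ex
```

Note that "incorrect" requires every document to fall below the lower threshold, which is why the branch tests the best score, not each score individually.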
Decompose‑and‑Recompose Algorithm
Decompose: split each retrieved document into independent sentences (strips).
Filter: evaluate each strip’s relevance to the query and discard those below the threshold.
Recompose: concatenate the retained strips in original order to form a coherent context string.
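The three steps can be sketched with the strip scorer left pluggable; `score_fn` stands in for the paper's fine‑tuned T5 evaluator (or a shared LLM), and the naive regex splitter is an assumption for illustration:

```python
import re

def decompose(doc: str) -> list[str]:
    # Naive sentence splitter; the paper's "strips" can be finer-grained.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', doc) if s.strip()]

def recompose(docs: list[str], score_fn, threshold: float = 0.5) -> str:
    kept = []
    for doc in docs:
        for strip in decompose(doc):
            if score_fn(strip) >= threshold:  # filter irrelevant strips
                kept.append(strip)            # preserve original order
    return ' '.join(kept)                     # clean internal knowledge k_in
```

With a stub scorer that keys on a query term, only the matching sentences survive into the recomposed context.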
In the original paper the fragment‑level scores come from a fine‑tuned T5 evaluator. This implementation reuses a single LLM for all scoring tasks, trading a small amount of precision for a simpler deployment.
Architecture Diagram
The figure below is taken directly from the CRAG paper.
Step‑by‑Step Walkthrough of the Diagram
Top row (retrieval): the query x is sent to the retriever, which returns raw documents d₁ and d₂.
Left middle (retrieval evaluator): the evaluator scores the documents and routes the flow to the correct (green), fuzzy (orange), or incorrect (red) path.
Right middle, knowledge refinement (green box): on the correct path, d₁ and d₂ are broken into strips₁, strips₂ …; the filter removes irrelevant strips (marked ✗) and the kept strips are recombined into k_in.
Right middle, knowledge search (red box): on the incorrect path, x is rewritten into a keyword‑rich web query q (e.g., “Death of a Batman; screenwriter; Wikipedia”), a web search returns k₁ … kₙ, which are filtered into k_ex.
Bottom row (generation): the generator receives x together with the appropriate knowledge—x + k_in for the correct path, x + k_in + k_ex for the fuzzy path, or x + k_ex for the incorrect path. Unfiltered raw documents never reach the generator.
The generator therefore receives only knowledge that has passed evaluation, filtering, or external verification, eliminating context contamination at the architectural level.
Implementation
State Schema
class State(TypedDict):
    question: str              # original user question
    docs: List[Document]       # raw top-k retrieved blocks
    good_docs: List[Document]  # documents that passed the evaluator
    verdict: str               # CORRECT | INCORRECT
    reason: str                # evaluator's reasoning (for logging)
    strips: List[str]          # sentence-level fragments
    kept_strips: List[str]     # fragments that survive the filter
    refined_context: str       # final context sent to the generator
    web_query: str             # rewritten query for Google search
    web_docs: List[Document]   # results from Google Custom Search
    answer: str                # final generated answer
Node: Knowledge Retrieval
# Data ingestion (run once at startup)
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunks = RecursiveCharacterTextSplitter(
    chunk_size=900, chunk_overlap=150
).split_documents(docs)
retriever = FAISS.from_documents(chunks, embeddings).as_retriever(search_kwargs={'k': 4})
def retrieve_node(state: State):
    docs = retriever.invoke(state['question'])
    return {'docs': docs}
Node: Document Relevance Check
The core node assigns a continuous confidence score (0.0–1.0) to each document. Documents with score ≥ UPPER_TH (0.7) are added to good_docs. If good_docs is empty, the verdict becomes INCORRECT and the workflow proceeds to web search.
UPPER_TH = 0.7  # recommended "correct" threshold
LOWER_TH = 0.3  # below this is "incorrect"

class DocEvalScore(BaseModel):
    score: float  # 0.0-1.0 confidence
    reason: str   # short explanation for debugging

def eval_each_doc_node(state: State):
    good_docs = []
    for doc in state['docs']:
        decision = doc_eval_llm.invoke(
            doc_eval_prompt.format_messages(question=state['question'], doc=doc.page_content)
        )
        if decision.score >= UPPER_TH:
            good_docs.append(doc)
    verdict = 'CORRECT' if good_docs else 'INCORRECT'
    return {'good_docs': good_docs, 'verdict': verdict}
Using a continuous score instead of a hard classification makes the threshold a configurable parameter; for high‑precision domains such as insurance, raising UPPER_TH from 0.7 to 0.85 only requires changing a constant, keeping strategy separate from mechanism.
Node: Query Rewriting
Activated only on the INCORRECT path, this node rewrites the natural‑language question into a keyword‑dense search query, expanding abbreviations, adding domain‑specific terms, and removing colloquial phrasing.
class WebQuery(BaseModel):
    query: str

def rewrite_query_node(state: State):
    decision = rewrite_llm.invoke(
        rewrite_prompt.format_messages(question=state['question'])
    )
    return {'web_query': decision.query}
Node: External Knowledge Search
def web_search_node(state: State):
    params = {
        'key': os.getenv('GOOGLE_API_KEY'),
        'cx': os.getenv('GOOGLE_CSE_ID'),
        'q': state['web_query']
    }
    r = requests.get('https://www.googleapis.com/customsearch/v1', params=params)
    web_docs = [
        Document(page_content=item['snippet'])
        for item in r.json().get('items', [])
    ]
    return {'web_docs': web_docs}
Node: Context Assembly
def refine(state: State):
    # CORRECT path contributes good_docs; INCORRECT path contributes
    # web_docs (good_docs is empty); the fuzzy path combines both.
    docs = state.get('good_docs', []) + state.get('web_docs', [])
    context = '\n\n'.join(d.page_content for d in docs)
    return {'refined_context': context}
Real‑World Query Example
Query: “What is the claim settlement process for SecureLife critical‑illness insurance?”
FAISS retrieves four blocks from an insurance corpus.
Document relevance scores: 0.92 (directly relevant), 0.78 (relevant), 0.41 (weakly relevant), 0.18 (irrelevant car‑insurance clause).
Verdict: CORRECT – two documents pass the threshold and are fed to the context assembler.
The final context contains only the two relevant blocks; the irrelevant car‑insurance text is discarded, preventing the generator from mixing domains.
Without filtering, the generator could mistakenly combine the car‑insurance clause with the critical‑illness answer, producing an incorrect response. CRAG’s architecture blocks this failure mode.
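The control flow the nodes above implement can be sketched as a plain driver, with the single branch on the evaluator's verdict (a LangGraph StateGraph with a conditional edge expresses the same thing; the node names and `run_crag` are assumptions for illustration):

```python
def run_crag(state: dict, nodes: dict) -> dict:
    """Drive the CRAG flow; the only branch is the evaluator's verdict."""
    state.update(nodes['retrieve'](state))
    state.update(nodes['eval'](state))
    if state['verdict'] == 'INCORRECT':        # conditional edge
        state.update(nodes['rewrite'](state))
        state.update(nodes['web_search'](state))
    state.update(nodes['refine'](state))
    state.update(nodes['generate'](state))
    return state
```

On the CORRECT path the rewrite and web-search nodes are never invoked, which is exactly the behavior the diagram's green route describes.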
Code repository: https://github.com/bhavyameghnani/Corrective-RAG-Self-Reflective-RAG