When RAG Returns Junk, Why an LLM Can’t Fix It – Building an Agentic RAG
The article examines why traditional single‑step Retrieval‑Augmented Generation fails when retrieved passages are irrelevant, outlines the three fundamental flaws of that pipeline, and presents the Agentic RAG paradigm—turning retrieval into a reusable tool with planning, reflection, and decision loops, illustrated with code, interview scenarios, and practical deployment tips.
1. Why Traditional RAG Is a Dead End
Traditional RAG works as a one‑direction pipeline: a user query is embedded, top‑k results are fetched from a vector store, optionally reranked, and then concatenated into a prompt for a large language model (LLM) to generate an answer. No component can loop back.
Dead end 1: Retrieval quality decides everything. If the single retrieval step returns irrelevant documents, the answer is likely wrong, and the system has no way to detect the failure.
Dead end 2: Multi‑hop questions cannot be solved. Complex queries that require sequential look‑ups (e.g., “What was the claim ratio in 2023 compared to 2022 and why did it change?”) need at least three retrievals, but traditional RAG only performs one.
Dead end 3: No assessment of sufficiency. After retrieval the system never evaluates whether the retrieved set is relevant or sufficient; it always proceeds to generation.
These three dead ends stem from the fact that retrieval is a passive, one‑time step rather than an active tool.
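The one‑way pipeline described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the bag‑of‑words embed function, the two‑document corpus, and the string‑building stand‑in for the LLM call are all assumptions made for the example, and the optional rerank step is omitted.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def traditional_rag(question: str, corpus: list[str], top_k: int = 2) -> str:
    """Embed -> retrieve once -> build the prompt. Nothing loops back."""
    q = embed(question)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    context = ranked[:top_k]  # single retrieval, no relevance check
    # Stand-in for the LLM call: the prompt is built however poor the context is.
    return "CONTEXT:\n" + "\n".join(context) + "\nQUESTION: " + question


# Hypothetical corpus that contains nothing about the question.
corpus = ["the cafeteria opens at nine", "parking permits renew in march"]
prompt = traditional_rag("2023 claim ratio", corpus)
print(prompt)
```

Even though neither document shares a single term with the question, both are placed in the prompt and generation goes ahead, which is the failure mode the three dead ends describe.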
2. Core Shift: Making Retrieval a Tool
Agentic RAG’s central idea is to treat retrieval as a tool that an Agent can call repeatedly, rather than a fixed pipeline step.
The loop consists of four actions:
Plan: The Agent decides whether retrieval is needed, what to query, and whether to decompose the problem.
Retrieve: The Agent calls the retrieval tool with a query that may differ from the original user question.
Reflect: The Agent evaluates the returned documents – relevance, sufficiency, and missing pieces.
Decide: Based on reflection, the Agent either generates an answer or formulates a new query (or even switches data sources) and repeats.
This adds “reflection” and “decision” stages that are absent in traditional RAG.
def retrieve(query: str, top_k: int = 5, source: str = "vector") -> list[dict]:
    """Retrieval tool: the Agent decides query / top_k / source.

    Returns documents with source annotation for reflection.
    """
    if source == "vector":
        hits = vector_store.search(embed(query), top_k=top_k)
    elif source == "bm25":
        hits = bm25_index.search(query, top_k=top_k)
    elif source == "web":
        hits = web_search(query, top_k=top_k)
    else:
        # Fail loudly rather than reach the return with `hits` unbound.
        raise ValueError(f"unknown source: {source!r}")
    return [{"text": h.text, "source": h.doc_id, "score": h.score} for h in hits]

TOOLS = {"retrieve": retrieve}

The main loop then becomes:
def agentic_rag(question: str, max_steps: int = 6) -> str:
    history = []
    for step in range(max_steps):
        decision = llm_plan(question, history)  # {action, query, reason}
        if decision["action"] == "answer":
            return llm_generate(question, history)
        docs = retrieve(decision["query"], source=decision.get("source", "vector"))
        reflection = llm_reflect(question, decision["query"], docs)
        history.append({"query": decision["query"], "docs": docs, "reflection": reflection})
    return llm_generate(question, history)  # fallback

Compared with the traditional retrieve → generate straight line, the Agentic version inserts a reflection step after each retrieval and a planning step before the next retrieval, enabling a true feedback loop.
3. Three Main Agentic RAG Patterns
While the high‑level idea is the same, three concrete patterns have emerged.
Pattern 1: Self‑RAG
Self‑RAG adds special reflection tokens to the generation process so the model itself decides:
Whether retrieval is needed.
Whether each retrieved snippet is relevant.
Whether the generated sentence is supported by the retrieved documents.
This internal self‑audit helps suppress hallucinations but requires fine‑tuning on data that contains these tokens.
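To make the self‑audit concrete, here is a minimal sketch of how reflection judgments might be used once they have been decoded into labels. In Self‑RAG the judgments come from the fine‑tuned model itself; the label names, the score tables, and the simple additive scoring below are assumptions for illustration only.

```python
from dataclasses import dataclass

# Illustrative label-to-score tables (hypothetical names and weights).
RELEVANCE = {"relevant": 1.0, "irrelevant": 0.0}
SUPPORT = {"fully": 1.0, "partially": 0.5, "none": 0.0}


@dataclass
class Candidate:
    text: str       # a generated sentence or segment
    relevance: str  # judgment on the retrieved snippet it was conditioned on
    support: str    # judgment on whether the snippet backs the text


def pick_best(candidates: list[Candidate]) -> Candidate:
    """Return the candidate whose reflection labels score highest."""
    return max(candidates, key=lambda c: RELEVANCE[c.relevance] + SUPPORT[c.support])


candidates = [
    Candidate("Answer A", relevance="relevant", support="partially"),
    Candidate("Answer B", relevance="relevant", support="fully"),
    Candidate("Answer C", relevance="irrelevant", support="none"),
]
best = pick_best(candidates)
print(best.text)
```

The point of the sketch is the selection step: a segment that is conditioned on an irrelevant snippet, or that the snippet does not support, loses to one that is grounded.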
Pattern 2: CRAG (Corrective RAG)
CRAG inserts a lightweight retrieval evaluator between retrieval and generation. The evaluator classifies the retrieved set as Correct, Incorrect, or Ambiguous. Incorrect results trigger query rewriting or a fallback web search; ambiguous results are merged with web results for the generator to weigh.
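The three‑way grading can be sketched as a threshold rule. This is an assumption‑laden illustration: it presumes each document carries a relevance score in [0, 1] produced by a separate evaluator model, and the 0.7 and 0.3 cut‑offs are hypothetical values, not ones given in the article.

```python
def evaluate_retrieval(question: str, docs: list[dict],
                       upper: float = 0.7, lower: float = 0.3) -> str:
    """Grade a retrieved set as 'correct', 'incorrect', or 'ambiguous'.

    Assumption: d["score"] is an evaluator relevance score in [0, 1].
    `question` is unused here; a real evaluator would score each
    (question, document) pair itself instead of trusting stored scores.
    """
    if not docs:
        return "incorrect"
    best = max(d["score"] for d in docs)
    if best >= upper:   # at least one document is clearly relevant
        return "correct"
    if best < lower:    # every document is clearly irrelevant
        return "incorrect"
    return "ambiguous"  # neither confident case applies


print(evaluate_retrieval("q", [{"score": 0.9}, {"score": 0.1}]))
print(evaluate_retrieval("q", [{"score": 0.2}]))
print(evaluate_retrieval("q", [{"score": 0.5}]))
```

Note that raw similarity scores from a vector store are usually not calibrated to this range, so the thresholds would need tuning against labeled examples.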
def crag_retrieve(question: str) -> list[dict]:
    docs = retrieve(question, source="vector")
    grade = evaluate_retrieval(question, docs)  # correct/incorrect/ambiguous
    if grade == "correct":
        return docs
    elif grade == "incorrect":
        new_query = rewrite_query(question)
        return retrieve(new_query, source="web")
    else:  # ambiguous
        web_docs = retrieve(rewrite_query(question), source="web")
        return docs + web_docs

Pattern 3: Multi‑hop ReAct Retrieval
For questions that cannot be answered with a single lookup, retrieval is embedded inside a ReAct (Reasoning‑and‑Acting) loop. Each hop’s query is derived from the previous hop’s results, enabling true sequential reasoning (e.g., first fetch 2023 data, then 2022 data, then the cause).
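The defining property, that each hop's query depends on the previous hop's result, can be shown with a toy chain of lookups. Everything here is a placeholder: the two‑entry knowledge base is invented, and the query templates stand in for the reasoning step where an LLM would write the next query.

```python
from typing import Optional

# Hypothetical knowledge base with abstract placeholder facts.
KB = {
    "who wrote report R": "team T",
    "who leads team T": "person P",
}


def lookup(query: str) -> Optional[str]:
    """Stand-in for the retrieval tool: exact-match lookup."""
    return KB.get(query)


def multi_hop(first_query: str, follow_ups: list[str]) -> list[tuple[str, str]]:
    """Run chained lookups; each template is filled with the previous answer."""
    trace = []
    query = first_query
    remaining = list(follow_ups)
    while True:
        answer = lookup(query)
        if answer is None:  # dead end: stop instead of guessing
            break
        trace.append((query, answer))
        if not remaining:
            break
        query = remaining.pop(0).format(prev=answer)  # next query built from this result
    return trace


trace = multi_hop("who wrote report R", ["who leads {prev}"])
print(trace)
```

The second query cannot even be written until the first answer is known, which is why a single embed‑and‑retrieve pass cannot cover this class of question.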
4. A Real Multi‑hop Example
Question: “Why is the 2023 claim ratio for this critical illness product higher than in 2022?”
Traditional RAG: Embeds the whole question, retrieves a batch of passages that mention “claim ratio”, “2023”, and “critical illness”. The set usually lacks the 2022 baseline and the causal analysis, so the LLM either omits the comparison or fabricates a reason.
Agentic RAG: The Agent proceeds step‑by‑step:
Plan: retrieve “2023 critical‑illness claim ratio”. Retrieve → reflect: have 2023 number, still missing 2022 baseline.
Plan: retrieve “2022 critical‑illness claim ratio”. Retrieve → reflect: now have both numbers, but still lack the cause.
Plan: retrieve “reason for 2023 claim‑ratio increase”. CRAG evaluator flags the result as incorrect (too noisy), so the Agent rewrites the query to a more specific phrase and retrieves the correct causal paragraph.
Plan: reflection shows information is sufficient; generate the answer with data, cause, and source citations.
The Agentic path uses four retrievals and multiple model calls, but produces a complete, grounded answer.
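To put a rough number on "multiple model calls", the count for a plan, retrieve, reflect loop can be worked out directly. The sketch assumes one planning call before each retrieval, one reflection call after it, one extra planning call that decides to answer, and one generation call; evaluator or query‑rewriting calls are not counted and would come on top.

```python
def model_calls(retrievals: int, answered_within_budget: bool = True) -> int:
    """Count LLM calls made by a plan -> retrieve -> reflect loop.

    Assumption: each retrieval costs one planning call and one reflection
    call. If the planner itself chooses to answer, that decision is one more
    planning call. A single generation call produces the final answer.
    """
    plan_calls = retrievals + (1 if answered_within_budget else 0)
    reflect_calls = retrievals
    generate_calls = 1
    return plan_calls + reflect_calls + generate_calls


print(model_calls(4))         # the four-retrieval trace above
print(model_calls(6, False))  # loop exhausted its step budget, then fell back
```

Under these assumptions the four‑retrieval trace already costs ten model calls before any evaluator or rewrite step is added.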
5. Three Pitfalls to Guard Before Production
Pitfall 1: Non‑converging loops. If the knowledge base lacks any relevant content, the Agent may keep rewriting queries forever. Mitigation: (a) hard max_steps limit, (b) duplicate‑query detection via cosine similarity, (c) stop after two consecutive hops that add no new information.
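Of the three mitigations, (c) is the least obvious to implement, because "new information" has to be defined. A minimal sketch follows, assuming that a hop counts as progress only if it returns at least one document source not seen before; the helper names and the patience parameter are hypothetical.

```python
def added_new_info(docs: list[dict], seen_sources: set[str]) -> bool:
    """True if any doc comes from an unseen source; records the new sources."""
    new = {d["source"] for d in docs} - seen_sources
    seen_sources.update(new)
    return bool(new)


def should_stop(progress_flags: list[bool], patience: int = 2) -> bool:
    """Stop when the last `patience` hops all failed to add anything new."""
    return len(progress_flags) >= patience and not any(progress_flags[-patience:])


# Three hops that keep returning the same document.
hops = [[{"source": "doc-a"}], [{"source": "doc-a"}], [{"source": "doc-a"}]]
seen: set[str] = set()
flags = [added_new_info(docs, seen) for docs in hops]
print(flags, should_stop(flags))
```

Source identity is a crude proxy; comparing retrieved text against what is already in the history would catch near‑duplicates from different sources. Guards (a) and (b) need no such helper: a loop bound and a similarity check on past queries are enough.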
def agentic_rag_safe(question, max_steps=6, sim_threshold=0.92):
    history, past_queries = [], []
    for step in range(max_steps):
        decision = llm_plan(question, history)
        if decision["action"] == "answer":
            return llm_generate(question, history)
        q = decision["query"]
        if any(cosine(embed(q), embed(pq)) > sim_threshold for pq in past_queries):
            break
        past_queries.append(q)
        docs = retrieve(q, source=decision.get("source", "vector"))
        history.append({"query": q, "docs": docs, "reflection": llm_reflect(question, q, docs)})
    return llm_generate(question, history)

Pitfall 2: Cost and latency explosion. A four‑hop query can trigger >10 model calls, turning milliseconds into seconds and inflating cloud bills. Solution: a lightweight complexity classifier at the entry point routes simple FAQ‑style queries to traditional RAG and only sends complex, multi‑hop cases to the Agentic pipeline.
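The entry‑point routing can be prototyped without a model at all. The sketch below is a keyword heuristic standing in for the complexity classifier; the cue words and the two‑distinct‑years rule are assumptions chosen for the example, and a trained classifier would replace them in practice.

```python
import re

# Hypothetical cue words that hint at comparison or causal questions.
MULTI_HOP_CUES = re.compile(
    r"\b(why|compare|compared|versus|vs|difference|trend|change|changed)\b",
    re.IGNORECASE,
)
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def route(question: str) -> str:
    """Return 'agentic' for questions that look multi-hop, else 'traditional'."""
    distinct_years = set(YEAR.findall(question))
    if len(distinct_years) >= 2 or MULTI_HOP_CUES.search(question):
        return "agentic"
    return "traditional"


print(route("What is the waiting period for this policy?"))
print(route("Why is the 2023 claim ratio higher than in 2022?"))
```

Because the router runs on every request, it should cost far less than one LLM call; a misroute toward the traditional path degrades answer quality, while a misroute toward the Agentic path only costs time and money.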
Pitfall 3: Unreliable self‑reflection. The LLM may incorrectly judge noisy results as “sufficient”. Using an independent evaluator (as in CRAG) is more reliable than trusting the model’s own self‑assessment.
6. When to Use Agentic RAG vs. When It Is Over‑Engineering
Decision matrix (simplified):
Simple single‑hop FAQ → Traditional RAG.
Knowledge base with high noise → Traditional RAG + CRAG evaluator.
Multi‑hop, information‑aggregation queries → Multi‑hop Agentic RAG.
Scenarios demanding high answer fidelity (finance, medical) → Self‑RAG with source tagging.
High‑throughput, low‑latency production → Prefer Traditional RAG; avoid Agentic loops.
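The matrix does not say what to do when a workload matches more than one row, so any encoding has to pick a precedence. The sketch below is one possible ordering (latency first, then fidelity, then multi‑hop, then noise); both the ordering and the strategy labels are assumptions, not part of the matrix.

```python
def choose_strategy(multi_hop: bool = False, noisy_kb: bool = False,
                    high_fidelity: bool = False,
                    latency_critical: bool = False) -> str:
    """Map workload traits to a retrieval strategy (assumed precedence)."""
    if latency_critical:
        return "traditional"            # avoid Agentic loops entirely
    if high_fidelity:
        return "self_rag"               # with source tagging
    if multi_hop:
        return "multi_hop_agentic"
    if noisy_kb:
        return "traditional_plus_crag"  # evaluator in front of generation
    return "traditional"                # simple single-hop FAQ


print(choose_strategy())
print(choose_strategy(noisy_kb=True))
print(choose_strategy(multi_hop=True, latency_critical=True))
```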
7. How to Answer an Agentic RAG Interview Question
Four‑step answer template:
State the paradigm difference (≈30 s): traditional RAG is a one‑way pipeline; Agentic RAG makes retrieval a repeatable tool with planning, reflection, and decision.
Describe the three concrete modes (≈40 s): Self‑RAG (reflection tokens), CRAG (lightweight evaluator), Multi‑hop ReAct (retrieval inside reasoning loop).
Give a concrete trace (≈30 s) using the 2023 vs 2022 claim‑ratio example.
Conclude with cost‑aware selection (≈20 s): use Agentic RAG only for multi‑hop or high‑trust scenarios; otherwise stick to traditional RAG and route simple queries away.
Typical follow‑up questions and concise answers are also provided in the article (loop safety, routing criteria, preference between Self‑RAG and CRAG).
Conclusion
The core limitation of traditional RAG is its topology – a single, passive retrieval step. Agentic RAG resolves this by turning retrieval into an active tool that an Agent can call, reflect on, and decide to repeat, yielding a feedback‑driven loop. The three concrete patterns (Self‑RAG, CRAG, Multi‑hop ReAct) address different pain points, and practical production advice (loop guards, cost routing, evaluator reliability) ensures the approach is usable at scale.
Wu Shixiong's Large Model Academy