When RAG Returns Junk, Why an LLM Can’t Fix It – Building an Agentic RAG

The article examines why traditional single‑step Retrieval‑Augmented Generation fails when retrieved passages are irrelevant, outlines the three fundamental flaws of that pipeline, and presents the Agentic RAG paradigm—turning retrieval into a reusable tool with planning, reflection, and decision loops, illustrated with code, interview scenarios, and practical deployment tips.

Wu Shixiong's Large Model Academy

1. Why Traditional RAG Is a "Dead Link"

Traditional RAG works as a one‑way pipeline: a user query is embedded, the top‑k results are fetched from a vector store and optionally reranked, and the results are concatenated into a prompt from which a large language model (LLM) generates an answer. No stage can loop back to an earlier one.

Dead end 1: Retrieval quality decides everything. If the single retrieval step returns irrelevant documents, the answer is likely wrong, and the system has no way to detect the failure.

Dead end 2: Multi‑hop questions cannot be solved. Complex queries that require sequential look‑ups (e.g., “What was the claim ratio in 2023 compared to 2022 and why did it change?”) need at least three retrievals, but traditional RAG only performs one.

Dead end 3: No assessment of sufficiency. After retrieval the system never evaluates whether the retrieved set is relevant or sufficient; it always proceeds to generation.

These three dead ends stem from the fact that retrieval is a passive, one‑time step rather than an active tool.
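
The one‑way pipeline and its blind spots can be sketched as follows. This is a minimal illustration, not a reference implementation: the toy `CORPUS`, the word‑overlap `search` standing in for embedding similarity, and the `fake_llm` stub are all assumptions made up for this example.

```python
import re

# Toy corpus standing in for a vector store (assumption for illustration).
CORPUS = [
    {"doc_id": "d1", "text": "The 2023 claim ratio for the critical illness product rose."},
    {"doc_id": "d2", "text": "Office opening hours are 9am to 6pm on weekdays."},
    {"doc_id": "d3", "text": "Premium payments can be made monthly or annually."},
]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(query: str, top_k: int = 2) -> list[dict]:
    """Rank by word overlap; a stand-in for embedding similarity."""
    q = tokens(query)
    ranked = sorted(CORPUS, key=lambda d: len(q & tokens(d["text"])), reverse=True)
    return ranked[:top_k]

def fake_llm(prompt: str) -> str:
    """Hypothetical LLM stub: it only reports how many passages it was given."""
    return f"answer based on {prompt.count('[doc]')} passage(s)"

def traditional_rag(question: str, top_k: int = 2) -> str:
    docs = search(question, top_k)                        # the one and only retrieval
    context = "\n".join(f"[doc] {d['text']}" for d in docs)
    prompt = f"{context}\n\nQuestion: {question}"
    return fake_llm(prompt)                               # generation always runs

print(traditional_rag("What is the refund policy for pets?"))
```

Even though nothing in the corpus covers the question, the pipeline still hands two passages to the generator and returns an answer. Nothing in the flow checks relevance or asks for a second retrieval.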

Traditional RAG single retrieval: a non‑reversible dead link

2. Core Shift: Making Retrieval a Tool

Agentic RAG’s central idea is to treat retrieval as a tool that an Agent can call repeatedly, rather than a fixed pipeline step.

The loop consists of four actions:

Plan: The Agent decides whether retrieval is needed, what to query, and whether to decompose the problem.

Retrieve: The Agent calls the retrieval tool with a query that may differ from the original user question.

Reflect: The Agent evaluates the returned documents for relevance, sufficiency, and missing pieces.

Decide: Based on reflection, the Agent either generates an answer or formulates a new query (or even switches data sources) and repeats.

This adds “reflection” and “decision” stages that are absent in traditional RAG.

def retrieve(query: str, top_k: int = 5, source: str = "vector") -> list[dict]:
    """Retrieval tool: the Agent decides query / top_k / source.
    Returns documents with source annotation for reflection.
    vector_store, embed, bm25_index and web_search are assumed to exist.
    """
    if source == "vector":
        hits = vector_store.search(embed(query), top_k=top_k)
    elif source == "bm25":
        hits = bm25_index.search(query, top_k=top_k)
    elif source == "web":
        hits = web_search(query, top_k=top_k)
    else:
        # Without this branch an unknown source would leave `hits` undefined.
        raise ValueError(f"unknown source: {source!r}")
    return [{"text": h.text, "source": h.doc_id, "score": h.score} for h in hits]

TOOLS = {"retrieve": retrieve}

The main loop then becomes:

def agentic_rag(question: str, max_steps: int = 6) -> str:
    history = []  # one entry per hop: query, docs, reflection
    for _ in range(max_steps):
        # Plan: decide whether to answer now or retrieve again
        decision = llm_plan(question, history)  # {action, query, reason, optional source}
        if decision["action"] == "answer":
            return llm_generate(question, history)
        # Retrieve, then reflect on what came back
        docs = retrieve(decision["query"], source=decision.get("source", "vector"))
        reflection = llm_reflect(question, decision["query"], docs)
        history.append({"query": decision["query"], "docs": docs, "reflection": reflection})
    return llm_generate(question, history)  # fallback when max_steps is exhausted

Compared with the traditional retrieve → generate straight line, the Agentic version inserts a reflection step after each retrieval and a planning step before the next retrieval, enabling a true feedback loop.

Agentic RAG decision loop: retrieval as a repeatable tool

3. Three Main Agentic RAG Patterns

While the high‑level idea is the same, three concrete patterns have emerged.

Pattern 1: Self‑RAG

Self‑RAG adds special reflection tokens to the generation process so the model itself decides:

Whether retrieval is needed.

Whether each retrieved snippet is relevant.

Whether the generated sentence is supported by the retrieved documents.

This internal self‑audit helps suppress hallucinations but requires fine‑tuning on data that contains these tokens.
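
To show how such tokens could be consumed downstream, here is a small sketch that parses reflection tokens out of generated segments and keeps only the grounded ones. The token spellings (`[Retrieve]`, `[NoRetrieve]`, `[Relevant]`, `[Irrelevant]`, `[Supported]`, `[Unsupported]`) are hypothetical placeholders chosen for this example; a fine‑tuned Self‑RAG model defines its own vocabulary, and the filtering rule is likewise an assumption.

```python
import re

# Hypothetical token spellings for illustration only.
TOKEN_RE = re.compile(
    r"\[(Retrieve|NoRetrieve|Relevant|Irrelevant|Supported|Unsupported)\]"
)

def parse_segment(segment: str) -> dict:
    """Split one generated segment into its reflection tokens and plain text."""
    found = set(TOKEN_RE.findall(segment))
    text = TOKEN_RE.sub("", segment).strip()
    return {"text": text, "tokens": found}

def keep_grounded(segments: list[str]) -> list[str]:
    """Keep segments that needed no retrieval, or whose evidence is
    marked both relevant and supporting."""
    kept = []
    for seg in segments:
        parsed = parse_segment(seg)
        toks = parsed["tokens"]
        if "NoRetrieve" in toks or {"Relevant", "Supported"} <= toks:
            kept.append(parsed["text"])
    return kept

segments = [
    "[Retrieve][Relevant][Supported] The ratio rose in 2023.",
    "[Retrieve][Irrelevant][Unsupported] It rose because of weather.",
    "[NoRetrieve] Here is a summary.",
]
print(keep_grounded(segments))
```

The second segment is dropped because the model itself flagged its evidence as irrelevant and unsupported, which is the self‑audit behaviour described above.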

Pattern 2: CRAG (Corrective RAG)

CRAG inserts a lightweight retrieval evaluator between retrieval and generation. The evaluator classifies the retrieved set as Correct, Incorrect, or Ambiguous. Incorrect results trigger query rewriting or a fallback web search; ambiguous results are merged with web results for the generator to weigh.

def crag_retrieve(question: str) -> list[dict]:
    docs = retrieve(question, source="vector")
    grade = evaluate_retrieval(question, docs)  # correct/incorrect/ambiguous
    if grade == "correct":
        return docs
    elif grade == "incorrect":
        new_query = rewrite_query(question)
        return retrieve(new_query, source="web")
    else:  # ambiguous
        web_docs = retrieve(rewrite_query(question), source="web")
        return docs + web_docs
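
The `evaluate_retrieval` call above is left undefined. One way to sketch it is a scorer with an upper and a lower threshold: a set is graded correct if at least one passage scores above the upper bound, incorrect if every passage scores below the lower bound, and ambiguous otherwise. The word‑overlap scorer and the threshold values below are assumptions for illustration; a real evaluator would more likely be a small trained relevance model.

```python
import re

def overlap_score(question: str, text: str) -> float:
    """Fraction of question words found in the passage (toy relevance score)."""
    q = set(re.findall(r"[a-z0-9]+", question.lower()))
    t = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(q & t) / len(q) if q else 0.0

def evaluate_retrieval(question: str, docs: list[dict],
                       upper: float = 0.6, lower: float = 0.2) -> str:
    """Grade a retrieved set as correct / incorrect / ambiguous.

    `upper` and `lower` are hypothetical thresholds; tune them on labeled data.
    """
    scores = [overlap_score(question, d["text"]) for d in docs]
    if scores and max(scores) >= upper:
        return "correct"      # at least one passage looks clearly relevant
    if not scores or max(scores) < lower:
        return "incorrect"    # nothing retrieved, or nothing relevant
    return "ambiguous"        # some signal, but not enough to trust

question = "2023 claim ratio critical illness"
print(evaluate_retrieval(question, [{"text": "The claim ratio is reviewed yearly."}]))
```

The sample passage matches two of the five question words, so it falls between the two thresholds and is graded ambiguous.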

Pattern 3: Multi‑hop ReAct Retrieval

For questions that cannot be answered with a single lookup, retrieval is embedded inside a ReAct (Reasoning‑and‑Acting) loop. Each hop’s query is derived from the previous hop’s results, enabling true sequential reasoning (e.g., first fetch 2023 data, then 2022 data, then the cause).
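
A minimal sketch of this dependency between hops, under heavy simplification: the dictionary `KB` replaces the retrieval tool, `plan_next` is a scripted stand‑in for the LLM's reasoning step, and the product and team names are made up.

```python
# Toy knowledge base keyed by exact query (an assumption; a real system
# would call a retrieval tool here).
KB = {
    "product with highest claim ratio": "Product: Plan-A",
    "owner of Plan-A": "Owner: Team-X",
}

def lookup(query: str) -> str:
    return KB.get(query, "")

def plan_next(question: str, observations: list[str]) -> dict:
    """Scripted stand-in for the LLM's Thought/Action step."""
    if not observations:
        return {"action": "retrieve", "query": "product with highest claim ratio"}
    if len(observations) == 1:
        # Hop 2 cannot be written until hop 1 has returned the product name.
        product = observations[0].partition(": ")[2]
        return {"action": "retrieve", "query": f"owner of {product}"}
    return {"action": "answer"}

def react_multihop(question: str, max_hops: int = 4) -> tuple[str, list[str]]:
    observations, queries = [], []
    for _ in range(max_hops):
        step = plan_next(question, observations)
        if step["action"] == "answer":
            break
        queries.append(step["query"])
        observations.append(lookup(step["query"]))   # Observation
    return "; ".join(observations), queries

answer, queries = react_multihop("Who owns the product with the highest claim ratio?")
print(queries)
print(answer)
```

The second query contains a value that only exists after the first retrieval, which is why a single embedding of the original question cannot replace the loop.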

Three mainstream Agentic RAG modes: Self‑RAG / CRAG / Multi‑hop ReAct

4. A Real Multi‑hop Example

Question: “Why is the 2023 claim ratio for this critical illness product higher than in 2022?”

Traditional RAG: Embeds the whole question, retrieves a batch of passages that mention “claim ratio”, “2023”, and “critical illness”. The set usually lacks the 2022 baseline and the causal analysis, so the LLM either omits the comparison or fabricates a reason.

Agentic RAG: The Agent proceeds step‑by‑step:

Plan: retrieve “2023 critical‑illness claim ratio”. Retrieve → reflect: have 2023 number, still missing 2022 baseline.

Plan: retrieve “2022 critical‑illness claim ratio”. Retrieve → reflect: now have both numbers, but still lack the cause.

Plan: retrieve “reason for 2023 claim‑ratio increase”. CRAG evaluator flags the result as incorrect (too noisy), so the Agent rewrites the query to a more specific phrase and retrieves the correct causal paragraph.

Plan: reflection shows information is sufficient; generate the answer with data, cause, and source citations.

The Agentic path uses four retrievals and multiple model calls, but produces a complete, grounded answer.
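
A rough count of the model calls behind such a trace, assuming the loop shape from section 2: one planning call before every retrieval, one reflection call after it, one final planning call that chooses "answer", and one generation call. The `rewrites` and `evaluator_calls` parameters are hypothetical extras for setups where query rewriting or the CRAG evaluator are themselves model calls.

```python
def count_model_calls(retrievals: int, rewrites: int = 0, evaluator_calls: int = 0) -> int:
    """Estimate model calls for one question under the assumptions above."""
    plan_calls = retrievals + 1    # one per retrieval, plus the final "answer" decision
    reflect_calls = retrievals     # one reflection after every retrieval
    generate_calls = 1             # the final answer
    return plan_calls + reflect_calls + generate_calls + rewrites + evaluator_calls

print(count_model_calls(4))
print(count_model_calls(4, rewrites=1, evaluator_calls=4))
```

Under these assumptions, four retrievals already cost 10 model calls, and adding one rewrite plus an evaluator call per retrieval brings the total to 15, compared with a single generation call in traditional RAG.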

Real execution trace of a multi‑hop Agentic RAG query

5. Three Pitfalls to Guard Before Production

Pitfall 1: Non‑converging loops. If the knowledge base lacks any relevant content, the Agent may keep rewriting queries forever. Mitigation: (a) hard max_steps limit, (b) duplicate‑query detection via cosine similarity, (c) stop after two consecutive hops that add no new information.

def agentic_rag_safe(question, max_steps=6, sim_threshold=0.92):
    history, past_vecs = [], []  # past_vecs caches query embeddings
    for _ in range(max_steps):  # guard (a): hard step limit
        decision = llm_plan(question, history)
        if decision["action"] == "answer":
            return llm_generate(question, history)
        q = decision["query"]
        q_vec = embed(q)  # embed once instead of re-embedding every past query
        # guard (b): stop when the new query nearly duplicates an earlier one
        if any(cosine(q_vec, pv) > sim_threshold for pv in past_vecs):
            break
        past_vecs.append(q_vec)
        docs = retrieve(q, source=decision.get("source", "vector"))
        history.append({"query": q, "docs": docs, "reflection": llm_reflect(question, q, docs)})
    return llm_generate(question, history)  # best-effort answer from what was gathered
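
The code above covers mitigations (a) and (b) but not (c). A possible sketch of (c) follows, assuming that "new information" can be approximated by previously unseen document IDs; a stricter check might compare passage text instead. The helper name `should_stop` and the `patience` parameter are made up for this example.

```python
def should_stop(hop_doc_ids: list[list[str]], patience: int = 2) -> bool:
    """Return True when the most recent `patience` hops each returned
    only documents that earlier hops had already returned."""
    seen: set[str] = set()
    stale = 0                      # consecutive hops without a new document
    for ids in hop_doc_ids:
        new = set(ids) - seen
        stale = 0 if new else stale + 1
        seen |= set(ids)
    return stale >= patience

print(should_stop([["a", "b"], ["b"], ["a"]]))
print(should_stop([["a"], ["a"], ["c"]]))
```

Inside the loop, the input could be built from `history`, for example `[[d["source"] for d in h["docs"]] for h in history]`, with a `break` when the function returns True.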

Pitfall 2: Cost and latency explosion. A four‑hop query can trigger >10 model calls, turning milliseconds into seconds and inflating cloud bills. Solution: a lightweight complexity classifier at the entry point routes simple FAQ‑style queries to traditional RAG and only sends complex, multi‑hop cases to the Agentic pipeline.
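
A sketch of such an entry‑point router, using keyword and year heuristics as a placeholder for the classifier. The cue list, the thresholds, and the function names are assumptions for illustration; in production a small trained classifier would probably replace `is_complex`.

```python
import re

# Hypothetical cue words that hint at comparison or causal questions.
MULTI_HOP_CUES = ("compare", "versus", " vs ", "why", "trend", "change", "difference")

def is_complex(question: str) -> bool:
    """Flag questions that mention two or more years, or at least two cue words."""
    q = f" {question.lower()} "
    years = set(re.findall(r"\b(?:19|20)\d{2}\b", q))
    cue_hits = sum(cue in q for cue in MULTI_HOP_CUES)
    return len(years) >= 2 or cue_hits >= 2

def route(question: str) -> str:
    """Send complex questions to the Agentic loop, everything else to one-shot RAG."""
    return "agentic" if is_complex(question) else "traditional"

print(route("What are the office opening hours?"))
print(route("What was the claim ratio in 2023 compared to 2022 and why did it change?"))
```

The first question stays on the cheap path and the second is sent to the Agentic pipeline. Because misrouting a hard question produces a weak answer while misrouting an easy one only costs money, the thresholds are worth tuning toward the Agentic side when answer quality matters more than latency.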

Pitfall 3: Unreliable self‑reflection. The LLM may incorrectly judge noisy results as “sufficient”. Using an independent evaluator (as in CRAG) is more reliable than trusting the model’s own self‑assessment.

6. When to Use Agentic RAG vs. When It Is Over‑Engineering

Decision matrix (simplified):

Simple single‑hop FAQ → Traditional RAG.

Knowledge base with high noise → Traditional RAG + CRAG evaluator.

Multi‑hop, information‑aggregation queries → Multi‑hop Agentic RAG.

Scenarios demanding high answer fidelity (finance, medical) → Self‑RAG with source tagging.

High‑throughput, low‑latency production → Prefer Traditional RAG; avoid Agentic loops.

Capability vs. cost comparison: Traditional RAG vs. Agentic RAG

7. How to Answer an Agentic RAG Interview Question

Four‑step answer template:

State the paradigm difference (≈30 s): traditional RAG is a one‑way pipeline; Agentic RAG makes retrieval a repeatable tool with planning, reflection, and decision.

Describe the three concrete modes (≈40 s): Self‑RAG (reflection tokens), CRAG (lightweight evaluator), Multi‑hop ReAct (retrieval inside reasoning loop).

Give a concrete trace (≈30 s) using the 2023 vs 2022 claim‑ratio example.

Conclude with cost‑aware selection (≈20 s): use Agentic RAG only for multi‑hop or high‑trust scenarios; otherwise stick to traditional RAG and route simple queries away.

Typical follow‑up questions cover loop safety, routing criteria, and the choice between Self‑RAG and CRAG.

Conclusion

The core limitation of traditional RAG is its topology – a single, passive retrieval step. Agentic RAG resolves this by turning retrieval into an active tool that an Agent can call, reflect on, and decide to repeat, yielding a feedback‑driven loop. The three concrete patterns (Self‑RAG, CRAG, Multi‑hop ReAct) address different pain points, and practical production advice (loop guards, cost routing, evaluator reliability) ensures the approach is usable at scale.
