How Agentic‑R Boosts Multi‑Turn Retrieval for LLMs by 2–3 EM Points
This article analyzes the Agentic‑R framework, which upgrades traditional single‑hop Retrieval‑Augmented Generation with dual‑perspective scoring and a bidirectional flywheel, yielding average gains of 2–3 absolute EM points across seven QA datasets and a 10–15% reduction in search rounds.
Background: The Single‑Hop Ceiling of Traditional RAG
Traditional Retrieval‑Augmented Generation (RAG) follows a "one retrieval + one generation" pattern, which fails on multi‑hop reasoning tasks (e.g., answering "Is A older than B?" requires first retrieving A's birth year, then B's). A single retrieval step cannot chain these facts, and any early retrieval error propagates into the final answer.
Solution: Dual‑Perspective Scoring + Bidirectional Flywheel
Researchers from the Gaoling School of Artificial Intelligence at Renmin University and Baidu propose a new training framework for intelligent search that goes beyond the local‑paragraph utility focus of classic RAG. It jointly evaluates:
Local query‑paragraph relevance
Global answer correctness
Together, these two signals capture a paragraph's true usefulness in multi‑round search.
How Is Training Data Generated?
Given an agent trajectory T = {t₁, q₁, D₁, …, tₙ, A}, for each intermediate query qᵢ we retrieve the top‑20 candidate paragraphs pᵢ,₁ … pᵢ,₂₀ from the corpus and assign two scores:
Local Relevance (LR) : Using Qwen2.5‑72B to produce a list‑wise relevance score (0–100) that rewards paragraphs which directly answer qᵢ. If a "sub‑answer" for qᵢ can be inferred, it is fed to the LLM as a reference to reduce hallucination.
Global Answer Correctness (GAC) : Insert paragraph pᵢ,ⱼ back into the agent, let it complete all remaining rounds, and check whether the final answer matches the ground truth (Exact Match = 1/0). This upgrades "locally useful" to "globally correct" and filters out highly similar but misleading passages.
Ranking Rule : First sort by GAC descending, then by LR descending. The top‑1 entry with GAC = 1 and LR ≥ 60 is treated as the positive example; the next‑ranked candidates serve as negatives, yielding 16 samples (one positive plus fifteen negatives) per query.
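The labeling and ranking rules above can be sketched as follows. `score_local_relevance` and `answer_is_correct` are hypothetical stand‑ins for the Qwen2.5‑72B list‑wise scorer and the roll‑out Exact‑Match check; the paper's exact interfaces may differ:

```python
def build_training_samples(candidates, score_local_relevance, answer_is_correct,
                           lr_threshold=60, num_samples=16):
    """candidates: the top-20 paragraphs retrieved for one intermediate query.

    Returns (positive, negatives) if a qualifying positive exists, else None.
    """
    scored = []
    for p in candidates:
        lr = score_local_relevance(p)            # list-wise relevance, 0-100
        gac = 1 if answer_is_correct(p) else 0   # roll out remaining rounds, EM check
        scored.append((p, lr, gac))
    # Sort by GAC descending first, then LR descending.
    scored.sort(key=lambda x: (x[2], x[1]), reverse=True)
    top_p, top_lr, top_gac = scored[0]
    if top_gac == 1 and top_lr >= lr_threshold:
        positive = top_p
        negatives = [p for p, _, _ in scored[1:num_samples]]
        return positive, negatives
    return None  # no qualifying positive: the query is discarded
```

The key design choice is that GAC dominates the sort: a paragraph that merely looks relevant but derails the final answer can never outrank one that leads to a correct answer.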
How Is the Model Trained?
Input : Original question Q concatenated with current query qᵢ using [SEP]. No historical queries are added, as experiments show they introduce noise.
Loss : Contrastive learning with in‑batch and cross‑GPU negative samples, temperature = 0.01.
Initialization : Warm‑start from E5‑base, train for 2 epochs with learning rate 2e‑5.
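A minimal NumPy sketch of the in‑batch contrastive objective (InfoNCE with temperature 0.01), assuming L2‑normalized embeddings of `Q [SEP] qᵢ` queries and their paragraphs; in the actual setup negatives are additionally gathered across GPUs:

```python
import numpy as np

def in_batch_contrastive_loss(query_emb, passage_emb, temperature=0.01):
    """InfoNCE with in-batch negatives: query i's positive is passage i,
    and every other passage in the batch is a negative.

    query_emb, passage_emb: (B, d) L2-normalized embedding matrices.
    """
    logits = query_emb @ passage_emb.T / temperature   # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy against the diagonal (the matching query-passage pairs).
    return -np.mean(np.diag(log_probs))
```

With temperature 0.01, similarity gaps are magnified 100x, so the model is pushed hard to separate the positive from even near‑duplicate negatives.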
How Does the Flywheel Operate?
In round k, use the previous Agentic‑Rₖ₋₁ as the environment and train a stronger Agentₖ with PPO.
Generate new trajectories with Agentₖ to construct higher‑quality training data.
Train the next retrieval model Agentic‑Rₖ on this data.
Repeat this two‑step cycle until performance converges.
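The flywheel loop above can be sketched as follows; the four callables (`train_agent_with_ppo`, `collect_trajectories`, `build_retriever_training_data`, `train_retriever`) are hypothetical stand‑ins for the steps described above, not APIs from the paper:

```python
def run_flywheel(agent, retriever,
                 train_agent_with_ppo, collect_trajectories,
                 build_retriever_training_data, train_retriever,
                 num_rounds=2):
    """Alternate agent and retriever training for num_rounds iterations."""
    for _ in range(num_rounds):
        # 1. Freeze the current retriever as the environment; improve the agent via PPO.
        agent = train_agent_with_ppo(agent, retriever)
        # 2. Use the stronger agent to generate higher-quality trajectories.
        trajectories = collect_trajectories(agent, retriever)
        # 3. Label the trajectories and retrain the retrieval model on them.
        data = build_retriever_training_data(trajectories)
        retriever = train_retriever(retriever, data)
    return agent, retriever
```

Note that `num_rounds` defaults to 2, matching the reported observation that gains saturate after two iterations.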
Agentic‑R Uses Two Iterations to Give the Retriever "Foresight"
Across seven datasets and three different search agents, Agentic‑R achieves an average gain of 2–3 absolute EM points and reduces the number of search rounds by 10–15%. Performance saturates after two iterations; further training leads to slight degradation.
One Figure Explains Why E5 Fails
E5 treats the phrase "Get Shorty" as the title of a third movie, leading it to retrieve irrelevant "honky‑tonk" passages. In contrast, Agentic‑R directly locks onto "Urban Cowboy" + "Gilley's Club", reaching the correct answer about Mickey Gilley in a single step.
