When RAG Retrieves the Right Docs but Still Answers Wrong: Insights from Saarland University (ACL 2026)
The article explains why conventional Retrieval‑Augmented Generation often produces incorrect answers despite retrieving relevant documents, introduces the Disco‑RAG framework that adds a structured reading step using argument trees and relation graphs, and shows how this three‑step approach dramatically improves performance on long‑document and ambiguous‑question benchmarks without any model training.
Why Conventional RAG Gets Answers Wrong
RAG (Retrieval‑Augmented Generation) is now a standard technique for deploying large language models, but users keep running into a critical flaw: even when the system retrieves the right documents, the generated answer can still be nonsensical. The authors attribute this to the model’s inability to actually understand the retrieved passages.
In a toy example, a user asks “Can vitamin D supplementation prevent flu?” and the system returns two paragraphs:
Paragraph A: “In winter, adults with low vitamin D levels see a 12% reduction in flu incidence after supplementation.”
Paragraph B: “Large‑scale randomized trials found no statistically significant link between vitamin D supplementation and flu risk.”
Traditional RAG concatenates A and B and feeds them to the model. The model latches onto the phrase “12% reduction” and answers that vitamin D is effective, ignoring both the qualifying conditions (winter, adults with low baseline levels) and the contradictory evidence in B. The authors identify two blind spots (a minimal sketch of this failure mode follows the list):
Within‑paragraph hierarchy: the model cannot distinguish conclusions from premises.
Between‑paragraph relations: the model cannot tell whether paragraphs support, contradict, or complement each other.
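To make the failure concrete, here is a minimal Python sketch of the vanilla pipeline. The `llm` callable and prompt wording are illustrative stand-ins rather than the paper’s code; the point is that flat concatenation discards exactly the structure listed above.

```python
# Minimal sketch of vanilla RAG (illustrative, not the authors' code).
# `llm` stands in for any chat-completion call that maps a prompt to text.

def vanilla_rag_answer(llm, query: str, passages: list[str]) -> str:
    """Concatenate retrieved passages and generate directly.

    Nothing marks which clauses are conclusions vs. conditions, or that
    passage B contradicts passage A, so the model is free to latch onto
    a salient phrase like "12% reduction".
    """
    context = "\n\n".join(passages)  # flat concatenation: all structure is lost
    prompt = (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return llm(prompt)
```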
Existing Remedies Focus Only on Retrieval
Prior work has tried re‑ranking results, query rewriting, redundancy reduction, and multi‑turn retrieval. These methods improve the “search” stage but still assume that a better set of passages automatically yields better answers, an assumption that breaks down when the passages carry complex logical structure.
Disco‑RAG: Adding a “Read‑Understand” Stage
Disco‑RAG inserts a reading‑comprehension layer between retrieval and generation, drawing on classic Rhetorical Structure Theory (RST) to expose discourse relations. The pipeline consists of three parameter‑free steps (a code sketch follows the list):
Argument‑Tree Construction: An LLM breaks each paragraph into elementary discourse units (EDUs), tags each unit as core or auxiliary, and identifies intra‑paragraph relations such as cause, contrast, or elaboration.
Relation‑Graph Building: All retrieved paragraphs are pairwise compared; the system predicts whether the pair is supporting, contradicting, supplementing, or unrelated, and creates a directed graph (e.g., marking A and B as “contrast”).
Outline Generation and Answer Writing: Using the user query, the original passages, the argument trees, and the relation graph, Disco‑RAG first produces a structured outline that lists key evidence, ordering, and how to reconcile contradictions. The model then writes the final answer guided by this outline.
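The sketch below shows one plausible way to realize all three steps with nothing but prompting, consistent with the parameter‑free description. The `llm` callable, the prompt wording, and the JSON schemas are our own assumptions for illustration; the paper’s exact prompts and output formats may differ.

```python
import itertools
import json

def build_argument_tree(llm, paragraph: str) -> dict:
    """Step 1: segment a paragraph into elementary discourse units (EDUs),
    tag each as core or auxiliary, and label intra-paragraph relations."""
    prompt = (
        "Segment the paragraph into elementary discourse units. Mark each "
        "unit as 'core' or 'auxiliary' and name its relation to the core "
        "(e.g. cause, contrast, elaboration, condition). Reply as JSON: "
        '{"edus": [{"text": ..., "role": ..., "relation": ...}]}\n\n'
        f"Paragraph:\n{paragraph}"
    )
    # Assumes the model returns valid JSON; real code would validate/retry.
    return json.loads(llm(prompt))

def build_relation_graph(llm, paragraphs: list[str]) -> list[tuple[int, int, str]]:
    """Step 2: classify every paragraph pair as supporting, contradicting,
    supplementing, or unrelated; keep only the informative edges."""
    edges = []
    for i, j in itertools.combinations(range(len(paragraphs)), 2):
        prompt = (
            "How does paragraph 2 relate to paragraph 1? Answer with one "
            "word: supporting, contradicting, supplementing, or unrelated.\n\n"
            f"Paragraph 1:\n{paragraphs[i]}\n\nParagraph 2:\n{paragraphs[j]}"
        )
        label = llm(prompt).strip().lower()
        if label != "unrelated":
            edges.append((i, j, label))
    return edges

def disco_rag_answer(llm, query: str, paragraphs: list[str]) -> str:
    """Step 3: draft a structured outline from the trees and the graph,
    then write the final answer guided by that outline."""
    trees = [build_argument_tree(llm, p) for p in paragraphs]
    graph = build_relation_graph(llm, paragraphs)
    outline = llm(
        "Given the question, the passages, their argument trees, and the "
        "inter-passage relations, write an outline listing the key "
        "evidence, its ordering, and how to reconcile contradictions.\n\n"
        f"Question: {query}\nPassages: {json.dumps(paragraphs)}\n"
        f"Argument trees: {json.dumps(trees)}\nRelations: {json.dumps(graph)}"
    )
    return llm(f"Write the final answer.\nQuestion: {query}\nOutline:\n{outline}")
```

On the vitamin‑D example, step 2 would tag the A–B pair as contradicting, and the outline would have to state the reconciliation (an effect only for low‑level adults in winter) before the answer is written.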
Benchmark Results
Disco‑RAG was evaluated on three established benchmarks, all without any fine‑tuning:
Long‑Document Reasoning (Loong): Documents range from 10k to 250k tokens. Performance gaps widen with length; at 250k tokens, conventional RAG nearly fails while Disco‑RAG still produces useful answers, even surpassing methods that require dedicated training.
Ambiguous Question Answering (ASQA): Disco‑RAG sets new state‑of‑the‑art scores on all core metrics, and small‑parameter models achieve results comparable to larger, specially designed systems.
Scientific News Summarization (SciNews): On the task of turning academic papers into lay‑reader news articles, Disco‑RAG ranks first on three of four metrics and second on factual consistency.
Component Ablation
Removing any of the three modules (argument tree, relation graph, outline) degrades performance, confirming that each contributes a distinct role. Adding a generic planning step without structural information yields only marginal gains, whereas the full “tree + graph” combination drives the major improvements.
Robustness to Noise and Granularity
When irrelevant passages replace many retrieved results or when paragraph segmentation granularity changes dramatically, conventional RAG’s performance fluctuates wildly, while Disco‑RAG remains stable.
Practical Deployment
The three modules are decoupled from the final generator, allowing mixed‑size model deployment. In the experiments, a small 8B Llama‑3.1 model handled all structural analysis while a 70B Llama‑3.3 model performed only the final generation. Even an all‑8B configuration outperformed a 70B vanilla RAG, demonstrating cost‑effective scalability.
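As a rough illustration of such a split, the configuration sketch below binds each stage to a differently sized model behind an OpenAI‑compatible endpoint (a common way to serve Llama models, e.g. via vLLM). The endpoint URL and the routing are assumptions for illustration; only the model sizes mirror the experimental setup.

```python
# Hypothetical mixed-size deployment: small model for structure, large model
# for generation. Endpoint and routing are illustrative, not released code.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def make_llm(model: str):
    """Return an llm(prompt) -> str callable bound to one model size."""
    def llm(prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    return llm

# The cheap 8B model runs steps 1-2 (argument trees, relation graph);
# the 70B model only writes the outline and the final answer.
structure_llm = make_llm("meta-llama/Llama-3.1-8B-Instruct")
generator_llm = make_llm("meta-llama/Llama-3.3-70B-Instruct")
```

Combined with the earlier pipeline sketch, `structure_llm` would be passed to `build_argument_tree` and `build_relation_graph`, and `generator_llm` to the outline and answer calls.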
Combining with Fine‑Tuning
On the SciNews task, an untrained Disco‑RAG already beats a fine‑tuned vanilla RAG, showing the intrinsic value of discourse structure. When fine‑tuning is added on top of Disco‑RAG, performance improves further, indicating that structural cues and model adaptation are complementary.
Takeaway
Instead of endlessly optimizing the retrieval component, the authors argue that teaching the model to “read” the retrieved text—by exposing paragraph hierarchy and inter‑paragraph discourse—yields far larger gains. This insight applies not only to RAG but also to broader multi‑document reasoning and long‑text understanding tasks, offering a lightweight, plug‑and‑play enhancement for teams deploying retrieval‑augmented systems.
Paper: Disco‑RAG: Discourse‑Aware Retrieval‑Augmented Generation
Link: https://arxiv.org/abs/2601.04377