What a PRISMA Review Uncovers About Retrieval‑Augmented Generation (RAG)
This systematic review follows the PRISMA methodology and analyzes 128 highly‑cited RAG papers drawn from five major databases. It catalogues 343 datasets, lays out a detailed technical roadmap, surveys evaluation metrics from EM to LLM‑as‑Judge, and outlines future research directions. Its central finding is that RAG has evolved into a complex, programmable, and auditable distributed system.
1. Research Method: PRISMA 2020 Flowchart
Following the PRISMA 2020 guidelines, the authors identified 4,721 records and, after screening, retained 128 high‑impact papers for systematic analysis.
Figure 1: Literature screening flow – 4,721 records identified, 128 papers included.
2. Technical Panorama of Retrieval‑Augmented Generation (RAG)
The surveyed works are organized into progressive stages, each introducing key innovations and representative approaches:
Pre‑retrieval: structure‑aware chunking (chunk sizes ranging from 100 to 4,000 tokens), metadata enrichment, long retrieval units – e.g., Chunking.
Retrieval: hybrid retrieval combining BM25, dense vectors, and knowledge graphs; graph traversal; dynamic triggering – e.g., Hybrid Retrieval.
Post‑retrieval: re‑ranking, context compression, noise injection, token budgeting – e.g., Post Retrieval.
Iteration Control: reflection tokens and related signals that decide when to retrieve again (FLARE, RIND, Self‑RAG) – e.g., Self‑RAG.
Memory Enhancement: user‑level vector stores, dialogue cache, knowledge‑graph integration – e.g., Memory.
Multi‑Agent Systems: tool‑chain orchestration (RALLE, MEDRAG) and ReAct‑Chain – e.g., Agentic.
Efficiency Compression: token‑level representations (xRAG) and pipeline scheduling (PipeRAG) – e.g., Efficiency.
Multimodal Retrieval: joint image‑text retrieval (MuRAG, Wiki‑LLaVA) – e.g., Multimodal.
These stages illustrate the evolution from a simple retrieve‑then‑generate pipeline to a programmable, explainable, and auditable distributed system.
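To make the first three stages concrete, here is a minimal sketch in Python. It is not taken from the review: the function names are hypothetical, fixed-size overlapping windows stand in for structure-aware chunking, and reciprocal rank fusion (with the commonly used constant k=60) is one assumed way to combine a BM25 ranking with a dense-vector ranking. The token budget step simply keeps ranked documents until a context limit is reached.

```python
from collections import defaultdict


def chunk_tokens(tokens, size=200, overlap=50):
    """Pre-retrieval: split a token list into windows that overlap by `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def reciprocal_rank_fusion(rankings, k=60):
    """Retrieval: fuse ranked lists of doc ids; score = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def apply_token_budget(ranked_ids, token_counts, budget):
    """Post-retrieval: keep docs in rank order, skipping any that would exceed the budget."""
    kept, used = [], 0
    for doc_id in ranked_ids:
        cost = token_counts[doc_id]
        if used + cost <= budget:
            kept.append(doc_id)
            used += cost
    return kept


if __name__ == "__main__":
    bm25_ranking = ["a", "b", "c"]    # illustrative sparse-retriever output
    dense_ranking = ["b", "c", "d"]   # illustrative dense-retriever output
    fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
    sizes = {"a": 300, "b": 500, "c": 400, "d": 100}
    print(fused)
    print(apply_token_budget(fused, sizes, budget=700))
```

Rank-based fusion is used here because BM25 and cosine-similarity scores live on different scales; a weighted sum of normalized scores would be an equally valid design.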
3. Evaluation Metrics
Five metric families are commonly used to assess RAG systems:
Retrieval : Recall@k, MAP@k, Hit@k – measure how well the retrieval component surfaces and ranks relevant documents.
Generation : Exact Match (EM), F1, BLEU, ROUGE, BERTScore – evaluate textual quality of generated answers.
Hallucination : Support, Hallucination Rate, RAGTruth – assess factual consistency.
Human Evaluation : correctness, relevance, user satisfaction – capture subjective user experience.
LLM‑as‑Judge : GPT‑4 scoring, G‑EVAL, SelfCheckGPT – scalable model‑based evaluation.
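The retrieval and generation metrics above have simple closed-form definitions. The sketch below shows one common way to compute Recall@k, Hit@k, Exact Match, and token-level F1. The answer normalization (lowercasing, stripping punctuation and English articles) follows the convention popularized by SQuAD-style evaluation scripts and is an assumption here; the review does not prescribe a specific implementation, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear among the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_at_k(retrieved, relevant, k):
    """1.0 if at least one relevant id is among the top-k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    retrieved = ["d1", "d2", "d3", "d4"]
    relevant = ["d2", "d4", "d9"]
    print(recall_at_k(retrieved, relevant, k=3))
    print(hit_at_k(retrieved, relevant, k=3))
    print(exact_match("The Eiffel Tower.", "eiffel tower"))
    print(token_f1("Paris, France", "Paris"))
```

In practice these per-example scores are averaged over the whole evaluation set, and for questions with several acceptable answers the maximum score over all references is usually taken.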
4. Representative Datasets
The review catalogues 343 datasets; a subset of frequently used resources is listed below:
Natural Questions (NQ) – 323 k samples, open‑domain QA, cited 27 times.
HotPotQA – 113 k samples, multi‑hop QA, cited 26 times.
Wikipedia – 6 M articles, general‑purpose corpus, cited 19 times.
MS MARCO – 1 M passages, retrieval + QA, cited 8 times.
StrategyQA – 2.8 k samples, implicit reasoning, cited 8 times.
These datasets span a wide range of domains and scales, providing a comprehensive data landscape for RAG research.
Source: A Systematic Literature Review of Retrieval‑Augmented Generation: Techniques, Metrics, and Challenges, https://arxiv.org/pdf/2508.06401.pdf
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
