What Are the Best Practices for Retrieval‑Augmented Generation (RAG)?

This comprehensive study evaluates various components of Retrieval‑Augmented Generation pipelines—including query classification, chunking, embedding models, vector databases, retrieval, re‑ranking, summarization, and generator fine‑tuning—identifies optimal configurations, and proposes best‑practice guidelines for both performance‑maximizing and efficiency‑balanced RAG systems.

Baobao Algorithm Notes

Dec 15, 2024

Abstract

Retrieval‑augmented generation (RAG) excels at integrating up‑to‑date information, reducing hallucinations, and improving response quality, especially in specialized domains. Existing RAG methods often suffer from complex implementation and long latency. This paper surveys current RAG approaches, experiments with numerous configurations, and proposes deployment strategies that balance performance and efficiency. Multimodal retrieval techniques are also shown to boost visual question answering.

RAG Workflow

The workflow consists of multiple modules—query classification, chunking, embedding, vector store, retrieval, re‑ranking, document re‑wrapping, summarization, and generator fine‑tuning. For each module we review common methods and select default and alternative approaches for the final pipeline.

Query Classification

Not every query requires retrieval; large language models (LLMs) can answer many directly. A classifier decides whether a query needs retrieval, that is, whether the required knowledge goes beyond what the model holds in its parameters or what the user has already supplied. Fifteen task types are labeled as “sufficient” (no retrieval needed) or “insufficient” (retrieval needed), and a classifier trained on these labels automates the decision.
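The routing step can be sketched as a gate in front of the retriever. This is a minimal illustration, not the study's classifier: `toy_classifier`, `toy_retrieve`, and `toy_generate` are hypothetical stand-ins, and the keyword rule is an assumption used only to make the example runnable.

```python
from typing import Callable


def answer(
    query: str,
    needs_retrieval: Callable[[str], bool],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> str:
    """Call the retriever only when the classifier marks the query as 'insufficient'."""
    context = retrieve(query) if needs_retrieval(query) else []
    return generate(query, context)


def toy_classifier(query: str) -> bool:
    # Hypothetical rule: tasks that only transform user-supplied text are "sufficient".
    return not query.lower().startswith(("translate", "summarize", "rewrite"))


def toy_retrieve(query: str) -> list[str]:
    # Stand-in for a real retriever.
    return [f"passage about: {query}"]


def toy_generate(query: str, context: list[str]) -> str:
    # Stand-in for an LLM call; reports how many passages were supplied.
    return f"answer({query!r}, {len(context)} passages)"
```

In a real system the keyword rule would be replaced by a trained classifier; the gate itself stays the same.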

Chunking

Chunk Size

Chunk size strongly influences performance. Larger chunks provide more context but increase latency; smaller chunks improve recall and speed but may lose context. We balance fidelity and relevance using LlamaIndex evaluation metrics, with text‑embedding‑ada‑002 for embeddings, Zephyr‑7B‑alpha as the generation model, and GPT‑3.5‑Turbo as the evaluation model. A chunk overlap of 20 tokens is used.
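Fixed-size chunking with overlap can be sketched as follows. The example assumes the text is already tokenized (whitespace splitting is enough to try it out; a production pipeline would use the embedding model's tokenizer), and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int = 20) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the next window."""
    if overlap >= chunk_size:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With 1,000 tokens, a chunk size of 512, and an overlap of 20, this yields two chunks, and the last 20 tokens of the first chunk are the first 20 tokens of the second.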

Chunking Techniques

Advanced methods such as “small‑to‑large” and sliding windows organize chunk relationships. Small chunks match queries; larger overlapping chunks provide context. Experiments with LLM‑Embedder (175‑token small, 512‑token large, 20‑token overlap) demonstrate improved retrieval quality.
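One way to read the small-to-large idea is as a parent lookup: small chunks are what the query is matched against, and each small chunk points to the larger chunk it came from. The sketch below uses the sizes quoted above (175, 512, and 20 tokens); the function names and the choice to cut small chunks out of each large chunk are assumptions for illustration.

```python
def window(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Sliding window over tokens; consecutive windows share `overlap` tokens."""
    step = size - overlap
    out = []
    for start in range(0, len(tokens), step):
        out.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return out


def build_small_to_large(tokens, small=175, large=512, overlap=20):
    """Return the large chunks plus an index of (small_chunk, parent_id) pairs."""
    parents = window(tokens, large, overlap)
    index = []
    for parent_id, parent in enumerate(parents):
        for child in window(parent, small, overlap):
            index.append((child, parent_id))
    return parents, index


def expand(hit_id: int, parents, index):
    """Given the position of a matched small chunk, return its large parent chunk."""
    return parents[index[hit_id][1]]
```

The retriever would embed and search only the small chunks in `index`, then pass the output of `expand` to the generator.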

Embedding Model Selection

Using FlagEmbedding’s evaluation on the MSMARCO datasets, LLM‑Embedder achieves comparable results to BAAI/bge‑large‑en while being one‑third the size, making it the preferred embedding model.

Metadata Augmentation

Adding metadata such as titles, keywords, and hypothesized questions to chunks can further improve retrieval and downstream processing; future work will explore this in depth.

Vector Database

Vector stores hold embeddings and metadata, offering various index types and hybrid (vector + keyword) search. Five open‑source options—Weaviate, Faiss, Chroma, Qdrant, and Milvus—are compared; Milvus consistently outperforms the others on scalability and feature coverage.

Retrieval Methods

Given a user query, the retriever selects the most similar documents using similarity scores. Three query‑transformation strategies are evaluated with LLM‑Embedder as encoder:

Query Rewriting : reformulate the query for better matching.

Query Decomposition : break the query into sub‑questions and retrieve for each.

Pseudo‑Document Generation : create a hypothetical document from the query and retrieve texts similar to it; HyDE (Hypothetical Document Embeddings) is the representative method.

Hybrid search combining BM25 (sparse) and Contriever (dense) yields strong performance; HyDE‑augmented hybrid is recommended as the default.
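Hybrid search needs a rule for combining sparse and dense scores. The sketch below shows one common option, a weighted sum of min-max normalized scores. The normalization, the weight `alpha`, and its default value are illustrative assumptions and are not taken from this article; the input dictionaries stand in for scores produced by a BM25 index and a dense encoder.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the range [0, 1]; constant inputs map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(sparse: dict[str, float], dense: dict[str, float],
                alpha: float = 0.5, top_k: int = 3) -> list[str]:
    """Rank documents by alpha * sparse + (1 - alpha) * dense after normalization."""
    s, d = min_max(sparse), min_max(dense)
    doc_ids = set(s) | set(d)
    fused = {i: alpha * s.get(i, 0.0) + (1 - alpha) * d.get(i, 0.0) for i in doc_ids}
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused, key=lambda i: (-fused[i], i))[:top_k]
```

Setting `alpha` to 1.0 reduces this to sparse-only ranking and 0.0 to dense-only ranking, which makes it easy to compare the three variants on the same candidates.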

Re‑ranking Methods

After initial retrieval, re‑ranking improves relevance. Two approaches are considered:

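DLM Re‑ranking : a deep language model fine‑tuned to classify each document as relevant or not relevant to the query; documents are ranked by the predicted relevance. MonoT5 is one example.

TILDE Re‑ranking : predicts the likelihood of each query term independently over the language model’s vocabulary, then scores a document by summing the precomputed log‑probabilities of the query tokens. TILDEv2 is a faster variant.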
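The TILDE-style scoring rule can be illustrated with a toy example. It assumes each candidate document already has a precomputed table of token log-probabilities (in practice produced by the model at indexing time); the table contents, the `floor` value for missing tokens, and the function names are hypothetical.

```python
def tilde_style_score(query_tokens: list[str],
                      doc_token_logprobs: dict[str, float],
                      floor: float = -20.0) -> float:
    """Sum the document's precomputed log-probabilities for each query token."""
    # Tokens absent from the table receive the (assumed) floor log-probability.
    return sum(doc_token_logprobs.get(tok, floor) for tok in query_tokens)


def rerank(query_tokens: list[str],
           candidates: dict[str, dict[str, float]],
           floor: float = -20.0) -> list[str]:
    """Order candidate document ids by descending query likelihood."""
    scored = {
        doc_id: tilde_style_score(query_tokens, table, floor)
        for doc_id, table in candidates.items()
    }
    return sorted(scored, key=lambda doc_id: (-scored[doc_id], doc_id))
```

Because the per-document tables are computed offline, query-time work is limited to lookups and additions, which is why this family of re-rankers is attractive when latency matters.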

Document Re‑wrapping

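Document order affects downstream generation. Three re‑wrapping strategies are tested: forward (documents in descending order of relevance), reverse (ascending order of relevance), and edge (the most relevant documents placed at the beginning and end of the input). The edge option is motivated by the observation that models use information best when it appears at the start or end of the context.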
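The three orderings can be sketched on a list that is already sorted from most to least relevant. The alternating rule used for the edge strategy is one plausible reading, stated here as an assumption; `repack` is a hypothetical helper name.

```python
def repack(docs_by_relevance: list[str], strategy: str) -> list[str]:
    """Reorder documents that arrive sorted from most to least relevant."""
    docs = list(docs_by_relevance)
    if strategy == "forward":
        return docs  # most relevant first
    if strategy == "reverse":
        return docs[::-1]  # most relevant last, next to the query
    if strategy == "edge":
        # Alternate documents between the head and the tail so the
        # least relevant ones end up in the middle.
        head, tail = [], []
        for rank, doc in enumerate(docs):
            (head if rank % 2 == 0 else tail).append(doc)
        return head + tail[::-1]
    raise ValueError(f"unknown strategy: {strategy}")
```

For five documents ranked d1 to d5, the edge ordering is d1, d3, d5, d4, d2, so the two most relevant documents sit at the two ends of the prompt.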

Summarization

Long retrieved contexts can hinder generation. Both extractive (sentence scoring) and abstractive (multi‑document synthesis) summarizers are evaluated on NQ, TriviaQA, and HotpotQA. Recomp is recommended as the default for its quality; LongLLMLingua scores lower but generalizes better, since it was not trained on these evaluation datasets.
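To make the extractive idea concrete, the sketch below scores each sentence by how many query terms it contains and keeps the top few in their original order. This term-overlap scorer is a deliberately simple stand-in, not Recomp or LongLLMLingua, and the function name is hypothetical.

```python
import re


def extractive_summary(query: str, text: str, keep: int = 2) -> str:
    """Keep the `keep` sentences that share the most terms with the query."""
    # Naive sentence splitting on ., !, and ? (an assumption; real text needs more care).
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]?", text) if s.strip()]
    query_terms = set(re.findall(r"\w+", query.lower()))
    scores = [
        len(query_terms & set(re.findall(r"\w+", s.lower())))
        for s in sentences
    ]
    # Pick the highest-scoring sentences, then restore document order.
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:keep]
    return " ".join(sentences[i] for i in sorted(top))
```

A learned compressor replaces the overlap count with a trained relevance model, but the select-then-concatenate shape is the same.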

Generator Fine‑tuning

Fine‑tuning focuses on the generator while keeping the retriever fixed. Backgrounds are built from two kinds of documents, dgold (query‑relevant documents) and drandom (random documents), and are categorized as:

Dg : only relevant background.

Dr : one random document.

Dgr : one relevant + one random document.

Dgg : two copies of a relevant document.

The base model is Llama‑2‑7B. Each fine‑tuned variant is named after its training background (for example, Mgr is trained on Dgr). Mgr, which mixes relevant and random backgrounds, achieves the highest robustness and performance across QA benchmarks.
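The four training backgrounds can be assembled with a small helper. This is a sketch under stated assumptions: `build_context` is a hypothetical name, documents are plain strings, and the random document is drawn from the corpus excluding the gold document.

```python
import random


def build_context(gold: str, corpus: list[str], kind: str, rng: random.Random) -> list[str]:
    """Assemble the background for one training example."""
    def sample_random() -> str:
        # Draw a document that is not the gold one (assumed sampling rule).
        return rng.choice([doc for doc in corpus if doc != gold])

    if kind == "Dg":
        return [gold]
    if kind == "Dr":
        return [sample_random()]
    if kind == "Dgr":
        return [gold, sample_random()]
    if kind == "Dgg":
        return [gold, gold]
    raise ValueError(f"unknown background type: {kind}")
```

Passing an explicit `random.Random` instance keeps the sampled backgrounds reproducible across runs.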

Finding the Best RAG Practices

Using the default modules identified in the RAG Workflow section above, we optimize one module at a time, fixing the best option for each before moving to the next. Experiments with a Milvus store (10 M English Wikipedia passages + 4 M medical texts) and a fine‑tuned Llama‑2‑7B‑Chat model also reveal the impact of removing query classification, re‑ranking, or summarization.

Comprehensive Evaluation

Extensive experiments cover commonsense reasoning, fact‑checking, open‑domain QA, multi‑hop QA, and medical QA. Task performance is measured with accuracy, token‑level F1, and exact match. RAG capability is measured with five metrics: fidelity, background relevance, answer relevance, answer correctness, and retrieval similarity (cosine similarity between retrieved and gold documents). The overall RAG score is the average of these five RAG metrics.
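The overall score is a plain average, sketched below. Treating the five capabilities as fidelity, background relevance, answer relevance, answer correctness, and cosine similarity to gold documents is an interpretation of the list above; the key names and the example values are hypothetical.

```python
from statistics import fmean

# Assumed names for the five RAG capability metrics, each on a 0-1 scale.
RAG_METRICS = (
    "fidelity",
    "background_relevance",
    "answer_relevance",
    "answer_correctness",
    "retrieval_similarity",
)


def rag_score(metrics: dict[str, float]) -> float:
    """Average the five RAG capability metrics into a single score."""
    missing = [name for name in RAG_METRICS if name not in metrics]
    if missing:
        raise KeyError(f"missing metrics: {missing}")
    return fmean([metrics[name] for name in RAG_METRICS])
```

Task metrics such as accuracy, F1, and exact match are reported separately and are not part of this average in the sketch.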

Results and Analysis

Query classification improves accuracy and reduces latency.

Hybrid with HyDE yields the highest RAG score but is computationally expensive; Hybrid or Original are recommended for efficiency.

Re‑ranking is crucial; MonoT5 provides the best relevance boost.

Reverse document re‑wrapping places relevant context near the query, improving results.

Recomp remains the preferred summarizer despite the latency trade‑off.

Implementation Best Practices

Performance‑maximizing : include query classification, use Hybrid with HyDE retrieval, MonoT5 re‑ranking, reverse re‑wrapping, and Recomp summarization.

Efficiency‑balanced : include query classification, Hybrid retrieval, TILDEv2 re‑ranking, reverse re‑wrapping, and Recomp summarization.
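The two recipes differ in only two modules, which is easy to see when they are written down as configuration. The dataclass, field names, and string labels below are hypothetical and only mirror the choices listed above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RagRecipe:
    query_classification: bool
    retrieval: str
    reranker: str
    rewrapping: str
    summarizer: str


PERFORMANCE = RagRecipe(True, "hybrid+hyde", "monot5", "reverse", "recomp")
BALANCED = RagRecipe(True, "hybrid", "tildev2", "reverse", "recomp")


def diff(a: RagRecipe, b: RagRecipe) -> dict[str, tuple]:
    """Return the fields on which two recipes disagree."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if getattr(a, f.name) != getattr(b, f.name)
    }
```

Calling `diff(PERFORMANCE, BALANCED)` returns only the retrieval and re-ranking entries, the two places where the efficiency-balanced recipe swaps in cheaper components.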

Multimodal Extension

RAG is extended to multimodal scenarios by adding text‑to‑image and image‑to‑text retrieval over large paired image‑text datasets. Grounding outputs in retrieved pairs helps keep multimodal content factual and detailed, and the approach is planned to expand to video and speech modalities.

Conclusion

The study systematically identifies optimal practices for implementing RAG, evaluates each module’s alternatives, and proposes a comprehensive benchmark for future research. Findings deepen understanding of RAG systems and lay groundwork for further advances.

Limitations

Experiments are limited by the high cost of building large vector stores and the focus on a subset of chunking techniques. Future work will explore joint retriever‑generator training, broader chunking methods, and additional modalities such as audio and video.
