What Is Self‑RAG? A Simple Guide to Self‑Reflective Retrieval‑Augmented Generation

This article explains the motivation behind Self‑RAG, describes its core workflow—including conditional retrieval, enhanced generation, and self‑evaluation tokens—details the four evaluation metrics (Retrieve, IsRel, IsSup, IsUse), and provides a Python scoring example using log‑probabilities.


Why Self‑RAG?

Classic Retrieval‑Augmented Generation (RAG) often suffers from over‑retrieval and inconsistent outputs because it blindly retrieves the top‑K documents for every query, which can introduce irrelevant or contradictory information.

Self‑RAG was proposed by researchers from the University of Washington, the Allen Institute for AI, and IBM Research AI to give the language model the ability to decide whether retrieval is needed and to self‑assess its answers, thereby improving accuracy and efficiency.

Self‑RAG Workflow

The process consists of four steps:

1. Retrieval Judgment: the model first decides whether external knowledge is required.

2. On‑Demand Retrieval: if needed, the system retrieves the most relevant top‑K documents; otherwise the model answers directly.

3. Enhanced Generation: retrieved snippets are combined with the original query to form a prompt, producing K candidate answers.

4. Scoring & Selection: each candidate is scored using the four metrics and the highest‑scoring answer is returned.
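The four steps above can be sketched in pseudocode. Everything here is a stand‑in: `needs_retrieval`, `generate`, `score`, and the `retriever` interface are hypothetical names used for illustration, not part of any official Self‑RAG API.

```python
# Hypothetical sketch of the four-step Self-RAG loop. The model and
# retriever objects are assumed interfaces, not a real library API.
def self_rag_answer(query, model, retriever, k=3):
    # Step 1: retrieval judgment -- does this query need external knowledge?
    if model.needs_retrieval(query):
        # Step 2: on-demand retrieval of the top-K documents
        docs = retriever.search(query, top_k=k)
        # Step 3: enhanced generation -- one candidate answer per snippet
        candidates = [model.generate(query, context=d) for d in docs]
    else:
        # No retrieval needed: answer directly from parametric knowledge
        candidates = [model.generate(query)]
    # Step 4: scoring & selection -- return the highest-scoring candidate
    return max(candidates, key=model.score)
```

The key design point is that retrieval happens per query (and the paper even allows per segment), rather than unconditionally for every input as in classic RAG.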

Self‑RAG workflow diagram

Four Evaluation Metrics

Self‑RAG introduces four token‑based metrics that the model emits during generation:

Retrieve: indicates whether retrieval is required ([No Retrieval], [Retrieval], [Continue to Use Evidence]).

IsRel (Knowledge Relevance): shows whether the retrieved knowledge is relevant to the question ([Relevant] or [Irrelevant]).

IsSup (Response Support): measures how well the answer is supported by the retrieved knowledge ([Fully supported], [Partially supported], [No support / Contradictory]).

IsUse (Response Utility): rates the usefulness of the answer on a 1–5 scale ([Utility:x], where x ∈ {1, …, 5}).

These tokens are inserted directly into the model’s output, allowing downstream logic to quantify answer quality.
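Because the tokens appear inline in the generated text, downstream code can extract them with simple pattern matching. The example output string below is illustrative, loosely following the format shown in the Self‑RAG paper; the exact surface form of a real model's output may differ.

```python
import re

# Illustrative answer with inline reflection tokens (made-up example)
output = ("[Retrieval]<p>...</p>[Relevant]The Eiffel Tower is in Paris."
          "[Fully supported][Utility:5]")

# Pull out every bracketed reflection token so downstream logic can score it
tokens = re.findall(r"\[[^\]]+\]", output)
print(tokens)
# ['[Retrieval]', '[Relevant]', '[Fully supported]', '[Utility:5]']
```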

Generating the Metrics

One approach is to prompt the LLM to judge relevance and support, but this adds latency and cost. Self‑RAG instead fine‑tunes the model so that it emits the metric tokens autonomously—so‑called “self‑reflection tokens.”

Scoring with Log‑Probabilities

The model’s logprobs field contains the log‑probability of each candidate token at every generation step. By locating the position of a metric token and extracting its probability, the system can compute a normalized score.
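The conversion itself is one `exp()` call followed by a normalization. The log‑probability values below are made up for illustration:

```python
import numpy as np

# Toy log-probabilities for the three IsSup tokens at one generation step
log_probs = {
    "[Fully supported]": -0.2,
    "[Partially supported]": -1.8,
    "[No support / Contradictory]": -4.0,
}

# exp() turns log-probabilities back into ordinary probabilities
probs = {tok: np.exp(lp) for tok, lp in log_probs.items()}

# Normalize over the three options so they sum to 1 before weighting
total = sum(probs.values())
normalized = {tok: p / total for tok, p in probs.items()}
```

Normalizing over just the metric's own tokens makes the score comparable across candidates even when the model spreads probability mass over unrelated vocabulary.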

Log‑probs illustration

Example: Computing IsSup Score

The following Python function demonstrates how to locate the [Fully supported] or [Partially supported] tokens, convert their log‑probabilities to normal probabilities, apply the weighting (0.5 for partial support), and return a final support score.

import numpy as np

# Response support tokens
_IS_SUPPORTED_TOKENS = [
    "[Fully supported]",
    "[Partially supported]",
    "[No support / Contradictory]",
]

def calc_is_supported_score(pred_tokens, pred_log_probs_dict):
    # Find the index of the first support token
    token_appear_id = -1
    for tok_idx, token in enumerate(pred_tokens):
        if token in _IS_SUPPORTED_TOKENS:
            token_appear_id = tok_idx
            break
    if token_appear_id == -1:
        return 0.0
    # Gather probabilities for each support token at that position
    issup_score_dict = {}
    for token in _IS_SUPPORTED_TOKENS:
        prob = pred_log_probs_dict[token_appear_id][token]
        issup_score_dict[token] = np.exp(float(prob))
    # Compute weighted score (partial support weighted by 0.5)
    is_supported_score = (
        issup_score_dict["[Fully supported]"] +
        0.5 * issup_score_dict["[Partially supported]"]
    ) / sum(issup_score_dict.values())
    return is_supported_score

Similar formulas are used for IsRel (ratio of relevant token probability) and IsUse (weighted sum of utility tokens).

Practical Implementation

The open‑source Self‑RAG project provides a fine‑tuned LLaMA‑2 7B model (selfrag_llama2_7b) that already emits these tokens. Users can integrate the model with frameworks such as vLLM or llama.cpp, retrieve the predicted tokens and their per‑position log‑probabilities (the pred_tokens and pred_log_probs_dict inputs used above), and apply the scoring functions to select the best answer.
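Once each candidate has its three metric scores, the final selection is a weighted combination. The weights below are illustrative defaults, not the official values, and the candidate dictionary shape is an assumption made for this sketch:

```python
# Illustrative candidate selection: combine the three metric scores with
# tunable weights (the defaults here are assumptions, not official values)
def select_best_answer(candidates, w_rel=1.0, w_sup=1.0, w_use=0.5):
    # Each candidate is a dict with its text and precomputed metric scores
    def combined(c):
        return (w_rel * c["is_rel"]
                + w_sup * c["is_sup"]
                + w_use * c["is_use"])
    return max(candidates, key=combined)["text"]
```

Raising `w_sup` relative to the others biases selection toward answers grounded in the retrieved evidence, which is usually the point of running Self‑RAG in the first place.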

In the next part of the series, the authors will demonstrate a full Self‑RAG application using the provided model.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Python, LLM, model fine-tuning, evaluation metrics, logprobs, Self-RAG, Retrieval-Augmented Generation
Written by

AI Large Model Application Practice

Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.
