Mastering RAG Evaluation: Recall@K, MRR, NDCG, and RAGAS Explained

This article splits RAG evaluation into two layers, retrieval and generation, and explains the core metrics for each: Recall@K, MRR, and NDCG for retrieval, and the four RAGAS scores for generation. It shows how to implement them with LangChain.js, highlights common pitfalls, and suggests scenario-specific metric combinations for reliable performance monitoring.

James' Growth Diary

01 RAG Evaluation’s Two‑Layer Structure: Retrieval vs. Generation

RAG consists of a retrieval stage (Retriever) that returns document chunks and a generation stage (Generator) that produces the final answer. Problems in each layer are independent: the retriever may return irrelevant or poorly ranked documents, while the generator may hallucinate or ignore useful context. Therefore, evaluation must be split accordingly.

RAG evaluation two‑layer structure: separate retrieval and generation metrics

02 Recall@K – Is the answer among the top K results?

Recall@K measures the proportion of all relevant documents that appear in the top K retrieved items:

Recall@K = (number of relevant docs in top K) / (total relevant docs)

Example: if 4 of 5 relevant docs appear in the top 10, Recall@10 = 0.8.

function recallAtK(retrieved: string[], relevant: string[], k: number): number {
  const topK = retrieved.slice(0, k);
  const relevantSet = new Set(relevant);
  const hits = topK.filter(id => relevantSet.has(id)).length;
  return relevant.length === 0 ? 0 : hits / relevant.length;
}

function avgRecallAtK(queries: Array<{retrieved: string[]; relevant: string[]}>, k: number): number {
  const scores = queries.map(q => recallAtK(q.retrieved, q.relevant, k));
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}

const testData = [
  {retrieved: ["doc_3","doc_7","doc_1","doc_5","doc_9"], relevant: ["doc_1","doc_3","doc_5"]},
  {retrieved: ["doc_2","doc_8","doc_4","doc_6","doc_10"], relevant: ["doc_2","doc_4"]}
];
console.log(`Recall@3: ${avgRecallAtK(testData, 3).toFixed(3)}`); // 0.833 (2/3 for the first query, 2/2 for the second)
console.log(`Recall@5: ${avgRecallAtK(testData, 5).toFixed(3)}`); // 1.000

Recall@K ignores ranking order; a relevant document at position 10 counts the same as one at position 1. This blind spot motivates the use of MRR.
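A small sketch of this blind spot, using the same hits-over-relevant definition as the code above. The document IDs and the helper name `recallAtKDemo` are illustrative, not part of the original example.

```typescript
// Same definition as recallAtK above, repeated so this snippet runs on its own.
function recallAtKDemo(retrieved: string[], relevant: string[], k: number): number {
  const relevantSet = new Set(relevant);
  const hits = retrieved.slice(0, k).filter(id => relevantSet.has(id)).length;
  return relevant.length === 0 ? 0 : hits / relevant.length;
}

const relevantDocs = ["doc_1", "doc_2"];
// Hypothetical rankings: the same five documents, relevant ones first vs. last.
const relevantFirst = ["doc_1", "doc_2", "doc_a", "doc_b", "doc_c"];
const relevantLast = ["doc_a", "doc_b", "doc_c", "doc_1", "doc_2"];

const recallFirst = recallAtKDemo(relevantFirst, relevantDocs, 5);
const recallLast = recallAtKDemo(relevantLast, relevantDocs, 5);
console.log(recallFirst, recallLast); // 1 1
```

Both rankings get a perfect Recall@5 even though the second one buries the relevant documents at ranks 4 and 5.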

03 MRR – Where does the first relevant document appear?

Mean Reciprocal Rank (MRR) focuses on the rank of the first relevant document: MRR = (1/|Q|) × Σ (1 / rank_i). The reciprocal gives a score of 1.0 for rank 1, 0.5 for rank 2, etc., heavily penalising later hits.

function reciprocalRank(retrieved: string[], relevant: string[]): number {
  const relevantSet = new Set(relevant);
  for (let i = 0; i < retrieved.length; i++) {
    if (relevantSet.has(retrieved[i])) return 1 / (i + 1);
  }
  return 0;
}

function meanReciprocalRank(queries: Array<{retrieved: string[]; relevant: string[]}>) {
  const scores = queries.map(q => reciprocalRank(q.retrieved, q.relevant));
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}

const strategyA = [
  {retrieved: ["doc_x","doc_1","doc_3"], relevant: ["doc_1"]}, // relevant at rank 2
  {retrieved: ["doc_2","doc_y","doc_z"], relevant: ["doc_2"]}  // relevant at rank 1
];
const strategyB = [
  {retrieved: ["doc_1","doc_x","doc_3"], relevant: ["doc_1"]}, // rank 1
  {retrieved: ["doc_y","doc_2","doc_z"], relevant: ["doc_2"]}  // rank 2
];
console.log(`Strategy A MRR: ${meanReciprocalRank(strategyA).toFixed(3)}`); // 0.750
console.log(`Strategy B MRR: ${meanReciprocalRank(strategyB).toFixed(3)}`); // 0.750

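MRR scores both strategies at 0.750. Because it considers only the first relevant hit per query, it also cannot tell how well any additional relevant documents are ranked or how relevant each one is; NDCG is needed for that finer ranking discrimination.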
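The first-hit limitation is easiest to see on a single query with several relevant documents. The sketch below is illustrative only: the rankings and the helper name `reciprocalRankDemo` are assumptions, and the function mirrors the `reciprocalRank` definition above.

```typescript
// Same definition as reciprocalRank above, repeated so this snippet runs on its own.
function reciprocalRankDemo(retrieved: string[], relevant: string[]): number {
  const relevantSet = new Set(relevant);
  const firstHit = retrieved.findIndex(id => relevantSet.has(id));
  return firstHit === -1 ? 0 : 1 / (firstHit + 1);
}

const relevantIds = ["doc_1", "doc_2", "doc_3"];
// Hypothetical rankings: both put doc_1 first but differ in where doc_2 and doc_3 land.
const tightRanking = ["doc_1", "doc_2", "doc_3", "doc_x", "doc_y"];
const looseRanking = ["doc_1", "doc_x", "doc_y", "doc_2", "doc_3"];

const rrTight = reciprocalRankDemo(tightRanking, relevantIds);
const rrLoose = reciprocalRankDemo(looseRanking, relevantIds);
console.log(rrTight, rrLoose); // 1 1
```

Both rankings receive a reciprocal rank of 1, although the first is clearly better for a generator that needs all three documents.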

04 NDCG – Comprehensive ranking quality

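Normalized Discounted Cumulative Gain (NDCG) accounts for both graded relevance and position. It computes DCG@K = Σ_{i=1..K} rel_i / log₂(i + 1), where rel_i is the relevance grade of the document at rank i (ranks start at 1), and divides by the ideal DCG (IDCG), the DCG of the best possible ordering. Scores range from 0 to 1, with 1 meaning the ranking matches the ideal order.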

function dcgAtK(relevances: number[], k: number): number {
  return relevances.slice(0, k).reduce((acc, rel, i) => acc + rel / Math.log2(i + 2), 0);
}

function ndcgAtK(retrieved: string[], relevanceMap: Map<string, number>, k: number): number {
  const retrievedRels = retrieved.slice(0, k).map(id => relevanceMap.get(id) ?? 0);
  const idealRels = Array.from(relevanceMap.values()).sort((a, b) => b - a);
  const dcg = dcgAtK(retrievedRels, k);
  const idcg = dcgAtK(idealRels, k);
  return idcg === 0 ? 0 : dcg / idcg;
}

const relevanceScores = new Map([
  ["doc_1", 2], // highly relevant
  ["doc_3", 1], // partially relevant
  ["doc_5", 2]
]);
const retrievalA = ["doc_x","doc_y","doc_1","doc_3","doc_5"]; // relevant docs later
const retrievalB = ["doc_1","doc_5","doc_3","doc_x","doc_y"]; // relevant docs early
console.log(`Strategy A NDCG@5: ${ndcgAtK(retrievalA, relevanceScores, 5).toFixed(3)}`); // 0.586
console.log(`Strategy B NDCG@5: ${ndcgAtK(retrievalB, relevanceScores, 5).toFixed(3)}`); // 1.000

NDCG is preferred when the knowledge base contains multiple relevant documents with varying relevance levels.
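To see where an NDCG value comes from, it can help to print the per-rank terms of the DCG sum. The sketch below recomputes Strategy A from the example above by hand; the helper names `discountedGains` and `sumOf` are illustrative.

```typescript
// Per-rank contribution to DCG: rel / log2(rank + 1), with rank starting at 1.
function discountedGains(relevances: number[]): number[] {
  return relevances.map((rel, i) => rel / Math.log2(i + 2));
}

const sumOf = (xs: number[]): number => xs.reduce((a, b) => a + b, 0);

// Relevance grade at each retrieved position for Strategy A
// (doc_x and doc_y are unlabelled, so they count as 0).
const gainsA = discountedGains([0, 0, 2, 1, 2]);
// Ideal ordering of the labelled documents: highest grades first.
const gainsIdeal = discountedGains([2, 2, 1]);

const ndcgA = sumOf(gainsA) / sumOf(gainsIdeal);
console.log(gainsA.map(g => g.toFixed(3)).join(", ")); // 0.000, 0.000, 1.000, 0.431, 0.774
console.log(ndcgA.toFixed(3)); // 0.586
```

The two empty top slots cost Strategy A most of its score: the highly relevant doc_1 contributes only 1.000 at rank 3, where it would contribute 2.000 at rank 1.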

05 RAGAS – Generation‑layer metrics

RAGAS (Retrieval‑Augmented Generation Assessment) provides four scores that evaluate how the generator uses retrieved context:

Faithfulness : proportion of answer statements supported by the context (detects hallucination).

Answer Relevancy : whether the answer actually addresses the query.

Context Precision : share of retrieved documents that are useful, weighted by rank so that useful documents near the top count more.

Context Recall : coverage of information required by the ground‑truth answer.

All four numbers must be high for a healthy RAG system.

import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";

const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

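// Faithfulness: share of the answer's claims that the context supports
async function faithfulness(context: string[], answer: string): Promise<number> {
  // Literal braces are doubled ({{ }}) so PromptTemplate does not treat the JSON shape as a variable
  const prompt = PromptTemplate.fromTemplate(`
You are a strict fact-checker. Split the answer into individual claims and count how many are supported by the context.
Context: {context}
Answer: {answer}

Return only JSON: {{"supported_claims": X, "total_claims": Y}}
`);
  const chain = prompt.pipe(llm);
  const result = await chain.invoke({ context: context.join("\n\n"), answer });
  const parsed = JSON.parse(result.content as string);
  // An answer with no checkable claims scores 0 instead of NaN
  return parsed.total_claims === 0 ? 0 : parsed.supported_claims / parsed.total_claims;
}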

// Context Precision (simplified example): rank-aware, so useful fragments near the top count more
async function contextPrecision(query: string, contexts: string[], groundTruth: string): Promise<number> {
  // Literal braces are doubled ({{ }}) so PromptTemplate does not parse them as a variable
  const prompt = PromptTemplate.fromTemplate(`
Question: {query}
Ground truth: {groundTruth}
Document fragment: {context}
Is this fragment useful for answering the question? Return only JSON: {{"useful": true}} or {{"useful": false}}
`);
  const chain = prompt.pipe(llm);
  let usefulCount = 0;
  let precisionSum = 0;
  for (let i = 0; i < contexts.length; i++) {
    const result = await chain.invoke({ query, groundTruth, context: contexts[i] });
    const { useful } = JSON.parse(result.content as string);
    if (useful) {
      usefulCount++;
      // Precision@(i+1): useful fragments seen so far divided by fragments seen so far
      precisionSum += usefulCount / (i + 1);
    }
  }
  // Average Precision@k over the positions that held a useful fragment
  return usefulCount === 0 ? 0 : precisionSum / usefulCount;
}
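Context Recall from the list above has no implementation in this section. What follows is a minimal sketch, assuming the ground-truth answer has already been split into statements and that the support check is supplied by the caller. The `SupportJudge` type and `substringJudge` are hypothetical names introduced here; a real judge would wrap an LLM call such as the `llm` instance used above.

```typescript
// A judge decides whether one ground-truth statement is supported by the contexts.
// In practice this would wrap an LLM call; here it is a plain injected function.
type SupportJudge = (statement: string, contexts: string[]) => Promise<boolean>;

// Context Recall = supported ground-truth statements / total ground-truth statements.
async function contextRecall(
  groundTruthStatements: string[],
  contexts: string[],
  judge: SupportJudge
): Promise<number> {
  if (groundTruthStatements.length === 0) return 0;
  const verdicts = await Promise.all(
    groundTruthStatements.map(statement => judge(statement, contexts))
  );
  return verdicts.filter(Boolean).length / groundTruthStatements.length;
}

// Toy judge for demonstration only: naive substring match instead of an LLM.
const substringJudge: SupportJudge = async (statement, contexts) =>
  contexts.some(c => c.toLowerCase().includes(statement.toLowerCase()));

contextRecall(
  ["paris is the capital of france", "france uses the euro"],
  ["Paris is the capital of France and its largest city."],
  substringJudge
).then(score => console.log(score.toFixed(2))); // 0.50
```

Injecting the judge keeps the scoring arithmetic testable without network access, at the cost of leaving the statement-splitting step to the caller.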

06 Full Evaluation Pipeline

To turn metrics into actionable reports, you need a dataset of (query, contexts, answer, ground_truth) tuples. Two ways to build it: manual annotation (high quality, slow) or LLM‑generated synthetic QA pairs.
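One possible shape for such a record is sketched below. The field names follow the tuple described above; `relevantDocIds` and the `validateSample` helper are assumptions added for this illustration, on the grounds that the retrieval metrics need known-relevant IDs.

```typescript
// One evaluation record. relevantDocIds is an assumption of this sketch:
// the retrieval metrics (Recall@K, MRR) need to know which chunks are relevant.
interface EvalSample {
  query: string;
  contexts: string[];       // chunks the retriever returned
  answer: string;           // what the generator produced
  groundTruth: string;      // reference answer
  relevantDocIds: string[]; // IDs of chunks that should have been retrieved
}

// Returns a list of problems; an empty list means the sample is usable.
function validateSample(sample: EvalSample): string[] {
  const problems: string[] = [];
  if (sample.query.trim() === "") problems.push("empty query");
  if (sample.groundTruth.trim() === "") problems.push("missing ground truth");
  if (sample.relevantDocIds.length === 0) problems.push("no relevant doc IDs");
  return problems;
}

const incomplete: EvalSample = {
  query: "What does Recall@K measure?",
  contexts: [],
  answer: "",
  groundTruth: "",
  relevantDocIds: ["doc_1"],
};
console.log(validateSample(incomplete)); // one problem: missing ground truth
```

Running a check like this before evaluation surfaces unusable samples early, instead of letting them show up later as zero or NaN scores.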

import { ChatOpenAI } from "@langchain/openai";
import { Document } from "@langchain/core/documents";

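// Step 1: LLM-generated synthetic dataset
async function synthesizeEvalDataset(docs: Document[], numQuestions = 20) {
  const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0.3 });
  const samples: Array<{ query: string; groundTruth: string; relevantDocIds: string[] }> = [];
  for (let i = 0; i < numQuestions; i++) {
    // Pick a source document at random; it becomes the known-relevant document for the question
    const doc = docs[Math.floor(Math.random() * docs.length)];
    const prompt = `Based on the following document, generate a deep question and its reference answer.
Return only JSON of the form {"question": "...", "answer": "..."}.
Document: ${doc.pageContent}
Output:`;
    const result = await llm.invoke(prompt);
    const { question, answer } = JSON.parse(result.content as string);
    // Assumes each document carries its ID in metadata.id, which is what runRAGEval reads
    samples.push({ query: question, groundTruth: answer, relevantDocIds: [doc.metadata.id as string] });
  }
  return samples;
}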

// Step 2: Run evaluation and aggregate results
async function runRAGEval(ragChain: any, evalDataset: Array<{query: string; groundTruth: string; relevantDocIds: string[]}>, vectorStore: any) {
  const results = [];
  for (const sample of evalDataset) {
    const ragOutput = await ragChain.invoke({ query: sample.query });
    const retrievedDocs = await vectorStore.similaritySearch(sample.query, 10);
    const retrievedIds = retrievedDocs.map(d => d.metadata.id as string);
    const recall5 = recallAtK(retrievedIds, sample.relevantDocIds, 5);
    const mrr = reciprocalRank(retrievedIds, sample.relevantDocIds);
    const faith = await faithfulness(ragOutput.contexts, ragOutput.answer);
    results.push({ recall5, mrr, faithfulness: faith });
  }
  const avg = (key: string) => (results.reduce((a, b) => a + (b as any)[key], 0) / results.length).toFixed(3);
  console.log("\n===== RAG Evaluation Report =====");
  console.log(`Recall@5:   ${avg("recall5")}`);
  console.log(`MRR:        ${avg("mrr")}`);
  console.log(`Faithfulness: ${avg("faithfulness")}`);
  if (parseFloat(avg("recall5")) < 0.7) console.log("⚠️  Recall@5 low → consider larger TopK or hybrid retrieval");
  if (parseFloat(avg("faithfulness")) < 0.8) console.log("⚠️  Faithfulness low → LLM may hallucinate, tighten system prompt");
}
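The pipeline above passes model output straight to `JSON.parse`, which throws if the model wraps the JSON in extra prose or a code fence. A defensive helper could look like the sketch below; `parseJsonFromReply` is a hypothetical name, and the extraction rule (first `{` to last `}`) is a deliberate simplification.

```typescript
// Extracts and parses the JSON object between the first "{" and the last "}".
// Returns null instead of throwing when nothing parseable is found.
function parseJsonFromReply<T>(reply: string): T | null {
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(reply.slice(start, end + 1)) as T;
  } catch {
    return null;
  }
}

// Hypothetical model reply with prose around the JSON payload.
const reply = 'Here is the result: {"supported_claims": 3, "total_claims": 4} Let me know if you need more.';
const parsedReply = parseJsonFromReply<{ supported_claims: number; total_claims: number }>(reply);
console.log(parsedReply?.supported_claims); // 3
console.log(parseJsonFromReply("no JSON here")); // null
```

Returning null lets the caller skip or retry a sample rather than abort the whole evaluation run. The helper does not handle replies that contain several separate JSON objects.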

07 Common Pitfalls (5 Typical Mistakes)

Using production logs as test data without ground truth – Context Recall cannot be computed.

Setting K too large – Recall@20 may look perfect while you only feed Top 5 to the LLM.

Focusing solely on Faithfulness while ignoring Context Precision/Recall – an answer can be fully faithful to context that is irrelevant or incomplete, so clean context is required for accurate answers.

Applying BLEU/ROUGE to semantic RAG tasks – they score n-gram overlap rather than meaning, so a correct paraphrase can score poorly; use RAGAS instead.

Never updating the evaluation set after knowledge‑base changes – leads to stale metrics and hidden regressions.
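The second pitfall (K larger than what the LLM actually sees) can be illustrated with a few lines. The ranking, the relevant IDs, and `llmContextSize` are made-up values for this sketch, and `recallAt` repeats the Recall@K definition from section 02.

```typescript
// Recall@K as defined in section 02, repeated so this snippet runs on its own.
function recallAt(retrieved: string[], relevant: string[], k: number): number {
  const relevantSet = new Set(relevant);
  const hits = retrieved.slice(0, k).filter(id => relevantSet.has(id)).length;
  return relevant.length === 0 ? 0 : hits / relevant.length;
}

// Hypothetical ranking of 20 chunks: doc_1 ... doc_20, in retrieval order.
const ranked = Array.from({ length: 20 }, (_, i) => `doc_${i + 1}`);
// Relevant chunks sit at ranks 4, 9 and 15.
const relevantChunks = ["doc_4", "doc_9", "doc_15"];
const llmContextSize = 5; // assumed number of chunks actually passed to the LLM

console.log(recallAt(ranked, relevantChunks, 20).toFixed(3)); // 1.000
console.log(recallAt(ranked, relevantChunks, llmContextSize).toFixed(3)); // 0.333
```

Reporting recall at the same K that the generator consumes keeps the metric tied to what the LLM can actually use.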

08 Scenario‑Specific Metric Combinations

Different RAG use‑cases prioritize different metrics:

QA (single answer) : MRR + Faithfulness.

Knowledge‑base search (multiple docs) : NDCG@10 + Context Precision.

Chatbot / customer‑service : Recall@5 + Answer Relevancy + Faithfulness.

Compliance / internal docs : Faithfulness (primary) + Context Precision.
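These combinations can be kept as a small lookup table so a monitoring job checks only the metrics that matter for a given deployment. The scenario keys, the `failingMetrics` helper, and the single shared threshold are assumptions of this sketch; the metric lists mirror the combinations above.

```typescript
type Metric =
  | "Recall@5" | "MRR" | "NDCG@10"
  | "Faithfulness" | "Answer Relevancy" | "Context Precision" | "Context Recall";

type Scenario = "qa" | "kbSearch" | "chatbot" | "compliance";

// Scenario keys are hypothetical labels for the four use cases listed above.
const metricsByScenario: Record<Scenario, Metric[]> = {
  qa: ["MRR", "Faithfulness"],
  kbSearch: ["NDCG@10", "Context Precision"],
  chatbot: ["Recall@5", "Answer Relevancy", "Faithfulness"],
  compliance: ["Faithfulness", "Context Precision"],
};

// Returns the scenario's metrics that fall below the threshold.
// A metric with no reported score is treated as 0, so it is flagged too.
function failingMetrics(
  scenario: Scenario,
  scores: Partial<Record<Metric, number>>,
  threshold: number
): Metric[] {
  return metricsByScenario[scenario].filter(m => (scores[m] ?? 0) < threshold);
}

const chatbotScores = { "Recall@5": 0.9, "Answer Relevancy": 0.6, "Faithfulness": 0.85 };
console.log(failingMetrics("chatbot", chatbotScores, 0.8)); // only "Answer Relevancy" is below 0.8
```

In practice each metric would probably need its own threshold; a single value is used here only to keep the example short.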

Conclusion

The article decomposes RAG evaluation into retrieval and generation layers, explains the progressive nature of Recall@K, MRR, and NDCG, stresses the critical role of Faithfulness, and shows how Context Precision and Faithfulness must be examined together. It also provides a runnable pipeline, outlines five real‑world pitfalls, and maps metrics to concrete scenarios, preparing practitioners to monitor and improve RAG systems systematically.

RAG evaluation metric overview
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

James' Growth Diary

I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.
