How to Build a Robust RAG Evaluation Framework for Finance Q&A

This guide outlines a five‑dimensional evaluation system—accuracy, credibility, latency, scalability, and user experience—providing concrete metrics, code examples, and practical steps to assess Retrieval‑Augmented Generation models in financial insurance question‑answering scenarios.

In large enterprises and financial institutions, simply getting a Retrieval‑Augmented Generation (RAG) project to run is not enough; a systematic, quantitative evaluation framework is essential. This article presents a comprehensive five‑dimensional evaluation scheme tailored to finance‑insurance QA systems.

Why an Evaluation Framework Matters

The core of a RAG system is the balance between retrieval and generation. Before going to production, a system must answer five questions: Are the answers correct? Are they grounded in trusted sources? Are they fast? Can the system scale? Are users satisfied? These map to the five evaluation dimensions: recall/accuracy, credibility, latency, scalability, and user experience.

1. Recall and Accuracy: Ensuring Correct Answers

Two primary aspects are measured:

Answer Accuracy: compare the model's output against a reference answer.

Retrieval Recall: verify that the system retrieves documents that contain the correct answer.

Standard NLP metrics are used:

BLEU: n‑gram overlap with the reference answer.

ROUGE: recall‑oriented overlap with the reference answer.

MRR (Mean Reciprocal Rank): the rank position of the first correct answer in the retrieved list.

Top‑k Recall: whether a correct document appears in the top k results.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge import Rouge

# Reference answer vs. generated answer (Chinese insurance QA example).
ref = "您的汽车保险可赔偿医疗费用、车辆维修费,以及第三方损害赔偿。"
gen = "您的保单通常涵盖车祸后的医疗费用、车辆损失,以及对第三方的赔偿。"

# Chinese has no whitespace tokenization, so score at the character level.
chencherry = SmoothingFunction()
bleu = sentence_bleu([list(ref)], list(gen), smoothing_function=chencherry.method1)

# The rouge package splits tokens on whitespace, so join the characters with spaces.
rouge = Rouge().get_scores(" ".join(gen), " ".join(ref))
print(f"BLEU: {bleu:.3f}, ROUGE-1 F1: {rouge[0]['rouge-1']['f']:.3f}")

If BLEU > 0.6 and ROUGE‑1 F1 > 0.7, the answer is considered sufficiently similar to the reference. Combined with retrieval MRR and Top‑3 recall, this indicates overall pipeline correctness.
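
The retrieval side can be scored directly from the ranked results. Below is a minimal sketch of MRR and Top‑3 recall, where the ranked document IDs and the labeled relevant IDs are purely illustrative:

# Each entry: ranked document IDs returned by the retriever, plus the labeled relevant IDs.
results = [
    {"ranked": ["d3", "d1", "d7"], "relevant": {"d1"}},
    {"ranked": ["d2", "d9", "d4"], "relevant": {"d4", "d9"}},
    {"ranked": ["d5", "d6", "d8"], "relevant": {"d0"}},
]

def mrr(results):
    # Reciprocal rank of the first relevant document, averaged over queries.
    total = 0.0
    for r in results:
        for rank, doc_id in enumerate(r["ranked"], start=1):
            if doc_id in r["relevant"]:
                total += 1.0 / rank
                break
    return total / len(results)

def top_k_recall(results, k=3):
    # Fraction of queries whose top-k results contain at least one relevant document.
    hits = sum(1 for r in results if set(r["ranked"][:k]) & r["relevant"])
    return hits / len(results)

print(f"MRR: {mrr(results):.3f}, Top-3 recall: {top_k_recall(results, 3):.3f}")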

2. Credibility: Answers Must Be Grounded

Two metrics assess grounding:

Answer‑Document Matching: the degree to which key information in the answer appears in the retrieved documents.

Document Coverage: the proportion of answer tokens found in the retrieved supporting documents.

# Generated answer and the documents retrieved to support it.
generated = "您的保单通常涵盖车祸后的医疗费用、车辆损失,以及对第三方的赔偿。"
docs = [
    "根据保险条款,医疗费用和车辆损失在车祸理赔中可以获得赔偿。",
    "如果您对第三方造成损害,保险也会提供相应的赔付。"
]

# Character-level tokens, ignoring whitespace.
tokens = lambda x: [c for c in x if c.strip()]
a, d = set(tokens(generated)), set(tokens("".join(docs)))

# Share of answer characters that also appear in the retrieved documents.
coverage = len(a & d) / len(a)
print(f"Supporting-document coverage: {coverage:.2f}")

A coverage > 0.7 suggests most answer content is sourced from retrieved documents, indicating high credibility. Advanced systems may add vector similarity or explicit citation checks.
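
One way to add a vector‑similarity check is to embed the answer and each retrieved document and compare cosine similarity. A minimal sketch, assuming the sentence-transformers library is installed and reusing generated and docs from the snippet above (the model name is only an example):

from sentence_transformers import SentenceTransformer, util

# Example multilingual embedding model; any sentence-level embedder works here.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

answer_emb = model.encode(generated, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

# Cosine similarity between the answer and each retrieved document.
sims = util.cos_sim(answer_emb, doc_embs)[0]
print(f"Max answer-document similarity: {float(sims.max()):.3f}")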

3. Latency: Measuring Speed and Stability

Financial users prioritize fast responses. Common latency metrics include average response time, P95, and P99 tail latencies.

import random

# Simulated end-to-end response times in seconds; replace with real measurements.
times = sorted(random.uniform(0.1, 0.3) for _ in range(100))

avg = sum(times) / len(times)
p95 = times[int(0.95 * len(times)) - 1]   # 95th-percentile latency
p99 = times[int(0.99 * len(times)) - 1]   # 99th-percentile latency
print(f"Average: {avg:.3f}s, P95: {p95:.3f}s, P99: {p99:.3f}s")

Typical output: "Average: 0.200s, P95: 0.280s, P99: 0.290s". If P99 is significantly higher than the average, investigate bottlenecks such as slow retrieval or slow model inference.
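
To find those bottlenecks, time the retrieval and generation stages separately rather than only the end‑to‑end request. A minimal sketch, where retrieve and generate are placeholders for the real pipeline calls:

import time

def retrieve(query):
    time.sleep(0.05)          # placeholder for the real vector search
    return ["doc1", "doc2"]

def generate(query, docs):
    time.sleep(0.15)          # placeholder for the real LLM call
    return "answer"

query = "What does my policy cover after a car accident?"
t0 = time.perf_counter()
docs = retrieve(query)
t1 = time.perf_counter()
answer = generate(query, docs)
t2 = time.perf_counter()

print(f"Retrieval: {(t1 - t0)*1000:.1f}ms, Generation: {(t2 - t1)*1000:.1f}ms, Total: {(t2 - t0)*1000:.1f}ms")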

4. Scalability: Handling Larger Data and Traffic

Tests examine how response time and throughput change with increasing document volume and query concurrency.

import time

# Simulate retrieval over growing corpora with a naive linear scan.
sizes = [1000, 10000, 100000]
for s in sizes:
    data = list(range(s))
    q = 100                               # number of queries per corpus size
    start = time.perf_counter()
    for _ in range(q):
        _ = (s + 1) in data               # worst case: scan the whole list
    total = time.perf_counter() - start
    print(f"Corpus size: {s}, avg latency: {total/q*1000:.3f}ms, throughput: {q/total:.1f}/s")

As the corpus grows, latency rises and throughput (QPS) falls; how steeply they degrade reveals whether the indexing structures and caching strategies are keeping up.
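
Concurrency can be probed the same way: keep the corpus fixed and increase the number of parallel queries, watching how throughput responds. A minimal sketch using a thread pool, where handle_query is a placeholder for the real RAG call:

import time
from concurrent.futures import ThreadPoolExecutor

def handle_query(q):
    time.sleep(0.05)          # placeholder for retrieval + generation
    return f"answer to {q}"

for workers in (1, 5, 20):
    queries = [f"query-{i}" for i in range(100)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(handle_query, queries))
    total = time.perf_counter() - start
    print(f"Concurrency: {workers}, Total: {total:.2f}s, Throughput: {len(queries)/total:.1f}/s")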

5. User Experience: Human‑Centric Evaluation

Two aspects are measured:

User satisfaction (manual rating or feedback ratio).

Answer readability (e.g., Flesch Reading Ease for English or sentence length metrics for Chinese).

# Example 1-5 satisfaction ratings collected from users.
ratings = [5, 4, 5, 3, 4, 4, 5]
avg = sum(ratings) / len(ratings)
print(f"Average user satisfaction: {avg:.2f}/5")

Output: "Average user satisfaction: 4.29/5". Higher scores indicate a pleasant experience; readability metrics help catch jargon‑heavy responses.
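
Readability can be approximated without full linguistic tooling. For Chinese answers, a simple proxy is average sentence length in characters (the threshold below is only illustrative); for English answers, Flesch Reading Ease from a package such as textstat serves the same purpose.

import re

answer = "您的保单通常涵盖车祸后的医疗费用、车辆损失,以及对第三方的赔偿。"

# Split on common Chinese sentence-ending punctuation and drop empty pieces.
sentences = [s for s in re.split(r"[。!?;]", answer) if s.strip()]
avg_len = sum(len(s) for s in sentences) / len(sentences)

# Illustrative rule of thumb: very long sentences tend to read as dense jargon.
label = "easy to read" if avg_len <= 30 else "consider shortening sentences"
print(f"Sentences: {len(sentences)}, average length: {avg_len:.1f} chars -> {label}")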

Putting It All Together

The five dimensions form a closed loop: recall/accuracy guarantees correct answers, credibility ensures they are grounded, latency provides speed, scalability confirms the system can handle growth, and user experience validates overall usability. Together they determine whether a RAG system is production‑ready.
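
In practice the per‑dimension results are usually normalized and rolled up into a single release gate. A minimal sketch, where the scores, weights, and thresholds are purely illustrative:

# Illustrative normalized scores (0-1) for each dimension and example weights.
scores = {"accuracy": 0.82, "credibility": 0.78, "latency": 0.90, "scalability": 0.75, "user_experience": 0.86}
weights = {"accuracy": 0.3, "credibility": 0.25, "latency": 0.15, "scalability": 0.15, "user_experience": 0.15}

overall = sum(scores[k] * weights[k] for k in scores)
print(f"Overall RAG score: {overall:.2f}")

# Example gate: require a minimum on every dimension plus a weighted overall threshold.
ready = overall >= 0.8 and all(v >= 0.7 for v in scores.values())
print("Production-ready" if ready else "Needs improvement")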

Interview‑Ready Answer

"We use a five‑dimensional evaluation framework covering accuracy (BLEU, ROUGE, MRR, Top‑k recall), credibility (answer‑document matching and coverage), performance (average latency, P95/P99), scalability (throughput across data scales), and user experience (satisfaction scores and readability). This lets us pinpoint bottlenecks and reliably deploy RAG in production."
[Figure: RAG evaluation diagram]
Written by

Wu Shixiong's Large Model Academy

We share large‑model know‑how on an ongoing basis, helping you master core skills (LLM, RAG, fine‑tuning, deployment) from zero to job offer, whether you are switching careers, preparing for autumn campus recruiting, or looking for a stable large‑model role.
