Mastering RAG Post‑Launch: A Closed‑Loop Badcase Management Blueprint
This article explains how to establish a six-step closed-loop workflow for operating RAG-based question-answer systems in insurance: badcase collection via three channels, four-type classification, automated classification scripts, regression testing, and gray-scale rollout, along with real-world results that lifted answer accuracy from 76% to 89%.
Why Post‑Launch Operations Matter More Than Development
Deploying a Retrieval‑Augmented Generation (RAG) system is only the first step; in high‑risk domains like insurance, continuous operation is the real challenge.
Project background: a mid‑size insurer indexed 5,000 insurance contracts and claim documents into a RAG QA system, allowing customers to ask policy‑specific questions such as "What is the maximum payout for my car accident?".
Before launch the system achieved 76% accuracy on a handcrafted 200‑question test set, but within the first week real users exposed critical gaps.
Typical Failure Scenarios Observed
Three major problems emerged:
Users asked questions not covered by the test set, often in colloquial or dialectal language, leading to retrieval failures.
Compliance risks surfaced when the model returned an incorrect payout amount, prompting legal review.
Knowledge base drift occurred as insurance products were updated without a mechanism to refresh the indexed content.
These issues highlighted the need for a systematic badcase (failure case) handling loop.
Four Badcase Categories
We classify every badcase into one of four types, each requiring a distinct remediation path.
Retrieval Failure (≈40%): The answer exists in the knowledge base, but vector search fails to retrieve it. Root causes: poor embedding recall on colloquial phrasing and sub-optimal chunking that splits key information across chunks. Fix: chunk along semantic boundaries, combine BM25 with dense retrieval, and re-rank the merged results (see the fusion sketch after this list).
Hallucination Generation (≈25%): Retrieval returns the correct documents, yet the LLM generates a wrong answer. Root cause: the model injects memorized industry facts when the prompt is insufficiently constrained. Fix: strengthen prompt fidelity, enforce a "use only the retrieved documents" rule, and require citation of source passages (a prompt sketch also follows this list).
Routing Error (≈20%): The query should be handled by the RAG pipeline but is mistakenly sent to another module (e.g., Text-2-SQL). Root cause: insufficient training data for the routing classifier. Fix: augment the routing training set with mis-routed examples, then fine-tune the classifier or revise the routing prompt.
Knowledge Gap (≈15%): The knowledge base genuinely lacks the required information. Fix: identify the missing documents and have the knowledge-ops team ingest the relevant policies or guides.
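For the retrieval-failure category, the highest-leverage fix is hybrid retrieval. Below is a minimal sketch of BM25-plus-dense fusion using reciprocal rank fusion (RRF); bm25_search and dense_search are hypothetical helpers standing in for your own retrievers, each returning a ranked list of document IDs.

from collections import defaultdict

def hybrid_search(query: str, k: int = 5, rrf_k: int = 60) -> list:
    """Fuse BM25 and dense-vector rankings with reciprocal rank fusion."""
    scores = defaultdict(float)
    # bm25_search / dense_search are assumed retrievers returning doc IDs.
    for ranking in (bm25_search(query, k * 4), dense_search(query, k * 4)):
        for rank, doc_id in enumerate(ranking):
            # Documents ranked highly by either retriever accumulate score;
            # rrf_k damps the contribution of lower-ranked hits.
            scores[doc_id] += 1.0 / (rrf_k + rank + 1)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:k]  # hand the top-k candidates to the re-ranker

RRF needs no score normalization across the two retrievers, which is why it is a common first choice before investing in a learned re-ranker.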
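For the hallucination category, the fix hinges on a prompt that pins the model to the retrieved evidence. A minimal template sketch follows; the exact wording is illustrative, not the production prompt.

FAITHFUL_ANSWER_PROMPT = """You are an insurance QA assistant.
Answer ONLY from the numbered passages below. If they do not contain
the answer, reply exactly: "I cannot find this in the policy documents."
Cite the passage number for every claim, e.g. [2].

Passages:
{passages}

Question: {question}
"""

def build_faithful_prompt(question: str, retrieved_docs: list) -> str:
    # Number each passage so the model can cite its sources.
    passages = "\n".join(
        f"[{i + 1}] {d['text']}" for i, d in enumerate(retrieved_docs)
    )
    return FAITHFUL_ANSWER_PROMPT.format(passages=passages, question=question)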
Three‑Channel Badcase Collection
Relying solely on manual monitoring is infeasible; we therefore deploy three parallel collection channels.
User Feedback Buttons: Each answer in the UI carries thumbs-up and thumbs-down buttons. A down-vote automatically queues the conversation as a candidate badcase, optionally with a short reason (e.g., "incorrect answer").
Customer-Support Tickets: When a support agent tags a ticket with "AI answer issue", the associated dialogue is extracted and stored as a badcase.
Automated Quality Detection: Every conversation is scored on three dimensions: retrieval relevance, answer faithfulness, and key-information completeness. A score below its predefined threshold flags the dialogue as a suspected badcase, as sketched below.
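The detection step itself is just a threshold gate. A minimal sketch, assuming each conversation has already been scored on the three dimensions (by an LLM judge or an evaluation library); the threshold values here are illustrative assumptions, not the production settings.

# Illustrative cutoffs; tune against labeled data in production.
QUALITY_THRESHOLDS = {
    "retrieval_relevance": 0.6,
    "answer_faithfulness": 0.7,
    "key_info_completeness": 0.5,
}

def flag_suspected_badcase(scores: dict) -> dict:
    """Flag a conversation whose quality score falls below any threshold."""
    failed = {
        dim: scores[dim]
        for dim, threshold in QUALITY_THRESHOLDS.items()
        if scores[dim] < threshold
    }
    return {"is_badcase": bool(failed), "failed_dimensions": failed}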
Automated Badcase Classification Script
Manual labeling is too slow; we built an LLM‑driven classifier that consumes the user query, the top‑3 retrieved document snippets, and the generated answer.
import json

def classify_badcase(query: str, retrieved_docs: list, answer: str) -> dict:
    """Use an LLM to automatically classify a badcase into one of four types."""
    classification_prompt = f"""
Analyze the following RAG failure case and determine the failure type:
User question: {query}
Retrieved docs (top 3): {[d['text'][:200] for d in retrieved_docs[:3]]}
System answer: {answer}
Choose one type:
A. Retrieval failure
B. Hallucination generation
C. Routing error
D. Knowledge gap
Output JSON: {{"type": "A/B/C/D", "reason": "...", "fix_direction": "..."}}
"""
    # `llm` is assumed to be a chat-model client whose invoke() returns a
    # message object with a .content string (e.g., a LangChain chat model).
    result = llm.invoke(classification_prompt)
    return json.loads(result.content)

The classifier first checks for missing evidence (knowledge gap), then compares the answer against the retrieved text (hallucination), then verifies routing, and finally defaults to retrieval failure. In practice it achieves roughly 80% accuracy; low-confidence cases (≈15%) are sent for human review.
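Since roughly 15% of cases come back low-confidence, it helps to wrap the classifier in a triage gate. A minimal sketch, assuming the classification prompt is extended to also emit a numeric "confidence" field; the 0.8 cutoff and the payout keyword check are illustrative assumptions.

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; ~15% of cases fall below it

def triage_badcase(query: str, retrieved_docs: list, answer: str) -> dict:
    """Auto-classify a badcase, routing risky or uncertain cases to humans."""
    result = classify_badcase(query, retrieved_docs, answer)
    needs_review = (
        result.get("confidence", 0.0) < CONFIDENCE_THRESHOLD
        # Payout amounts are compliance-sensitive: always verify by hand.
        or "payout" in query.lower()
    )
    result["queue"] = "human_review" if needs_review else "auto_assign"
    return result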
Six‑Step Badcase Operational Loop
1. Collect: Aggregate badcases from the three channels into a central database each day.
2. Auto-Classify: Run the classification script; high-risk cases (e.g., anything involving payout amounts) are flagged for manual verification.
3. Assign: Route each case to the responsible team: retrieval engineers, prompt engineers, routing-model owners, or knowledge-ops.
4. Fix & Verify: Before merging a fix, re-run the original badcase to confirm the answer is now correct.
5. Regression Test: Execute a full evaluation on the test set, enforcing that key metrics do not degrade beyond a 2% tolerance.
def regression_test(old_config: dict, new_config: dict, test_set: list) -> dict:
    """Compare new and old configs on the test set to ensure no degradation."""
    # run_evaluation and find_regressions are assumed evaluation helpers.
    old_scores = run_evaluation(old_config, test_set)
    new_scores = run_evaluation(new_config, test_set)
    # Each metric may drop at most 2 points (0.02) from the old baseline.
    checks = {
        "recall@5": new_scores["recall@5"] >= old_scores["recall@5"] - 0.02,
        "faithfulness": new_scores["faithfulness"] >= old_scores["faithfulness"] - 0.02,
        "answer_accuracy": new_scores["answer_accuracy"] >= old_scores["answer_accuracy"] - 0.02,
    }
    passed = all(checks.values())
    return {
        "passed": passed,
        "details": checks,
        "regression_cases": find_regressions(old_scores, new_scores),
    }

The three core metrics map to the three failure risks: recall@5 guards retrieval, faithfulness guards against hallucination, and answer_accuracy guards overall quality.
6. Gray-Scale Release: Deploy the change to 10% of users for one week; monitor the feedback rate, support tickets, and automated quality scores before the full rollout. A bucketing sketch follows.
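A minimal sketch of the 10% cohort assignment, assuming users carry a stable user_id; hashing the ID together with an experiment name keeps each user in the same cohort for the whole week.

import hashlib

def in_gray_release(user_id: str, experiment: str, percent: int = 10) -> bool:
    """Deterministically bucket a user into the gray-scale cohort."""
    # Hashing user_id with the experiment name yields a stable bucket in
    # [0, 100), so a given user sees the same variant for the entire test.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

# Example: roughly 10% of users get the new retrieval configuration.
print(in_gray_release("user-12345", "rrf-retrieval-v2"))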
Six‑Month Production Results
After implementing the loop, the system’s answer accuracy rose from 76% at launch to 89% after six months. Badcase volume dropped from ~50 per week in month 1 to ~15 per week by month 3. The test set grew from 200 to 350 cases, with 150 real‑world badcases incorporated, dramatically improving evaluation fidelity.
Interview‑Ready Answer Framework
When asked "How do you operate a RAG system after launch?", cover four layers:
Collection – mention the three parallel channels and their coverage differences.
Classification – describe the four‑type framework and the LLM‑assisted automatic classifier with human review for high‑risk cases.
Verification – explain the two‑step validation (original badcase + full regression test) and the specific metrics with a 2% tolerance.
Closed‑Loop – outline the six‑step workflow, cite the 76%→89% accuracy improvement and the reduction in weekly badcases.
Sample concise answer: "We built three badcase collection channels (user feedback, support tickets, automated detection), classify them into retrieval‑failure, hallucination, routing error, and knowledge‑gap, assign each to the appropriate team, validate fixes on the original case then run regression tests ensuring recall@5, faithfulness, and answer_accuracy stay within 2% of baseline, and finally gray‑scale release. In six months this raised accuracy from 76% to 89% and cut weekly badcases from 50 to 15."
Conclusion
The true competitive edge of a RAG system lies not in the launch snapshot but in the ability to continuously iterate based on real‑world badcases. A robust operational loop—stable collection, systematic classification, and rigorous regression testing—turns user‑reported failures into valuable data that steadily improves system quality.
Wu Shixiong's Large Model Academy
We continuously share large-model know-how, helping you master core skills (LLM, RAG, fine-tuning, deployment) from zero to job offer, whether you are switching careers, going through autumn campus recruitment, or looking for a stable large-model role.