Why Your AI Defect Deduplication Returns Mixed Data and How to Fix It
This article details the challenges of building an AI‑powered defect deduplication system using Retrieval‑Augmented Generation, explains why LLMs produce composite (spliced) results, diagnoses the root cause as information loss in the RAG pipeline, and presents a step‑by‑step solution that restores atomicity of records for reliable duplicate detection.
Background: The Challenge of Duplicate Defect Management
In proprietary cloud product development, identifying and managing duplicate defects is costly and time‑consuming because testers and developers struggle to efficiently search a large historical defect database.
Goal: Build a RAG‑Based Automated Defect Analysis Expert
The system uses Retrieval‑Augmented Generation (RAG) to index historical defects (title, description, module, version, etc.) into a vector database for semantic search, then lets a large language model (LLM) compare new defects with the most similar historical ones and output a JSON report.
Core Architecture
Knowledge‑base construction: store fields such as title, description, module, version as vectors.
Intelligent retrieval: extract semantic embeddings from a new defect and perform similarity search to retrieve candidate records.
LLM analysis & generation: feed the new defect and retrieved candidates to the LLM, prompting it to act as an experienced QA expert and produce a strict JSON output with duplication flag, similarity score, and the full fields of the most similar defect.
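The "strict JSON output" in the last step can be enforced in code rather than trusted. The sketch below is one possible validator. The key names are taken from the sample report later in this article, but the function name `parse_report` and the 0 to 100 score range are assumptions made for illustration.

```python
import json

# Expected top-level keys and their types (key names follow the sample report;
# the type choices are assumptions).
REQUIRED_KEYS = {
    "is_duplicate": bool,
    "similarity_score": (int, float),
    "justification": str,
    "most_similar_defect": dict,
}


def parse_report(raw: str) -> dict:
    """Parse the LLM reply and reject anything that breaks the output contract."""
    report = json.loads(raw)
    if not isinstance(report, dict):
        raise ValueError("report must be a JSON object")
    for key, expected in REQUIRED_KEYS.items():
        if key not in report:
            raise ValueError(f"missing key: {key}")
        if not isinstance(report[key], expected):
            raise ValueError(f"wrong type for key: {key}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(report["similarity_score"], bool):
        raise ValueError("similarity_score must be a number")
    # Assumed range: the sample report uses 95, which suggests a 0-100 scale.
    if not 0 <= report["similarity_score"] <= 100:
        raise ValueError("similarity_score out of range")
    return report
```

A validator like this only checks shape. It cannot tell whether the fields inside `most_similar_defect` belong to one real record.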
Problem Encountered: Unresolvable "Data Splicing"
Even with a carefully crafted prompt, the model returned a "most similar defect" whose fields (ID, title, description, module) came from different historical records, creating a composite entity that does not exist in the source data.
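Splicing of this kind can be detected mechanically by checking the returned defect against the source data. The following is a minimal sketch. The field names, sample records, and helper names are hypothetical and not taken from the original system.

```python
from typing import Optional

# Hypothetical field names; the real defect schema may differ.
FIELDS = ("id", "title", "description", "module")


def find_source_record(candidate: dict, history: list) -> Optional[dict]:
    """Return the historical record matching the candidate on every field, or None."""
    for record in history:
        if all(candidate.get(f) == record.get(f) for f in FIELDS):
            return record
    return None


def field_origins(candidate: dict, history: list) -> dict:
    """Map each field to the IDs of the historical records that hold the same value."""
    return {
        f: [r["id"] for r in history if r.get(f) == candidate.get(f)]
        for f in FIELDS
    }


# Illustrative data only.
history = [
    {"id": "bug1", "title": "Login times out",
     "description": "Login hangs after 30 seconds", "module": "auth"},
    {"id": "bug2", "title": "Instance creation aborts",
     "description": "Creating an instance is interrupted", "module": "provisioning"},
]

# A spliced answer: ID and module from bug1, title and description from bug2.
spliced = {"id": "bug1", "title": "Instance creation aborts",
           "description": "Creating an instance is interrupted", "module": "auth"}

if find_source_record(spliced, history) is None:
    print("spliced record, field origins:", field_origins(spliced, history))
```

When no single record matches, `field_origins` shows which historical defect each field was taken from, which makes the composite visible at a glance.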
Initial Diagnosis: Prompt Engineering Failure
The first hypothesis was that the LLM treated the context as a flat pool of information and assembled the best match for each field independently, leading to cross‑record splicing.
Debugging Steps and Revised Prompt
Introduce a two‑stage workflow: first identify a single best matching record, then extract all required fields from that record only.
Emphasize record indivisibility with phrasing such as "treat as an indivisible whole" and "the unique champion record".
Add a strict constraint that all output fields must originate from the same historical defect.
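The three revisions above could be assembled roughly as follows. The original prompt text is not published, so the wording here is an illustrative reconstruction that reuses only the quoted phrases. The `[RECORD n]` delimiters and the function name `build_prompt` are assumptions.

```python
import json

# Illustrative wording; only the quoted phrases come from the article.
INSTRUCTIONS = (
    "You are an experienced QA expert.\n"
    "Step 1: Pick exactly one historical record, the unique champion record, "
    "that best matches the new defect.\n"
    "Step 2: Copy every output field from that record only. "
    "Treat each record as an indivisible whole and never mix fields across records.\n"
    "Return strict JSON with is_duplicate, similarity_score, justification, "
    "and most_similar_defect."
)


def build_prompt(new_defect: dict, candidates: list) -> str:
    """Build a two-stage prompt with each candidate in its own delimited block."""
    blocks = [
        f"[RECORD {i}]\n"
        f"{json.dumps(rec, ensure_ascii=False, indent=2)}\n"
        f"[END RECORD {i}]"
        for i, rec in enumerate(candidates, start=1)
    ]
    return "\n\n".join([
        INSTRUCTIONS,
        "New defect:\n" + json.dumps(new_defect, ensure_ascii=False, indent=2),
        "Historical records:\n" + "\n".join(blocks),
    ])
```

Delimiting each candidate makes record boundaries explicit, but it can only help if every block actually contains the complete record.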
Root Cause Analysis: Information Fragmentation in RAG
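The revised prompt did not eliminate the splicing, which pointed away from prompt wording. The real issue lies in the RAG pipeline, which loses information at each stage: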
Indexing stage loss: only title and description are vectorized; structured metadata (ID, module, version, etc.) is stored separately and can become detached from the text chunks.
Retrieval stage fragmentation: the retriever returns several text fragments without their associated metadata, so the LLM receives incomplete context.
Generation stage hallucination: lacking the missing fields, the LLM fills them with hallucinated or mixed data.
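A quick way to confirm this diagnosis is to inspect what the retriever hands to the LLM before any prompt is built. The sketch below audits retrieved hits for missing metadata. The hit structure (a `text` string plus a `metadata` dictionary) and the required field list are assumptions, since the article does not name the vector database or its result format.

```python
# Assumed schema; adjust to the fields the report must contain.
REQUIRED = ("id", "title", "description", "module", "version")


def missing_fields(hit: dict) -> list:
    """List the required fields that are absent or empty in a hit's metadata."""
    meta = hit.get("metadata") or {}
    return [f for f in REQUIRED if not meta.get(f)]


def audit_hits(hits: list) -> dict:
    """Map the position of each incomplete hit to the fields it lacks."""
    report = {}
    for position, hit in enumerate(hits):
        gaps = missing_fields(hit)
        if gaps:
            report[position] = gaps
    return report


# Illustrative data: the first hit is a bare text fragment, the second is complete.
hits = [
    {"text": "Creating an instance is interrupted", "metadata": {}},
    {"text": "Login hangs after 30 seconds",
     "metadata": {"id": "bug1", "title": "Login times out",
                  "description": "Login hangs after 30 seconds",
                  "module": "auth", "version": "v1"}},
]

print(audit_hits(hits))
```

A non-empty audit report means the LLM is being asked to output fields it was never given, so any value it returns for them is invented or borrowed from another record.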
Solution: Preserve Atomicity of Records
Configure the index to include all relevant structured fields (ID, module, project, version, status) as part of each vector record, ensuring that every retrieved chunk carries its full metadata dictionary.
After this change, the LLM receives complete, coherent records, and the prompt works as intended, producing accurate, non‑spliced duplicate‑detection results.
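The indexing change can be illustrated with a small in-memory index in which each vector entry stores the complete record, so a search can only ever return whole defects. This is a sketch under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, `AtomicIndex` is a hypothetical class rather than any vector database's API, and the sample records are invented.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class AtomicIndex:
    """Each entry pairs a vector with the full record it was built from."""

    def __init__(self) -> None:
        self._entries = []  # list of (vector, complete record)

    def add(self, record: dict) -> None:
        # Only title and description are embedded, but the whole record is stored.
        text = f"{record['title']} {record['description']}"
        self._entries.append((embed(text), dict(record)))

    def search(self, query: str, k: int = 3) -> list:
        """Return the k most similar records, each with all of its fields."""
        q = embed(query)
        ranked = sorted(self._entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [dict(record) for _, record in ranked[:k]]


index = AtomicIndex()
index.add({"id": "bug1", "title": "Login times out",
           "description": "Login hangs after 30 seconds",
           "module": "auth", "project": "p1", "version": "v1", "status": "Closed"})
index.add({"id": "bug2", "title": "Instance creation aborts",
           "description": "Creating an instance is interrupted midway",
           "module": "provisioning", "project": "p2", "version": "v2", "status": "Open"})

top = index.search("instance creation interrupted", k=1)[0]
print(top["id"], top["module"], top["status"])
```

Because the vector and the metadata live in the same entry, there is no separate lookup that can drift out of sync, and the context passed to the LLM contains every field it is expected to report.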
Results
{
  "is_duplicate": true,
  "similarity_score": 95,
  "justification": "The new defect's description and title match historical defect #bug2 in core issue, reproduction steps, and environment, all within module zz and ADx.0 instance creation interruption.",
  "most_similar_defect": {
    "缺陷ID": "bug2",
    "标题": "【adbx.0】mm环境创建adbx.0的实例会异常中断",
    "描述": "[缺陷描述]:xxxxxxxxxxxx描述",
    "模块": "zz",
    "归属项目": "318x",
    "版本": "v3.yy.y",
    "状态": "Open"
  }
}
Takeaways
In enterprise RAG applications, data engineering (proper indexing and metadata handling) is more critical than prompt tuning.
Maintaining record atomicity throughout the pipeline is essential for reliable generation.
When faced with LLM hallucinations, treat the RAG system as an end‑to‑end workflow and debug each stage systematically.
Alibaba Cloud Developer
