Why Your AI Defect Deduplication Returns Mixed Data and How to Fix It

This article details the challenges of building an AI‑powered defect deduplication system using Retrieval‑Augmented Generation, explains why LLMs produce composite (spliced) results, diagnoses the root cause as information loss in the RAG pipeline, and presents a step‑by‑step solution that restores atomicity of records for reliable duplicate detection.

Alibaba Cloud Developer

Aug 21, 2025

Background: The Challenge of Duplicate Defect Management

In proprietary cloud product development, identifying and managing duplicate defects is costly and time‑consuming because testers and developers struggle to efficiently search a large historical defect database.

Goal: Build a RAG‑Based Automated Defect Analysis Expert

The system uses Retrieval‑Augmented Generation (RAG) to index historical defects (title, description, module, version, etc.) into a vector database for semantic search, then lets a large language model (LLM) compare new defects with the most similar historical ones and output a JSON report.

Core Architecture

Knowledge‑base construction: store fields such as title, description, module, version as vectors.

Intelligent retrieval: extract semantic embeddings from a new defect and perform similarity search to retrieve candidate records.

LLM analysis & generation: feed the new defect and retrieved candidates to the LLM, prompting it to act as an experienced QA expert and produce a strict JSON output with duplication flag, similarity score, and the full fields of the most similar defect.
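The retrieval step can be sketched as a top‑k cosine similarity search. This is a minimal illustration, not the article's implementation: `embed` is a toy bag‑of‑words stand‑in for a real embedding model, and the names `top_k`, `history`, and the sample defects are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query: str, defects: list, k: int = 3) -> list:
    """Return the k historical defects most similar to the query text."""
    q = embed(query)
    scored = [
        (cosine(q, embed(d["title"] + " " + d["description"])), d)
        for d in defects
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]

# Hypothetical sample data for illustration only.
history = [
    {"id": "bug1", "title": "login page timeout",
     "description": "login request times out"},
    {"id": "bug2", "title": "instance creation interrupted",
     "description": "creating an instance is interrupted"},
]
candidates = top_k("instance creation is interrupted midway", history, k=1)
```

In a production system the embedding and search would be delegated to the embedding model and vector database; the point here is only that each candidate comes back as one whole dictionary.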

Problem Encountered: Unresolvable "Data Splicing"

Even with a carefully crafted prompt, the model returned a "most similar defect" whose fields (ID, title, description, module) came from different historical records, creating a composite entity that does not exist in the source data.
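Splicing of this kind can be caught mechanically before the report reaches a user. The sketch below assumes the retrieved candidates are available as dictionaries with an `id` key; the function names and sample records are hypothetical and not part of the original system.

```python
from typing import Optional

def find_source_record(reported: dict, candidates: list) -> Optional[dict]:
    """Return the candidate whose fields all equal the reported ones, else None."""
    for record in candidates:
        if all(record.get(name) == value for name, value in reported.items()):
            return record
    return None

def spliced_fields(reported: dict, candidates: list, id_field: str = "id") -> list:
    """List the reported fields that disagree with the record named by the reported ID."""
    by_id = {c[id_field]: c for c in candidates}
    source = by_id.get(reported.get(id_field))
    if source is None:
        # The reported ID does not exist among the candidates at all.
        return sorted(reported)
    return sorted(name for name, value in reported.items() if source.get(name) != value)

# Hypothetical candidates and a composite answer:
# ID and title come from bug2, module comes from bug1.
candidates = [
    {"id": "bug1", "title": "login timeout", "module": "auth"},
    {"id": "bug2", "title": "instance creation interrupted", "module": "zz"},
]
reported = {"id": "bug2", "title": "instance creation interrupted", "module": "auth"}

source = find_source_record(reported, candidates)   # None: no single record matches
mismatches = spliced_fields(reported, candidates)   # ["module"]
```

A check like this does not fix the cause, but it turns a silent composite record into a detectable failure.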

Initial Diagnosis: Prompt Engineering Failure

The first hypothesis was that the LLM treated the context as a flat pool of information and assembled the best match for each field independently, leading to cross‑record splicing.

Debugging Steps and Revised Prompt

Introduce a two‑stage workflow: first identify a single best matching record, then extract all required fields from that record only.

Emphasize record indivisibility with phrasing such as "treat as an indivisible whole" and "the unique champion record".

Add a strict constraint that all output fields must originate from the same historical defect.
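The three revisions above might be combined into a prompt builder along these lines. The article does not reproduce its actual prompt, so the wording, the `<record>` delimiters, and the function name `build_prompt` are assumptions made for illustration.

```python
import json

def build_prompt(new_defect: dict, candidates: list) -> str:
    """Assemble a two-stage prompt that presents each candidate as one delimited record."""
    records = "\n".join(
        f'<record index="{i}">\n{json.dumps(c, ensure_ascii=False, indent=2)}\n</record>'
        for i, c in enumerate(candidates, start=1)
    )
    new_text = json.dumps(new_defect, ensure_ascii=False, indent=2)
    return (
        "You are an experienced QA expert.\n"
        "Each <record> block is one historical defect; treat it as an indivisible whole.\n"
        "Step 1: choose the single record that best matches the new defect.\n"
        "Step 2: copy every output field from that record only.\n"
        "Constraint: all fields of most_similar_defect must come from the same record.\n\n"
        f"New defect:\n{new_text}\n\n"
        f"Historical records:\n{records}\n\n"
        "Respond with JSON containing is_duplicate, similarity_score, "
        "justification, and most_similar_defect."
    )

# Hypothetical inputs for illustration only.
prompt = build_prompt(
    {"title": "instance creation interrupted", "description": "creation stops midway"},
    [
        {"id": "bug1", "title": "login timeout", "module": "auth"},
        {"id": "bug2", "title": "instance creation interrupted", "module": "zz"},
    ],
)
```

Explicit delimiters only help if each record inside them is complete, which is why the prompt alone cannot compensate for missing fields in the retrieved context.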

Root Cause Analysis: Information Fragmentation in RAG

The prompt revisions did not eliminate the splicing, which pointed to the real issue in the RAG pipeline:

Indexing stage loss: only title and description are vectorized; structured metadata (ID, module, version, etc.) is stored separately and can become detached from the text chunks.

Retrieval stage fragmentation: the retriever returns several text fragments without their associated metadata, so the LLM receives incomplete context.

Generation stage hallucination: lacking the missing fields, the LLM fills them with hallucinated or mixed data.
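One way to confirm this diagnosis is to audit what the retriever actually returns before it reaches the LLM. The sketch below assumes each hit is a dictionary; the field list and the names `missing_fields` and `audit_hits` are hypothetical.

```python
# Fields the LLM is expected to report back; an assumed list for illustration.
REQUIRED_FIELDS = ("id", "title", "description", "module", "version")

def missing_fields(hit: dict, required=REQUIRED_FIELDS) -> list:
    """Return the required fields that a retrieved hit does not carry."""
    return [name for name in required if not hit.get(name)]

def audit_hits(hits: list) -> dict:
    """Map the position of each incomplete hit to its missing fields."""
    report = {}
    for position, hit in enumerate(hits):
        gaps = missing_fields(hit)
        if gaps:
            report[position] = gaps
    return report

# Hypothetical retriever output: the first hit is a text-only fragment.
hits = [
    {"title": "instance creation interrupted", "description": "creation stops midway"},
    {"id": "bug1", "title": "login timeout", "description": "request times out",
     "module": "auth", "version": "v1"},
]
report = audit_hits(hits)
```

A non-empty report means the model is being asked to output fields it was never given, so any value it produces for them is guessed or borrowed from another record.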

Solution: Preserve Atomicity of Records

Configure the index to include all relevant structured fields (ID, module, project, version, status) as part of each vector record, ensuring that every retrieved chunk carries its full metadata dictionary.

After this change, the LLM receives complete, coherent records, and the prompt works as intended, producing accurate, non‑spliced duplicate‑detection results.
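A minimal sketch of what an atomic record could look like, assuming a simple in-memory structure rather than any particular vector database: the text to be embedded and the full metadata dictionary travel together, and the context handed to the LLM is rendered one complete record at a time. `DefectRecord`, `make_record`, and `render_context` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class DefectRecord:
    text: str        # the part that gets embedded (title + description)
    metadata: dict   # every structured field, kept with the text

def make_record(defect: dict) -> DefectRecord:
    """Build one index entry that keeps the embedded text and all fields together."""
    text = f"{defect['title']}\n{defect['description']}"
    return DefectRecord(text=text, metadata=dict(defect))

def render_context(records: list) -> str:
    """Render retrieved records as delimited blocks, one whole record per block."""
    blocks = []
    for record in records:
        lines = [f"{key}: {value}" for key, value in record.metadata.items()]
        blocks.append(
            "--- record start ---\n" + "\n".join(lines) + "\n--- record end ---"
        )
    return "\n".join(blocks)

# Hypothetical defect with the structured fields named in the text above.
record = make_record({
    "id": "bug2",
    "title": "instance creation interrupted",
    "description": "creation stops midway",
    "module": "zz",
    "project": "318x",
    "version": "v3.yy.y",
    "status": "Open",
})
context = render_context([record])
```

How the metadata is attached in practice depends on the vector store in use; the design choice that matters is that no retrieval path can return the text without its fields.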

Results

{
  "is_duplicate": true,
  "similarity_score": 95,
  "justification": "The new defect's description and title match historical defect #bug2 in core issue, reproduction steps, and environment, all within module zz and ADx.0 instance creation interruption.",
  "most_similar_defect": {
    "缺陷ID": "bug2",
    "标题": "【adbx.0】mm环境创建adbx.0的实例会异常中断",
    "描述": "[缺陷描述]：xxxxxxxxxxxx描述",
    "模块": "zz",
    "归属项目": "318x",
    "版本": "v3.yy.y",
    "状态": "Open"
  }
}

The keys of most_similar_defect are the original Chinese field names: 缺陷ID (defect ID), 标题 (title), 描述 (description), 模块 (module), 归属项目 (project), 版本 (version), and 状态 (status). The title translates roughly as "[adbx.0] Creating an adbx.0 instance in the mm environment is abnormally interrupted", and the description placeholder reads "[Defect description]: xxxxxxxxxxxx description".

Takeaways

In enterprise RAG applications, data engineering (proper indexing and metadata handling) is more critical than prompt tuning.

Maintaining record atomicity throughout the pipeline is essential for reliable generation.

When faced with LLM hallucinations, treat the RAG system as an end‑to‑end workflow and debug each stage systematically.
