RAG Level 1: Avoid Dirty Data Poisoning Your AI – A Data Cleaning Guide

This article explains why noisy documents cripple Retrieval‑Augmented Generation, catalogues the common types of garbage data, describes three typical document‑structure problems, walks through three common cleaning traps (over‑cleaning, encoding pitfalls, regex overkill), and provides a configurable LangChain pipeline with deduplication and validation best practices.

AI Architect Hub

Problem Scenario

A customer-service chatbot returns an unrelated policy document when asked "How do I reset my password?". The root cause is not the language model but noisy data injected during PDF parsing: headers, footers, watermarks, and repeated copyright notices become part of the embeddings.

Garbage in, garbage out.

Impact of Dirty Data on Embeddings

Embedding models generate vectors from the semantic content of the input text. When the input contains noisy fragments, the resulting vector mixes useful semantics (e.g., password‑management knowledge) with useless fragments (page numbers, navigation links, legal boilerplate, copyright notices). This dilutes the target concept and harms retrieval accuracy.
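The dilution effect can be illustrated with a toy bag-of-words model. This is only an analogy for how dense embeddings behave, not a real embedding API; the texts and token counts are illustrative:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # cosine similarity between two sparse token-count vectors
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = Counter("reset password account settings".split())
clean_doc = Counter("open account settings and choose reset password".split())
# the same document plus typical footer/navigation noise
noisy_doc = clean_doc + Counter(
    "page 3 of 12 confidential copyright 2024 corp home back contact us".split()
)

# the noisy variant scores lower against the same query
print(cosine(query, clean_doc) > cosine(query, noisy_doc))
```

The noise tokens add nothing to the query match but inflate the vector's norm, pulling the similarity score down, which is exactly the dilution described above.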

Typical Noise Types

Header/Footer – repeated page markers such as "公司保密文件 第3页" ("Company Confidential, Page 3").

Watermark text – words like "机密" ("Confidential") or "Draft" that mislead semantic judgment.

Navigation menus – strings like "[首页][返回][联系我们]" ("[Home][Back][Contact Us]") that add meaningless commands.

Copyright notices – repetitive statements like "© 2024 XXX Corp." that dilute the theme.

HTML tags – raw tags (e.g., <div><span class="xxx">) that pollute the plain-text semantics.

OCR garble – unreadable characters (e.g., □■■�) that cause vector anomalies.

Duplicate paragraphs – repeated content that over‑weights retrieval results.

Encoding issues – invisible characters such as \u200b, \ufeff, or BOM headers that interfere with downstream processing.
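For instance, the invisible characters from the last bullet can be stripped with a one-line substitution (a minimal sketch; extend the character class to cover whatever your corpus actually contains):

```python
import re

raw = "\ufeff如何\u200b重置\u200b密码"   # BOM plus zero-width spaces
cleaned = re.sub(r'[\u200b\u200c\u200d\ufeff]', '', raw)
print(cleaned)  # 如何重置密码
```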

Three Typical "Diseases" in Document Structure

Structure loss: Markdown headings (e.g., "# Model Deployment") collapse into a flat text block, so unrelated sections mix during chunking.

Table fragmentation: tabular data is split into separate lines, destroying the relationships between columns.

Semantic fragmentation: a complete FAQ entry is broken into multiple fragments, so retrieved answers are only partial.
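To see why structure loss hurts chunking, compare against a heading-aware split that keeps each section attached to its heading. This is a toy sketch; in production, LangChain's MarkdownHeaderTextSplitter does this properly:

```python
import re

def split_by_headings(markdown: str) -> list[str]:
    # split immediately before each level-1/2 heading, so a section
    # always travels together with the heading that names it
    parts = re.split(r'(?m)^(?=#{1,2} )', markdown)
    return [p.strip() for p in parts if p.strip()]

doc = "# Model Deployment\nUse Docker images.\n\n# Model Monitoring\nExport Prometheus metrics."
for chunk in split_by_headings(doc):
    print(repr(chunk))
```

Flattening the document first would destroy the `#` markers this splitter relies on, which is exactly the "structure loss" disease.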

Common Cleaning Traps

Trap 1 – Over‑cleaning (throwing away the gold)

Removing all non‑alphanumeric characters also deletes useful punctuation and numbers. Example of a bad practice:

# Bad practice: delete all non‑alphanumeric characters (Chinese kept)
text = re.sub(r'[^a-zA-Z0-9\u4e00-\u9fa5]', '', text)
# Input:  "Python >= 3.8, 建议使用 >= 3.10 版本"
# Output: "Python38建议使用310版本"  (operators, dots, and spaces all lost)

Correct approach: delete only control characters.

# Good practice: delete control characters only
text = re.sub(r'[\x00-\x1f\x7f-\x9f]', '', text)

Treat stop‑word removal as an experiment, not a default: words like 不 ("not"), 或 ("or"), and 必须 ("must") carry decisive semantics for embedding retrieval.
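A quick demonstration of the risk (the strings are illustrative):

```python
text = "该接口不支持批量删除"     # "this API does NOT support batch delete"
naive = text.replace("不", "")    # treating 不 ("not") as a stop word
print(naive)                      # 该接口支持批量删除 — the meaning is inverted
```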

Trap 2 – Encoding Pitfalls (invisible traps)

Common invisible issues:

# BOM causes JSON parsing failure
text = "\ufeff{\"name\": \"张三\"}"
# Full‑width vs half‑width characters look identical but differ in code points
"Ｐｙｔｈｏｎ＝Ｐｙｔｈｏｎ" vs "Python=Python"
# Mixed encodings in one corpus (UTF‑8 + GBK) produce mojibake on decode

Robust handling:

import unicodedata

def normalize_encoding(text):
    # Unicode NFC normalization
    text = unicodedata.normalize('NFC', text)
    # Remove BOM
    text = text.replace('\ufeff', '')
    return text
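The mixed UTF‑8/GBK case must be handled at the byte level, before text ever reaches normalize_encoding. A common fallback-chain sketch (the encoding list is an assumption; order it by what your corpus actually contains):

```python
def robust_decode(data: bytes) -> str:
    # try strict decoding in order of likelihood; gb18030 is a superset of GBK
    for enc in ("utf-8", "gb18030"):
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # last resort: keep going, but mark undecodable bytes with U+FFFD
    return data.decode("utf-8", errors="replace")

print(robust_decode("张三".encode("utf-8")))  # 张三
print(robust_decode("张三".encode("gbk")))    # 张三 (caught by the GB fallback)
```

Strict decoding first matters: `errors="replace"` everywhere would silently corrupt GBK bytes instead of routing them to the right codec.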

Trap 3 – Regex Overkill (good text gets killed)

A regex that deletes any line consisting solely of digits also removes meaningful standalone values, such as a year in a release history or a port number in a configuration snippet.

# Bad: delete any line that is only digits
text = re.sub(r'^\d+$', '', text, flags=re.MULTILINE)
# Standalone lines like "2024" (a year) or "8080" (a port) are wiped out too

Safer practice: combine regex with contextual checks rather than a blind one‑liner.
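One way to add context is to drop a digits-only line only when it sits isolated between blank lines, the typical position of a page-number footer (a heuristic sketch; tune the length limit and conditions for your documents):

```python
import re

def drop_isolated_number_lines(text: str) -> str:
    lines = text.split('\n')
    kept = []
    for i, line in enumerate(lines):
        prev_blank = i == 0 or not lines[i - 1].strip()
        next_blank = i == len(lines) - 1 or not lines[i + 1].strip()
        # only a short, digits-only line surrounded by blanks looks like a page number
        if re.fullmatch(r'\d{1,4}', line.strip()) and prev_blank and next_blank:
            continue
        kept.append(line)
    return '\n'.join(kept)

sample = "Release history:\n2024\n2025\n\n7\n\nNext chapter"
print(drop_isolated_number_lines(sample))
```

The years inside the release-history block survive because their neighbours are non-blank, while the isolated "7" is treated as a page number and removed.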

Code Demo – LangChain‑Based Cleaning Pipeline

The following configurable pipeline implements the concepts above.

"""RAG Document Cleaning Pipeline – Level 1 code"""

import re, hashlib, logging
from dataclasses import dataclass, field
from typing import List, Callable, Optional
from langchain_core.documents import Document

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class CleaningConfig:
    """Cleaning configuration options"""
    remove_html: bool = True
    remove_urls: bool = True
    remove_page_numbers: bool = True
    remove_headers_footers: bool = True
    remove_copyright: bool = True
    normalize_whitespace: bool = True
    normalize_encoding: bool = True
    remove_control_chars: bool = True
    custom_stop_sentences: List[str] = field(default_factory=list)
    custom_replace_rules: dict = field(default_factory=dict)

class DocumentCleaner:
    """Document cleaner"""

    def __init__(self, config: Optional[CleaningConfig] = None):
        self.config = config or CleaningConfig()
        self.processing_log: List[dict] = []

    def clean(self, documents: List[Document]) -> List[Document]:
        cleaned_docs = []
        for doc in documents:
            original_text = doc.page_content
            cleaned_text = self._clean_text(original_text, doc.metadata)
            if self._is_valid_content(cleaned_text):
                doc.page_content = cleaned_text
                cleaned_docs.append(doc)
                logger.debug(f"Cleaned: {len(original_text)} → {len(cleaned_text)} chars")
            else:
                logger.warning(f"Skipped short content: {doc.metadata.get('source', 'unknown')}")
        return cleaned_docs

    def _clean_text(self, text: str, metadata: dict) -> str:
        # 1. Encoding normalization
        if self.config.normalize_encoding:
            text = self._normalize_encoding(text)
        # 2. Control character removal
        if self.config.remove_control_chars:
            text = self._remove_control_characters(text)
        # 3. HTML tag removal
        if self.config.remove_html:
            text = self._remove_html_tags(text)
        # 4. URL removal
        if self.config.remove_urls:
            text = self._remove_urls(text)
        # 5. Page number removal
        if self.config.remove_page_numbers:
            text = self._remove_page_numbers(text)
        # 6. Header/footer removal
        if self.config.remove_headers_footers:
            text = self._remove_headers_footers(text, metadata)
        # 7. Copyright removal
        if self.config.remove_copyright:
            text = self._remove_copyright(text)
        # 8. Whitespace normalization
        if self.config.normalize_whitespace:
            text = self._normalize_whitespace(text)
        # 9. Custom rules
        text = self._apply_custom_rules(text)
        return text.strip()

    def _normalize_encoding(self, text: str) -> str:
        import unicodedata
        text = unicodedata.normalize('NFC', text)
        text = text.replace('\ufeff', '')
        return text

    def _remove_control_characters(self, text: str) -> str:
        return re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]', '', text)

    def _remove_html_tags(self, text: str) -> str:
        text = re.sub(r'<[^>]+>', ' ', text)
        text = re.sub(r'&[a-zA-Z]+;', ' ', text)
        text = re.sub(r'&#\d+;', ' ', text)
        return text

    def _remove_urls(self, text: str) -> str:
        pattern = r'https?://[^\s<>"{}|\\^`\[\]]+|www\.[^\s<>"{}|\\^`\[\]]+'
        return re.sub(pattern, '[URL]', text)

    def _remove_page_numbers(self, text: str) -> str:
        patterns = [r'第\s*\d+\s*页', r'Page\s+\d+', r'\d+\s*/\s*\d+\s*页', r'-\s*\d+\s*-']
        for pattern in patterns:
            text = re.sub(pattern, '', text)
        return text

    def _remove_headers_footers(self, text: str, metadata: dict) -> str:
        lines = [line.strip() for line in text.split('\n')]
        cleaned = []
        for line in lines:
            # drop horizontal-rule separator lines
            if re.match(r'^[_\-=*]{3,}$', line):
                continue
            # drop short lines repeated many times (typical headers/footers)
            if line and len(line) < 50 and lines.count(line) > 3:
                continue
            cleaned.append(line)
        return '\n'.join(cleaned)

    def _remove_copyright(self, text: str) -> str:
        patterns = [r'©\s*\d{4}[^\n]*', r'版权所有[^\n]*', r'All\s+rights\s+reserved', r'Copyright\s+[^\n]*']
        for pattern in patterns:
            text = re.sub(pattern, '', text, flags=re.IGNORECASE)
        return text

    def _normalize_whitespace(self, text: str) -> str:
        text = text.replace('\r\n', '\n').replace('\r', '\n')
        text = re.sub(r'[ \t]+', ' ', text)
        text = re.sub(r'\n{3,}', '\n\n', text)
        return text

    def _apply_custom_rules(self, text: str) -> str:
        for stop_sentence in self.config.custom_stop_sentences:
            text = text.replace(stop_sentence, '')
        for old, new in self.config.custom_replace_rules.items():
            text = text.replace(old, new)
        return text

    def _is_valid_content(self, text: str) -> bool:
        if not text or len(text.strip()) < 10:
            return False
        alpha_ratio = sum(c.isalnum() for c in text) / len(text)
        if alpha_ratio < 0.1:
            return False
        return True

class Deduplicator:
    """Hash + semantic deduplication"""

    def __init__(self, similarity_threshold: float = 0.85):
        self.similarity_threshold = similarity_threshold
        self.seen_hashes = set()
        self.seen_embeddings = []  # (embedding, text)

    def deduplicate_by_hash(self, documents: List[Document]) -> List[Document]:
        unique = []
        for doc in documents:
            h = hashlib.md5(doc.page_content.encode('utf-8')).hexdigest()
            if h not in self.seen_hashes:
                self.seen_hashes.add(h)
                unique.append(doc)
        removed = len(documents) - len(unique)
        if removed:
            logger.info(f"Hash deduplication removed {removed} duplicate docs")
        return unique

    def deduplicate_by_similarity(self, documents: List[Document], embeddings: List[List[float]]) -> List[Document]:
        from sklearn.metrics.pairwise import cosine_similarity
        import numpy as np
        if len(documents) <= 1:
            return documents
        unique = []
        emb_arr = np.array(embeddings)
        for doc, emb in zip(documents, emb_arr):
            duplicate = False
            for _, existing_emb in self.seen_embeddings:
                if cosine_similarity([emb], [existing_emb])[0][0] >= self.similarity_threshold:
                    duplicate = True
                    break
            if not duplicate:
                self.seen_embeddings.append((emb, doc.page_content))
                unique.append(doc)
        removed = len(documents) - len(unique)
        if removed:
            logger.info(f"Semantic deduplication removed {removed} similar docs")
        return unique

def build_cleaning_pipeline(config: Optional[CleaningConfig] = None,
                           enable_dedup: bool = True,
                           dedup_threshold: float = 0.85) -> Callable:
    """Construct the full cleaning pipeline"""
    cleaner = DocumentCleaner(config)
    deduplicator = Deduplicator(similarity_threshold=dedup_threshold)

    def pipeline(documents: List[Document]) -> List[Document]:
        cleaned = cleaner.clean(documents)
        logger.info(f"Basic cleaning: {len(documents)} → {len(cleaned)} docs")
        if enable_dedup:
            deduped = deduplicator.deduplicate_by_hash(cleaned)
            logger.info(f"Deduplication: {len(cleaned)} → {len(deduped)} docs")
            return deduped
        return cleaned
    return pipeline
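For clarity, here is the hash-deduplication step in isolation, the same idea as Deduplicator.deduplicate_by_hash above but stripped of the LangChain Document wrapper (MD5 here only fingerprints content, so its cryptographic weakness does not matter):

```python
import hashlib

def dedup_by_hash(texts: list[str]) -> list[str]:
    seen, unique = set(), []
    for t in texts:
        h = hashlib.md5(t.encode('utf-8')).hexdigest()
        if h not in seen:          # first occurrence wins, order preserved
            seen.add(h)
            unique.append(t)
    return unique

docs = [
    "Reset your password under Settings.",
    "Reset your password under Settings.",   # exact duplicate
    "Contact support for billing questions.",
]
print(dedup_by_hash(docs))   # the duplicate is gone
```

Note that hashing only catches byte-identical duplicates; near-duplicates (reworded or re-encoded copies) need the embedding-similarity pass.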

Key Design Points

Configurable : All rules are driven by CleaningConfig, allowing scenario‑specific adjustments without code changes.

Logging : Each step logs its actions for debugging.

Progressive cleaning : Encoding normalization first, then noise removal, finally formatting.

Validity check : Documents that become too short or contain almost no alphanumerics are filtered out.

Practical Recommendations

Configurable Rule Sets

Example configurations for different document types:

# Technical docs – lighter cleaning (keep copyright/license lines, which may carry licensing info)
tech_config = CleaningConfig(remove_html=True, remove_page_numbers=True, remove_copyright=False)

# FAQ – aggressive cleaning
faq_config = CleaningConfig(
    remove_html=True,
    remove_page_numbers=True,
    remove_headers_footers=True,
    remove_copyright=True,
    custom_stop_sentences=["点击此处了解更多", "如有疑问请联系"]
)

Backup Original Content

for doc in documents:
    doc.metadata['original_content'] = doc.page_content  # backup

Record Cleaning Logs

processing_log = []
for doc in documents:
    # assumes the original text was backed up in metadata before cleaning
    original = doc.metadata.get('original_content', doc.page_content)
    log_entry = {
        'source': doc.metadata.get('source'),
        'original_length': len(original),
        'operations_applied': ['html_removal', 'page_num_removal'],
        'final_length': len(doc.page_content),
        'removed_ratio': (1 - len(doc.page_content) / len(original)) if original else 0.0,
    }
    processing_log.append(log_entry)

Validate Cleaning Effect

def validate_cleaning(documents):
    """Validate cleaning results"""
    issues = []
    for doc in documents:
        if re.search(r'<[^>]+>', doc.page_content):
            issues.append(f"HTML residue: {doc.metadata.get('source')}")
        if len(doc.page_content) > 50000:
            issues.append(f"Content too long: {doc.metadata.get('source')}")
        chinese_chars = len(re.findall(r'[\u4e00-\u9fa5]', doc.page_content))
        total = len(doc.page_content)
        if total and chinese_chars / total < 0.1:
            issues.append(f"Low Chinese ratio: {doc.metadata.get('source')}")
    return issues

Reflection Questions

When cleaning medical records that contain patient names and ID numbers, what additional privacy‑preserving steps should be taken beyond noise removal?

How would you balance thorough cleaning against processing cost and latency in a production RAG pipeline?

For OCR‑generated PDFs that confuse the letter “l” with the digit “1” or “O” with “0”, what strategy would you adopt?

Tags: AI · LangChain · RAG · Deduplication · Embedding · Data Cleaning · Pipeline
Written by AI Architect Hub

Discuss AI and architecture; a ten-year veteran of major tech companies now transitioning to AI and continuing the journey.