RAG Level 1: Avoid Dirty Data Poisoning Your AI – A Data Cleaning Guide
This article explains why noisy documents cripple Retrieval‑Augmented Generation, catalogs the common types of garbage data, describes three typical document‑structure problems, warns against over‑cleaning, encoding, and regex pitfalls, and provides a configurable LangChain cleaning pipeline with deduplication and validation best practices.
Problem Scenario
A chatbot built for customer service returns an unrelated policy when asked "How to reset password". The root cause is not the language model but noisy data injected during PDF parsing—headers, footers, watermarks, and repeated copyright notices become part of the embeddings.
Garbage in, garbage out.
Impact of Dirty Data on Embeddings
Embedding models generate vectors from the semantic content of the input text. When the input contains noisy fragments, the resulting vector mixes useful semantics (e.g., password‑management knowledge) with useless fragments (page numbers, navigation links, legal boilerplate, copyright notices). This dilutes the target concept and harms retrieval accuracy.
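To make the dilution concrete, here is a toy sketch that uses TF-IDF vectors as a crude stand-in for an embedding model (the query and passages are invented for illustration). Appending boilerplate to a passage measurably lowers its cosine similarity to the query:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "how to reset password"
clean = "To reset your password, open Settings and choose Reset Password."
noisy = clean + " Page 3 of 12. Company Confidential. © 2024 XXX Corp. [Home][Back][Contact Us]"

# Vectorize all three texts over one shared vocabulary.
m = TfidfVectorizer().fit_transform([query, clean, noisy])
print(cosine_similarity(m[0], m[1])[0][0])  # query vs clean passage
print(cosine_similarity(m[0], m[2])[0][0])  # lower: boilerplate dilutes the signal

A real embedding model behaves the same way in kind: the noisy tokens pull the vector away from the passage's core meaning.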
Typical Noise Types
Header/Footer – repeated page markers such as "公司保密文件 第3页" ("Company Confidential Document, Page 3").
Watermark text – words like "机密" ("Confidential") or "Draft" that mislead semantic judgment.
Navigation menu – strings like "[首页][返回][联系我们]" ("[Home][Back][Contact Us]") that add meaningless commands.
Copyright notice – repetitive statements like "© 2024 XXX Corp." that dilute the theme.
HTML tags – raw tags (e.g., <div><span class="xxx">) that pollute pure text semantics.
OCR garble – unreadable characters (e.g., □■■�) that cause vector anomalies.
Duplicate paragraphs – repeated content that over‑weights retrieval results.
Encoding issues – invisible characters such as the zero‑width space \u200b or the BOM \ufeff that silently interfere with downstream processing (see the detection sketch after this list).
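The last item deserves special care because the characters are invisible to the eye. A minimal detection-and-strip sketch (the character list is illustrative, not exhaustive):

INVISIBLE = {
    '\u200b': 'ZERO WIDTH SPACE',
    '\u200c': 'ZERO WIDTH NON-JOINER',
    '\ufeff': 'BOM / ZERO WIDTH NO-BREAK SPACE',
}

def report_invisible(text: str) -> dict:
    """Count invisible characters so their removal can be verified."""
    return {name: text.count(ch) for ch, name in INVISIBLE.items() if ch in text}

def strip_invisible(text: str) -> str:
    for ch in INVISIBLE:
        text = text.replace(ch, '')
    return text

sample = "\ufeffHow to reset\u200b password"
print(report_invisible(sample))  # {'ZERO WIDTH SPACE': 1, 'BOM / ZERO WIDTH NO-BREAK SPACE': 1}
print(strip_invisible(sample))   # "How to reset password"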
Three Typical "Diseases" in Document Structure
Structure loss – Markdown headings (e.g., "# Model Deployment") collapse into a flat text block, causing unrelated sections to mix during chunking (see the heading‑aware splitter sketch after this list).
Table fragmentation – tabular data is split into separate lines, destroying column relationships.
Semantic fragmentation – a complete FAQ entry is broken into multiple fragments, so answers become partial.
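For the first disease, a heading-aware splitter preserves section boundaries during chunking. A minimal sketch using LangChain's MarkdownHeaderTextSplitter, assuming the source is Markdown (the sample text and the "section" metadata key are illustrative):

from langchain_text_splitters import MarkdownHeaderTextSplitter

md = "# Model Deployment\nUse the Docker image...\n\n# Billing\nInvoices are issued monthly..."
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("#", "section")])
for chunk in splitter.split_text(md):
    # Each chunk carries its own heading in metadata, so deployment text
    # never bleeds into billing chunks at embedding time.
    print(chunk.metadata["section"], "->", chunk.page_content)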
Common Cleaning Traps
Trap 1 – Over‑cleaning (throwing away the gold)
Removing every non‑alphanumeric character also deletes meaningful punctuation and operators: version constraints, decimal points, and even spaces disappear. Example of a bad practice:
# Bad practice: delete all non-alphanumeric characters
text = re.sub(r'[^a-zA-Z0-9\u4e00-\u9fa5]', '', text)
# Input:  "Python >= 3.8, 建议使用 >= 3.10 版本"  ("Python >= 3.8; 3.10+ recommended")
# Result: "Python38建议使用310版本" – operators, dots, and spaces are all gone

Correct approach: delete only control characters.
# Good practice: delete control characters only
# (note: this simple range also strips \t and \n; the pipeline version below
#  narrows the range to keep them)
text = re.sub(r'[\x00-\x1f\x7f-\x9f]', '', text)

Stop‑word removal should be validated experimentally rather than applied blindly, because words like "不" (not), "或" (or), and "必须" (must) carry meaning that embedding retrieval depends on.
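A quick illustration of what goes wrong when negation words are treated as noise (the sentence is invented for demonstration):

# Treating "不" (not) as a stop word flips an instruction into its opposite.
text = "升级后不需要重启服务"       # "No service restart is needed after the upgrade"
stripped = text.replace("不", "")   # naive stop-word removal
print(stripped)                     # "升级后需要重启服务" – "A restart IS needed"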
Trap 2 – Encoding Pitfalls (invisible traps)
Common invisible issues:
# BOM causes JSON parsing failure
text = "\ufeff{\"name\": \"张三\"}"
# Full-width vs half-width characters: visually near-identical, different code points
"Python＝Python" (full-width ＝) vs "Python=Python" (half-width =)
# Mixed encodings (UTF-8 + GBK) in the same corpus

Robust handling:
import unicodedata
def normalize_encoding(text):
    # Unicode NFC normalization
    text = unicodedata.normalize('NFC', text)
    # Remove BOM
    text = text.replace('\ufeff', '')
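    # Note: NFC does not fold width variants, so full-width "＝" survives this
    # function. 'NFKC' would fold "＝" to "=" (and "Ｐｙｔｈｏｎ" to "Python"),
    # at the cost of also rewriting compatibility characters such as "²" to "2".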
    return text

Trap 3 – Regex Overkill (good text gets killed)
A regex that deletes any line consisting solely of digits also removes meaningful standalone numbers, such as a year used as a section heading.
# Bad: delete any line that is only digits
text = re.sub(r'^\d+$', '', text, flags=re.MULTILINE)
# Result: a line containing just "2024" (e.g., a year heading) is wiped out

Safer practice: combine regex with contextual checks rather than a blind one‑liner.
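A minimal sketch of such a contextual check (the "isolated small number" heuristic is an illustrative assumption, not a universal rule):

import re

def drop_digit_lines_carefully(text: str, max_page: int = 2000) -> str:
    """Drop a digit-only line only when it plausibly is a page number:
    short, numerically small, and isolated between blank lines."""
    lines = text.split('\n')
    kept = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if re.fullmatch(r'\d{1,4}', stripped):
            prev_blank = i == 0 or not lines[i - 1].strip()
            next_blank = i == len(lines) - 1 or not lines[i + 1].strip()
            if prev_blank and next_blank and int(stripped) <= max_page:
                continue  # looks like a page number – drop it
        kept.append(line)
    return '\n'.join(kept)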
Code Demo – LangChain‑Based Cleaning Pipeline
The following configurable pipeline implements the concepts above.
"""RAG Document Cleaning Pipeline – Level 1 code"""
import re, hashlib, logging
from dataclasses import dataclass, field
from typing import List, Callable, Optional
from langchain_core.documents import Document
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@dataclass
class CleaningConfig:
"""Cleaning configuration options"""
remove_html: bool = True
remove_urls: bool = True
remove_page_numbers: bool = True
remove_headers_footers: bool = True
remove_copyright: bool = True
normalize_whitespace: bool = True
normalize_encoding: bool = True
remove_control_chars: bool = True
custom_stop_sentences: List[str] = field(default_factory=list)
custom_replace_rules: dict = field(default_factory=dict)
class DocumentCleaner:
"""Document cleaner"""
def __init__(self, config: Optional[CleaningConfig] = None):
self.config = config or CleaningConfig()
self.processing_log: List[dict] = []
def clean(self, documents: List[Document]) -> List[Document]:
cleaned_docs = []
for doc in documents:
original_text = doc.page_content
cleaned_text = self._clean_text(original_text, doc.metadata)
if self._is_valid_content(cleaned_text):
doc.page_content = cleaned_text
cleaned_docs.append(doc)
logger.debug(f"Cleaned: {len(original_text)} → {len(cleaned_text)} chars")
else:
logger.warning(f"Skipped short content: {doc.metadata.get('source', 'unknown')}")
return cleaned_docs
def _clean_text(self, text: str, metadata: dict) -> str:
# 1. Encoding normalization
if self.config.normalize_encoding:
text = self._normalize_encoding(text)
# 2. Control character removal
if self.config.remove_control_chars:
text = self._remove_control_characters(text)
# 3. HTML tag removal
if self.config.remove_html:
text = self._remove_html_tags(text)
# 4. URL removal
if self.config.remove_urls:
text = self._remove_urls(text)
# 5. Page number removal
if self.config.remove_page_numbers:
text = self._remove_page_numbers(text)
# 6. Header/footer removal
if self.config.remove_headers_footers:
text = self._remove_headers_footers(text, metadata)
# 7. Copyright removal
if self.config.remove_copyright:
text = self._remove_copyright(text)
# 8. Whitespace normalization
if self.config.normalize_whitespace:
text = self._normalize_whitespace(text)
# 9. Custom rules
text = self._apply_custom_rules(text)
return text.strip()
def _normalize_encoding(self, text: str) -> str:
import unicodedata
text = unicodedata.normalize('NFC', text)
text = text.replace('\ufeff', '')
return text
def _remove_control_characters(self, text: str) -> str:
return re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]', '', text)
def _remove_html_tags(self, text: str) -> str:
text = re.sub(r'<[^>]+>', ' ', text)
text = re.sub(r'&[a-zA-Z]+;', ' ', text)
text = re.sub(r'&#\d+;', ' ', text)
return text
def _remove_urls(self, text: str) -> str:
pattern = r'https?://[^\s<>"{}|\\^`\[\]]+|www\.[^\s<>"{}|\\^`\[\]]+'
return re.sub(pattern, '[URL]', text)
def _remove_page_numbers(self, text: str) -> str:
patterns = [r'第\s*\d+\s*页', r'Page\s+\d+', r'\d+\s*/\s*\d+\s*页', r'-\s*\d+\s*-']
for pattern in patterns:
text = re.sub(pattern, '', text)
return text
def _remove_headers_footers(self, text: str, metadata: dict) -> str:
lines = text.split('
')
cleaned = []
for line in lines:
line = line.strip()
if re.match(r'^[_\-=*]{3,}$', line):
continue
if len(line) < 50 and line and lines.count(line) > 3:
continue
cleaned.append(line)
return '
'.join(cleaned)
def _remove_copyright(self, text: str) -> str:
patterns = [r'©\s*\d{4}[^
]*', r'版权所有[^
]*', r'All\s+rights\s+reserved', r'Copyright\s+[^
]*']
for pattern in patterns:
text = re.sub(pattern, '', text, flags=re.IGNORECASE)
return text
def _normalize_whitespace(self, text: str) -> str:
text = text.replace('
', '
').replace('\r', '
')
text = re.sub(r'[ \t]+', ' ', text)
text = re.sub(r'
{3,}', '
', text)
return text
def _apply_custom_rules(self, text: str) -> str:
for stop_sentence in self.config.custom_stop_sentences:
text = text.replace(stop_sentence, '')
for old, new in self.config.custom_replace_rules.items():
text = text.replace(old, new)
return text
def _is_valid_content(self, text: str) -> bool:
if not text or len(text.strip()) < 10:
return False
alpha_ratio = sum(c.isalnum() for c in text) / len(text)
if alpha_ratio < 0.1:
return False
return True
class Deduplicator:
"""Hash + semantic deduplication"""
def __init__(self, similarity_threshold: float = 0.85):
self.similarity_threshold = similarity_threshold
self.seen_hashes = set()
self.seen_embeddings = [] # (embedding, text)
def deduplicate_by_hash(self, documents: List[Document]) -> List[Document]:
unique = []
for doc in documents:
h = hashlib.md5(doc.page_content.encode('utf-8')).hexdigest()
if h not in self.seen_hashes:
self.seen_hashes.add(h)
unique.append(doc)
removed = len(documents) - len(unique)
if removed:
logger.info(f"Hash deduplication removed {removed} duplicate docs")
return unique
def deduplicate_by_similarity(self, documents: List[Document], embeddings: List[List[float]]) -> List[Document]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
if len(documents) <= 1:
return documents
unique = []
emb_arr = np.array(embeddings)
for doc, emb in zip(documents, emb_arr):
duplicate = False
for _, existing_emb in self.seen_embeddings:
if cosine_similarity([emb], [existing_emb])[0][0] >= self.similarity_threshold:
duplicate = True
break
if not duplicate:
self.seen_embeddings.append((emb, doc.page_content))
unique.append(doc)
removed = len(documents) - len(unique)
if removed:
logger.info(f"Semantic deduplication removed {removed} similar docs")
return unique
def build_cleaning_pipeline(config: Optional[CleaningConfig] = None,
enable_dedup: bool = True,
dedup_threshold: float = 0.85) -> Callable:
"""Construct the full cleaning pipeline"""
cleaner = DocumentCleaner(config)
deduplicator = Deduplicator(similarity_threshold=dedup_threshold)
def pipeline(documents: List[Document]) -> List[Document]:
cleaned = cleaner.clean(documents)
logger.info(f"Basic cleaning: {len(documents)} → {len(cleaned)} docs")
if enable_dedup:
deduped = deduplicator.deduplicate_by_hash(cleaned)
logger.info(f"Deduplication: {len(cleaned)} → {len(deduped)} docs")
return deduped
return cleaned
return pipelineKey Design Points
Configurable : All rules are driven by CleaningConfig, allowing scenario‑specific adjustments without code changes.
Logging : Each step logs its actions for debugging.
Progressive cleaning : Encoding normalization first, then noise removal, finally formatting.
Validity check : Documents that become too short or contain almost no alphanumerics are filtered out.
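A brief usage sketch of the pipeline above (the sample documents and the expected output are illustrative):

from langchain_core.documents import Document

raw_docs = [
    Document(page_content='<div>How to reset password</div> 第3页 © 2024 XXX Corp.',
             metadata={'source': 'faq.pdf'}),
    Document(page_content='<div>How to reset password</div> 第3页 © 2024 XXX Corp.',
             metadata={'source': 'faq_copy.pdf'}),  # exact duplicate
]

pipeline = build_cleaning_pipeline(enable_dedup=True)
docs = pipeline(raw_docs)
print(len(docs))             # 1 – the duplicate is removed by hash dedup
print(docs[0].page_content)  # "How to reset password" – noise stripped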
Practical Recommendations
Configurable Rule Sets
Example configurations for different document types:
# Technical docs – keep code blocks
tech_config = CleaningConfig(remove_html=True, remove_page_numbers=True, remove_copyright=False)
# FAQ – aggressive cleaning
faq_config = CleaningConfig(
    remove_html=True,
    remove_page_numbers=True,
    remove_headers_footers=True,
    remove_copyright=True,
    # "Click here to learn more", "Contact us with any questions"
    custom_stop_sentences=["点击此处了解更多", "如有疑问请联系"]
)

Backup Original Content
for doc in documents:
    doc.metadata['original_content'] = doc.page_content  # backup before cleaning

Record Cleaning Logs
processing_log = []
for doc in documents:
    # Compare against the backup made before cleaning.
    original_length = len(doc.metadata.get('original_content', doc.page_content))
    log_entry = {
        'source': doc.metadata.get('source'),
        'original_length': original_length,
        'operations_applied': ['html_removal', 'page_num_removal'],
        'final_length': len(doc.page_content),
        'removed_ratio': 1 - len(doc.page_content) / original_length if original_length else 0.0,
    }
    processing_log.append(log_entry)

Validate Cleaning Effect
def validate_cleaning(documents):
    """Validate cleaning results"""
    issues = []
    for doc in documents:
        if re.search(r'<[^>]+>', doc.page_content):
            issues.append(f"HTML residue: {doc.metadata.get('source')}")
        if len(doc.page_content) > 50000:
            issues.append(f"Content too long: {doc.metadata.get('source')}")
        chinese_chars = len(re.findall(r'[\u4e00-\u9fa5]', doc.page_content))
        total = len(doc.page_content)
        if total and chinese_chars / total < 0.1:
            issues.append(f"Low Chinese ratio: {doc.metadata.get('source')}")
    return issues

Reflection Questions
When cleaning medical records that contain patient names and ID numbers, what additional privacy‑preserving steps should be taken beyond noise removal?
How would you balance thorough cleaning against processing cost and latency in a production RAG pipeline?
For OCR‑generated PDFs that confuse the letter “l” with the digit “1” or “O” with “0”, what strategy would you adopt?
AI Architect Hub
Discussing AI and architecture: a ten-year veteran of major tech companies, now transitioning to AI and continuing the journey.