Build a Powerful Python Text Replacement Tool for Data Cleaning
This guide walks you through creating a versatile Python text replacement utility. It covers simple string swaps, regex-based changes, dictionary mappings, batch file processing, sensitive‑word filtering, and HTML tag cleaning, all aimed at streamlining everyday data processing and text‑cleaning workflows.
Introduction
Text replacement is a common step in data cleaning and preprocessing. This guide provides a reusable Python utility that supports simple string swaps, regular‑expression replacements, dictionary‑based bulk mapping, case‑insensitive word substitution, batch file processing, sensitive‑word filtering, and HTML tag removal.
Project Structure
The core of the tool is the replacer.py module, which defines a Replacer class. The class can be instantiated with a text string, a pandas.DataFrame, or both, and offers a set of methods for different replacement scenarios.
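The DataFrame side of the class comes down to a column-level mapping. As a minimal sketch of that behavior, the snippet below calls pandas `Series.replace` directly, which is the call `replace_dataframe` relies on. The column name `status` and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column name "status" is an assumption.
df = pd.DataFrame({"status": ["Y", "N", "y", "unknown"]})

# Series.replace with a dict swaps cell values that are equal to a key.
mapping = {"Y": "yes", "N": "no", "y": "yes"}
df["status"] = df["status"].replace(mapping)

print(df["status"].tolist())  # ['yes', 'no', 'yes', 'unknown']
```

Matching here is on whole cell values, so "unknown" is left alone. Editing substrings inside a column would need `regex=True` or `Series.str.replace` instead.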
Basic Replacement Functions
import os
import re

import pandas as pd
from bs4 import BeautifulSoup


class Replacer:
    def __init__(self, text=None, dataframe=None):
        self.text = text
        self.dataframe = dataframe

    def replace_string(self, old, new, count=-1):
        """Simple literal replacement. A negative count replaces every occurrence."""
        # The default must not be 0: str.replace(old, new, 0) replaces nothing.
        return self.text.replace(old, new, count)

    def replace_regex(self, pattern, repl, flags=0):
        """Regular-expression replacement."""
        return re.sub(pattern, repl, self.text, flags=flags)

    def replace_dict(self, mapping):
        """Replace multiple keywords using a dictionary. Keys may be strings or compiled regex patterns."""
        # Work on a local copy so self.text stays unchanged, like the other methods.
        # Entries are applied in order, so a later key can rewrite an earlier result.
        text = self.text
        for old, new in mapping.items():
            if isinstance(old, str):
                text = text.replace(old, new)
            else:
                text = re.sub(old, new, text)
        return text

    def replace_dataframe(self, column, mapping):
        """Apply a mapping to a specific DataFrame column."""
        if self.dataframe is None:
            raise ValueError("DataFrame not provided")
        self.dataframe[column] = self.dataframe[column].replace(mapping)
        return self.dataframe

    def replace_words_case_insensitive(self, mapping):
        """Case-insensitive word replacement based on a lower-cased dictionary."""
        # Tokens are split on whitespace, so "World!" does not match the key "world".
        words = self.text.split()
        replaced = []
        for word in words:
            lower = word.lower()
            replaced.append(mapping.get(lower, word))
        return " ".join(replaced)

    def batch_replace_files(self, input_dir, output_dir, replacements):
        """Replace text in all *.txt files under `input_dir` and write results to `output_dir`.

        Parameters
        ----------
        input_dir : str
            Path to the directory containing source text files.
        output_dir : str
            Destination directory; created if it does not exist.
        replacements : dict
            Mapping of old substrings to new substrings.
        """
        os.makedirs(output_dir, exist_ok=True)
        for filename in os.listdir(input_dir):
            if filename.endswith('.txt'):
                src_path = os.path.join(input_dir, filename)
                dst_path = os.path.join(output_dir, filename)
                with open(src_path, 'r', encoding='utf-8') as f:
                    content = f.read()
                for old, new in replacements.items():
                    content = content.replace(old, new)
                with open(dst_path, 'w', encoding='utf-8') as f:
                    f.write(content)

    def filter_sensitive_words(self, sensitive_words, replacement="*"):
        """Mask each occurrence of words in `sensitive_words` with `replacement` characters.

        Example: "bad" -> "***" when replacement="*".
        """
        if not sensitive_words:
            return self.text
        # Longest first, so "badword" is masked whole even if "bad" is also listed.
        words = sorted(sensitive_words, key=len, reverse=True)
        pattern = '|'.join(re.escape(w) for w in words)
        return re.sub(pattern, lambda m: replacement * len(m.group()), self.text)

    def clean_html_tags(self):
        """Strip HTML markup using BeautifulSoup and return plain text."""
        soup = BeautifulSoup(self.text, "html.parser")
        return soup.get_text()


# Example usage
if __name__ == "__main__":
    sample = "Hello World! This is a test. Hello again."
    r = Replacer(text=sample)
    print(r.replace_string("Hello", "Hi"))
    print(r.replace_regex(r"(hello|world)", "***", flags=re.IGNORECASE))
    print(r.replace_dict({"test": "demo", "Hello": "Greeting"}))
    print(r.replace_words_case_insensitive({"hello": "Hi", "world": "Earth"}))
    # The batch call is skipped unless the input directory exists.
    if os.path.isdir("input_texts/"):
        r.batch_replace_files("input_texts/", "output_texts/", {"old_word": "new_word"})
    print(r.filter_sensitive_words(["badword", "anotherbadword"]))
    r.text = "Hello <strong>World</strong>!"
    print(r.clean_html_tags())

Advanced Features
Batch file replacement: Replace multiple terms across all .txt files in a directory with a single dictionary call.
Sensitive‑word filtering: Hide prohibited words by substituting each character with a placeholder (default *), useful for moderation pipelines.
HTML tag cleaning: Remove any HTML markup from scraped content, returning clean, readable text.
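The first two features combine naturally: mask a word list across every .txt file in a directory. The sketch below is one way to do that with the standard library only. The helper names `mask_words` and `mask_directory` are hypothetical and not part of `replacer.py`. Unlike the class method, it matches whole words case-insensitively, which is an assumption about what a moderation pipeline would want.

```python
import re
import tempfile
from pathlib import Path


def mask_words(text, words, placeholder="*"):
    """Mask whole-word, case-insensitive matches of `words` in `text`."""
    if not words:
        return text
    # Longest first, so a multi-word phrase wins over a shorter entry it starts with.
    ordered = sorted(words, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: placeholder * len(m.group()), text)


def mask_directory(input_dir, output_dir, words):
    """Write a masked copy of every .txt file in input_dir to output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(input_dir).glob("*.txt")):
        dst = out / src.name
        content = src.read_text(encoding="utf-8")
        dst.write_text(mask_words(content, words), encoding="utf-8")
        written.append(dst)
    return written


# Demo in a throwaway directory so nothing on disk is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "in"
    src_dir.mkdir()
    (src_dir / "note.txt").write_text("A Bad idea, but not badly done.", encoding="utf-8")
    (src_dir / "skip.md").write_text("bad", encoding="utf-8")  # ignored: not .txt

    results = mask_directory(src_dir, Path(tmp) / "out", ["bad"])
    masked = results[0].read_text(encoding="utf-8")
    print(masked)  # A *** idea, but not badly done.
```

The `\b` anchors keep "badly" intact, at the cost of missing entries that begin or end with punctuation. Whether substring or whole-word matching is right depends on the word list.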
Installation
Required third‑party packages are pandas and beautifulsoup4:
pip install pandas beautifulsoup4
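Where installing beautifulsoup4 is not an option, tag stripping can be approximated with the standard library. The sketch below swaps in `html.parser.HTMLParser` in place of BeautifulSoup. The names `TextExtractor` and `strip_tags` are hypothetical, and the output may differ from `get_text()` on edge cases.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and ignore tags (hypothetical helper, not part of replacer.py)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return "".join(parser.parts)


print(strip_tags("Hello <strong>World</strong>!"))  # Hello World!
```

The contents of script and style elements also reach `handle_data`, so they appear in the result unless filtered out separately.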
