Fundamentals 7 min read

Build a Powerful Python Text Replacement Tool for Data Cleaning

This guide walks you through creating a versatile Python text replacement utility—covering simple string swaps, regex-based changes, dictionary mappings, batch file processing, sensitive‑word filtering, and HTML tag cleaning—to streamline everyday data processing and text‑cleaning workflows.

Test Development Learning Exchange

May 23, 2025

Build a Powerful Python Text Replacement Tool for Data Cleaning

Introduction

Text replacement is a common step in data cleaning and preprocessing. This guide provides a reusable Python utility that supports simple string swaps, regular‑expression replacements, dictionary‑based bulk mapping, case‑insensitive word substitution, batch file processing, sensitive‑word filtering, and HTML tag removal.

Project Structure

The core of the tool is the replacer.py module, which defines a Replacer class. The class can be instantiated with a text string, a pandas.DataFrame, or both, and offers a set of methods for different replacement scenarios.

Basic Replacement Functions

import re
import pandas as pd
from bs4 import BeautifulSoup

class Replacer:
    def __init__(self, text=None, dataframe=None):
        self.text = text
        self.dataframe = dataframe

    def replace_string(self, old, new, count=0):
        """Simple literal replacement"""
        return self.text.replace(old, new, count)

    def replace_regex(self, pattern, repl, flags=0):
        """Regular‑expression replacement"""
        return re.sub(pattern, repl, self.text, flags=flags)

    def replace_dict(self, mapping):
        """Replace multiple keywords using a dictionary. Keys may be strings or compiled regex patterns."""
        for old, new in mapping.items():
            if isinstance(old, str):
                self.text = self.text.replace(old, new)
            else:
                self.text = re.sub(old, new, self.text)
        return self.text

    def replace_dataframe(self, column, mapping):
        """Apply a mapping to a specific DataFrame column."""
        if self.dataframe is None:
            raise ValueError("DataFrame not provided")
        self.dataframe[column] = self.dataframe[column].replace(mapping)
        return self.dataframe

    def replace_words_case_insensitive(self, mapping):
        """Case‑insensitive word replacement based on a lower‑cased dictionary."""
        words = self.text.split()
        replaced = []
        for word in words:
            lower = word.lower()
            replaced.append(mapping.get(lower, word))
        return " ".join(replaced)

    def batch_replace_files(self, input_dir, output_dir, replacements):
        """Replace text in all *.txt files under <code>input_dir</code> and write results to <code>output_dir</code>.
        
        Parameters
        ----------
        input_dir : str
            Path to the directory containing source text files.
        output_dir : str
            Destination directory; created if it does not exist.
        replacements : dict
            Mapping of old substrings to new substrings.
        """
        import os
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        for filename in os.listdir(input_dir):
            if filename.endswith('.txt'):
                src_path = os.path.join(input_dir, filename)
                dst_path = os.path.join(output_dir, filename)
                with open(src_path, 'r', encoding='utf-8') as f:
                    content = f.read()
                for old, new in replacements.items():
                    content = content.replace(old, new)
                with open(dst_path, 'w', encoding='utf-8') as f:
                    f.write(content)

    def filter_sensitive_words(self, sensitive_words, replacement="*"):
        """Mask each occurrence of words in <code>sensitive_words</code> with <code>replacement</code> characters.
        
        Example: "bad" -> "***" when replacement="*".
        """
        pattern = '|'.join([re.escape(w) for w in sensitive_words])
        return re.sub(pattern, lambda m: replacement * len(m.group()), self.text)

    def clean_html_tags(self):
        """Strip HTML markup using BeautifulSoup and return plain text."""
        soup = BeautifulSoup(self.text, "html.parser")
        return soup.get_text()

# Example usage
if __name__ == "__main__":
    sample = "Hello World! This is a test. Hello again."
    r = Replacer(text=sample)
    print(r.replace_string("Hello", "Hi"))
    print(r.replace_regex(r"(hello|world)", "***", flags=re.IGNORECASE))
    print(r.replace_dict({"test": "demo", "Hello": "Greeting"}))
    print(r.replace_words_case_insensitive({"hello": "Hi", "world": "Earth"}))
    r.batch_replace_files("input_texts/", "output_texts/", {"old_word": "new_word"})
    print(r.filter_sensitive_words(["badword", "anotherbadword"]))
    r.text = "Hello <strong>World</strong>!"
    print(r.clean_html_tags())

Advanced Features

Batch file replacement : Replace multiple terms across all .txt files in a directory with a single dictionary call.

Sensitive‑word filtering : Hide prohibited words by substituting each character with a placeholder (default *), useful for moderation pipelines.

HTML tag cleaning : Remove any HTML markup from scraped content, returning clean, readable text.

Installation

Required third‑party packages are pandas and beautifulsoup4:

pip install pandas beautifulsoup4

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Python regex text processing sensitive words batch replace HTML cleaning

Written by

Test Development Learning Exchange

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.