How to Build a Python Script that Detects and Removes Duplicate Files

This article walks you through creating a Python automation script that scans a given folder for duplicate files using the os, glob, and filecmp modules, compares files, and safely deletes the redundant copies while handling edge cases.

Python Crawling & Data Mining

Introduction

In this tutorial we demonstrate a system‑level automation case: given a folder, use Python to check for duplicate files and delete any duplicates found.

Key Modules

os – comprehensive file system operations
glob – pattern‑based file discovery
filecmp – compare two files for equality

Logic Overview

Collect every file under the target directory, compare the files pairwise, and when two files match, delete the second file of the pair and keep the first.

Using filecmp

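The function filecmp.cmp(f1, f2, shallow=True) returns True if the two files seem equal. With the default shallow=True, files whose os.stat() signatures (file type, size, and modification time) match are treated as equal without being read; the contents are compared only when the signatures differ. With shallow=False the contents are always compared, which is the safer choice before deleting anything.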

import filecmp

# Assume x and y are paths to two identical files
print(filecmp.cmp(x, y, shallow=False))
# True
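The difference between the two modes can be shown with a small experiment. The sketch below is illustrative only: make_file and compare_modes are hypothetical helper names, and the pinned modification time is an arbitrary value chosen so that both files share the same os.stat() signature.

```python
import filecmp
import os
import tempfile


def make_file(folder, name, data, mtime=1_000_000_000):
    """Write data to folder/name and pin its modification time."""
    path = os.path.join(folder, name)
    with open(path, "wb") as fh:
        fh.write(data)
    # Give every file the same timestamp so only size and content can differ.
    os.utime(path, (mtime, mtime))
    return path


def compare_modes(data_a, data_b):
    """Return (shallow_result, deep_result) for two temporary files."""
    with tempfile.TemporaryDirectory() as folder:
        a = make_file(folder, "a.bin", data_a)
        b = make_file(folder, "b.bin", data_b)
        return filecmp.cmp(a, b), filecmp.cmp(a, b, shallow=False)
```

Calling compare_modes(b"aaaa", b"bbbb") should return (True, False): the two files have the same size and timestamp, so the shallow check reports a match even though the contents differ. That is the case to avoid when the result decides whether a file is deleted.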

Full Implementation

Import the required libraries and set the target directory:

import os
import glob
import filecmp

dir_path = r'C:\xxxx'

Collect absolute paths of all files using glob with the recursive flag:

file_lst = []
for i in glob.glob(dir_path + '/**/*', recursive=True):
    if os.path.isfile(i):
        file_lst.append(i)
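One caveat with the glob approach: by default the '**/*' pattern does not match names that start with a dot, so hidden files are skipped. If those should be checked too, an os.walk based collector is one possible alternative. This is a sketch rather than part of the original script; list_files is a hypothetical name, and skipping symbolic links is an assumption made here so that a link and its target are not treated as duplicates.

```python
import os


def list_files(root):
    """Return sorted absolute paths of regular files under root, dotfiles included."""
    paths = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            # Keep regular files only; symbolic links are skipped (assumption).
            if os.path.isfile(path) and not os.path.islink(path):
                paths.append(os.path.abspath(path))
    # Sorting makes it predictable which copy counts as the "first" one.
    return sorted(paths)
```

Sorting the result also makes the later deduplication step deterministic, because the copy that is kept is always the one that sorts first.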

Compare each pair of files and delete duplicates, guarding against missing files after a prior deletion:

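for x in file_lst:
    for y in file_lst:
        # Skip self-comparison and files already removed in an earlier pass
        if x != y and os.path.exists(x) and os.path.exists(y):
            # shallow=False forces a content comparison before deleting
            if filecmp.cmp(x, y, shallow=False):
                os.remove(y)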

The script is a simple way to deduplicate a folder in one run, although comparing every pair of files means the running time grows quadratically with the number of files.
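The nested loop above can also be wrapped in functions that separate detection from deletion. The following is a minimal sketch under stated assumptions, not the original author's code: find_duplicates and remove_duplicates are hypothetical names, files are grouped by size first (a swapped-in optimization, since files of different sizes cannot be identical), and a dry_run flag is added so the list of candidates can be reviewed before anything is removed.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicates(paths):
    """Return the paths whose contents match an earlier path in the list."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for same_size in by_size.values():
        kept = []  # one representative per distinct content
        for path in same_size:
            # Compare contents only against files that are being kept.
            if any(filecmp.cmp(k, path, shallow=False) for k in kept):
                duplicates.append(path)
            else:
                kept.append(path)
    return duplicates


def remove_duplicates(paths, dry_run=True):
    """Delete duplicates unless dry_run is set; return the affected paths."""
    duplicates = find_duplicates(paths)
    if not dry_run:
        for path in duplicates:
            os.remove(path)
    return duplicates
```

With the list built earlier, remove_duplicates(file_lst) only reports what would be deleted, and remove_duplicates(file_lst, dry_run=False) performs the deletion. The first file of each group of identical files, in list order, is the one that is kept.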

Conclusion

Automating this routine with Python removes a repetitive manual file‑management task and shows how useful Python can be for office automation.

