
How to Build a Python Script that Detects and Removes Duplicate Files

This tutorial walks through creating a Python automation script that scans a given directory, uses the os, glob, and filecmp modules to identify duplicate files, and safely deletes the redundant copies while handling edge cases such as missing files.

Python Crawling & Data Mining

Introduction

Hello everyone, it's time for the Python office automation series. This article presents a system‑level automation case: given a folder, use Python to check for duplicate files and delete them.

Key Modules

os – comprehensive usage
glob – comprehensive usage
filecmp – compare two files

Step Analysis

The program traverses all files in the target folder, compares them pairwise, and deletes the later file of any pair found to be identical.

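The crucial question is how to decide whether two files are identical. The filecmp module provides filecmp.cmp(f1, f2, shallow=True), which returns True if the files are considered equal and False otherwise. With shallow=True (the default), files whose os.stat() signatures (file type, size, and modification time) match are treated as equal without being read; if the signatures differ, or if shallow=False is passed, the file contents are compared. Because a matching signature does not guarantee matching contents, shallow=False is the safer choice when the result decides whether a file gets deleted.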

# Assume x and y are identical files
print(filecmp.cmp(x, y))
# True
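
The difference between the two modes is easiest to see with files that share a size and modification time but not their contents. The following is a minimal sketch: the file names and the timestamp are made up for illustration, and everything happens in a temporary directory.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.txt")
    b = os.path.join(tmp, "b.txt")
    with open(a, "w") as f:
        f.write("hello")
    with open(b, "w") as f:
        f.write("world")  # same length as "hello", different content

    # Force identical modification times so the os.stat() signatures match
    stamp = 1_000_000_000
    os.utime(a, (stamp, stamp))
    os.utime(b, (stamp, stamp))

    shallow_result = filecmp.cmp(a, b)                 # default: signature only
    deep_result = filecmp.cmp(a, b, shallow=False)     # reads both files

print(shallow_result, deep_result)
```

Here the shallow call reports a match while the deep call does not, which is why a deletion script should probably not rely on the default.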

Python Implementation

Import libraries and set the target directory:

import os
import glob
import filecmp

dir_path = r'C:\xxxx'

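Collect the paths of everything under the folder using glob with recursive=True. The results are absolute paths because dir_path is absolute, and the pattern matches subdirectories as well as files: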

for file in glob.glob(dir_path + '/**/*', recursive=True):
    pass

Build a list of file paths:

file_lst = []
for i in glob.glob(dir_path + '/**/*', recursive=True):
    if os.path.isfile(i):
        file_lst.append(i)
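
To see why the os.path.isfile check is needed, the sketch below runs the same '**/*' pattern against a small temporary tree. The file and folder names are invented for the example.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny tree: one file at the top, one inside a subfolder
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("top.txt", os.path.join("sub", "nested.txt")):
        with open(os.path.join(root, rel), "w") as f:
            f.write("data")

    matches = glob.glob(os.path.join(root, "**", "*"), recursive=True)

    # The recursive pattern returns the subfolder itself as well as the files
    files = sorted(os.path.relpath(p, root) for p in matches if os.path.isfile(p))
    dirs = sorted(os.path.relpath(p, root) for p in matches if os.path.isdir(p))

print("files:", files)
print("dirs:", dirs)
```

Without the filter, the subfolder path would end up in file_lst and later be passed to filecmp.cmp alongside real files.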

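Compare each pair with filecmp.cmp and delete duplicates, checking existence to avoid errors on files that were already removed:

for i, x in enumerate(file_lst):
    # Only look at files after x, so each pair is compared once
    for y in file_lst[i + 1:]:
        # Skip paths deleted earlier in the loop
        if os.path.exists(x) and os.path.exists(y):
            # shallow=False compares contents, not just os.stat() metadata
            if filecmp.cmp(x, y, shallow=False):
                os.remove(y)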
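
Comparing every pair grows quadratically with the number of files. One common way to reduce the work, which is not part of the original script, is to group files by size first, since files of different sizes cannot be identical. The sketch below assumes a hypothetical helper named find_duplicates that only reports duplicates and deletes nothing; the demo files are invented.

```python
import filecmp
import os
import tempfile
from collections import defaultdict


def find_duplicates(paths):
    """Return (kept, duplicate) pairs; only same-size files are compared."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    pairs = []
    for group in by_size.values():
        originals = []  # first file seen for each distinct content
        for path in group:
            for kept in originals:
                if filecmp.cmp(kept, path, shallow=False):
                    pairs.append((kept, path))
                    break
            else:
                # No earlier file matched, so this content is new
                originals.append(path)
    return pairs


# Small demonstration on throwaway files
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in (("a.txt", "same"), ("b.txt", "same"), ("c.txt", "diff")):
        p = os.path.join(tmp, name)
        with open(p, "w") as f:
            f.write(text)
        paths.append(p)
    found = [(os.path.basename(k), os.path.basename(d))
             for k, d in find_duplicates(paths)]

print(found)
```

Returning the pairs instead of deleting right away also gives a chance to review the list before anything is removed.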

The complete script combines the above steps.
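
The combined script is not reproduced here, so the following is one way the pieces might fit together, not necessarily the author's exact version. The function name remove_duplicates and the dry_run flag are assumptions added for illustration, and the comparison passes shallow=False so that file contents are actually read before anything is deleted.

```python
import filecmp
import glob
import os


def remove_duplicates(dir_path, dry_run=True):
    """Delete later copies of identical files under dir_path.

    With dry_run=True (hypothetical safety flag) nothing is deleted;
    the paths that would be removed are only returned.
    """
    # glob.escape keeps special characters in the folder name literal
    pattern = os.path.join(glob.escape(dir_path), "**", "*")
    file_lst = sorted(p for p in glob.glob(pattern, recursive=True)
                      if os.path.isfile(p))

    removed = []
    gone = set()  # paths already marked as duplicates
    for i, x in enumerate(file_lst):
        if x in gone:
            continue
        for y in file_lst[i + 1:]:
            if y in gone:
                continue
            if filecmp.cmp(x, y, shallow=False):
                gone.add(y)
                removed.append(y)
                if not dry_run:
                    os.remove(y)
    return removed


if __name__ == "__main__":
    # Placeholder path from the article; replace it with a real folder
    for path in remove_duplicates(r"C:\xxxx"):
        print("duplicate:", path)
```

Running it once with the default dry_run=True lists the candidates; calling remove_duplicates(dir_path, dry_run=False) afterwards performs the deletion.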

Conclusion

This simple duplicate‑file remover demonstrates the power of Python for office automation and can be combined with other file‑organizing scripts.
