Automatically Delete Duplicate Files with Python: Size & MD5 Check Script
Learn how to identify and delete duplicate files automatically in Python by comparing file sizes first and MD5 hashes second. The concise script uses pathlib for path handling and hashlib for checksum calculation, giving a fast, reliable solution for everyday file management tasks.
Introduction
Hello, I am a Python enthusiast sharing a practical solution to a common problem: copies of the same file accumulate automatically under the same or similar names, and once there are dozens or hundreds of them it becomes difficult to tell which file is which.
Implementation
The approach uses two layers of judgment. First, it compares file sizes: a file whose size matches no other file cannot be a duplicate, so it is kept without further checks. Second, for files with identical sizes, it computes the MD5 hash; if the hash matches that of a file already seen, the file is considered a duplicate and is deleted, leaving one copy in place.
The following script implements this logic, thanks to the contributor "Yuliang".
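# -*- coding:utf-8 -*-
# @Time: 2022-09-21 13:20
# @Author: Yuliang
# Approach: two layers of checks:
# 1. Compare file sizes first; a file with a unique size is not a duplicate and is kept.
# 2. For files of equal size, compare MD5 hashes; a matching hash means a duplicate, which is deleted.
from pathlib import Path
import hashlib


def getmd5(filename):
    # Takes a file path and returns the file's MD5 hex digest
    with open(filename, 'rb') as f:
        data = f.read()
    file_md5 = hashlib.new("md5", data).hexdigest()
    return file_md5


def main():
    path = r"E:\data"
    all_size = {}
    total_file = 0
    total_delete = 0
    # Collect the file names in the folder; in ascending order, the copy with the newest date/time would be kept
    all_files = Path(path).glob('*.*')
    # Sort in descending order so that the copy with the shortest name (the oldest date/time) is kept
    all_files = sorted(all_files, reverse=True)
    # Walk through every file in the folder
    for file in all_files:
        # '*.*' also matches directories whose names contain a dot; skip anything that is not a regular file
        if not file.is_file():
            continue
        # File size in bytes, used as the dictionary key
        size = file.stat().st_size
        # name_and_md5 holds the file path and its MD5 value, used as the dictionary value
        name_and_md5 = [file, '']
        # all_size maps size -> [first file with that size, its MD5, further MD5s...]
        # MD5 is only computed for files whose size has already been seen
        # A size not yet in all_size means the file is different from all others so far, so it is kept
        if size in all_size:
            # Compute the MD5 of the current file
            new_md5 = getmd5(file)
            # Lazily compute the MD5 of the first file with this size
            if all_size[size][1] == '':
                all_size[size][1] = getmd5(all_size[size][0])
            # A known MD5 means the file is a duplicate and is deleted; otherwise record the new MD5
            if new_md5 in all_size[size]:
                file.unlink()
                total_delete += 1
            else:
                all_size[size].append(new_md5)
        else:
            all_size[size] = name_and_md5
        total_file += 1
    print(f'Total files: {total_file}')
    print(f'Deleted: {total_delete}')


if __name__ == '__main__':
    main()

Running the script on a sample folder processes the files in a few seconds, demonstrating its speed and effectiveness.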
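Because the script above deletes files as soon as it finds them, it can be useful to preview the result first. The following is a minimal sketch of the same two-layer check that only reports duplicates and deletes nothing. The function names `file_md5` and `find_duplicates` are hypothetical and not part of the original script, and the sketch uses `iterdir()` with an `is_file()` filter instead of `glob('*.*')`, which is an assumption made here so that files without an extension are also covered.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_md5(path):
    """Return the MD5 hex digest of the file at `path` (hypothetical helper)."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def find_duplicates(folder):
    """Return {kept_path: [duplicate_paths]} without deleting anything."""
    # Layer 1: bucket regular files by size, in descending name order
    # (the same ordering the original script uses).
    by_size = defaultdict(list)
    for p in sorted(Path(folder).iterdir(), reverse=True):
        if p.is_file():
            by_size[p.stat().st_size].append(p)

    duplicates = {}
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot be a duplicate, so no hashing is needed
        # Layer 2: within one size bucket, the first file seen per hash is kept.
        first_seen = {}
        for p in same_size:
            digest = file_md5(p)
            if digest in first_seen:
                duplicates.setdefault(first_seen[digest], []).append(p)
            else:
                first_seen[digest] = p
    return duplicates
```

Calling `find_duplicates(r"E:\data")` would return a mapping from each kept file to the copies that the deletion pass would remove, which can be reviewed before anything is unlinked.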
Key Points
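all_files = Path(path).glob('*.*'): Retrieves the paths in the directory whose names contain a dot. It does not descend into subdirectories, and files without an extension are not matched.
size = file.stat().st_size: Obtains the file size in bytes.
file.unlink(): Deletes the identified duplicate file.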
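To see how these three calls behave, here is a small sketch that exercises them inside a temporary directory, so nothing on disk is affected. The function name `demo_pathlib_calls` and the sample file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def demo_pathlib_calls():
    """Exercise glob, stat and unlink inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "report.txt").write_bytes(b"hello")
        (folder / "README").write_bytes(b"no extension")
        (folder / "sub.dir").mkdir()

        # '*.*' matches names that contain a dot, directories included,
        # and skips names without a dot such as README.
        matched = sorted(p.name for p in folder.glob("*.*"))

        # st_size is the file size in bytes.
        size = (folder / "report.txt").stat().st_size

        # unlink() removes the file directly; it is not moved to a recycle bin.
        (folder / "report.txt").unlink()
        still_there = (folder / "report.txt").exists()

        return matched, size, still_there
```

The directory `sub.dir` appearing in the match list is the reason a deduplication loop built on `glob('*.*')` should check `is_file()` before hashing an entry.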
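The script uses MD5 hashing, which is highly sensitive to content: changing even a single character produces a completely different hash, so files that differ only minimally are not mistaken for duplicates. MD5 is not resistant to deliberately constructed collisions, but that does not matter when the goal is to catch accidental copies.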
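The sensitivity of the hash can be checked directly with hashlib. The sketch below compares two byte strings that differ in one character, and also shows one way to hash a file in chunks so that large files do not have to be read into memory at once. The helper names `md5_of_bytes` and `md5_of_file` and the 1 MiB chunk size are assumptions for illustration; the original script reads each file in a single call.

```python
import hashlib


def md5_of_bytes(data):
    """Return the MD5 hex digest of an in-memory byte string."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


original = b"quarterly report, final version"
edited = b"quarterly report, final versioN"  # one character changed

# The two digests are compared rather than printed in full.
digests_differ = md5_of_bytes(original) != md5_of_bytes(edited)
print(digests_differ)
```

Feeding the data to `update()` piece by piece yields the same digest as hashing the whole content at once, so `md5_of_file` could stand in for a read-everything helper without changing which files count as duplicates.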
Conclusion
This Python tool provides a fast, automated way to clean up duplicate files, and it can be packaged as a small utility or scheduled to run periodically, greatly simplifying file management tasks.
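As one possible way to package the idea as a small utility, the sketch below wraps a simplified deduplication pass in a command-line interface built with argparse. Everything here is an assumption rather than part of the original script: the names `remove_duplicates`, `build_parser` and `main`, the `--delete` flag, and the choice to default to a dry run. For brevity it keys each file on its size and MD5 together instead of hashing only files whose sizes collide.

```python
import argparse
import hashlib
from pathlib import Path


def remove_duplicates(folder, dry_run=True):
    """Return the duplicate files in `folder`; delete them unless dry_run is True."""
    seen = {}      # (size, md5) -> first path with that content, which is kept
    removed = []
    for p in sorted(Path(folder).iterdir(), reverse=True):
        if not p.is_file():
            continue
        key = (p.stat().st_size, hashlib.md5(p.read_bytes()).hexdigest())
        if key in seen:
            removed.append(p)
            if not dry_run:
                p.unlink()
        else:
            seen[key] = p
    return removed


def build_parser():
    parser = argparse.ArgumentParser(
        description="Remove duplicate files from a folder.")
    parser.add_argument("folder", type=Path,
                        help="folder to scan (not recursive)")
    parser.add_argument("--delete", action="store_true",
                        help="actually delete duplicates; default is a dry run")
    return parser


def main(argv=None):
    """Parse arguments, run the pass, print each duplicate, return the count."""
    args = build_parser().parse_args(argv)
    removed = remove_duplicates(args.folder, dry_run=not args.delete)
    for p in removed:
        print(p)
    return len(removed)
```

Saved under a hypothetical name such as dedupe.py with a call to `main()` behind the usual `if __name__ == '__main__':` guard, this could be run by hand or from a system scheduler. Leaving out `--delete` on the first run lists what would be removed without touching any file.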
