Quickly Locate Duplicate Files on Linux with Find and dupeGuru
This guide covers three approaches to finding duplicate files on Linux: a one‑line find pipeline, the cross‑platform dupeGuru utility, and an expanded version of the pipeline with a step‑by‑step explanation of each command.
Method 1: Using the find command
The find utility can be combined with other core Linux commands (such as xargs) to produce a powerful one‑liner that lists duplicate files by comparing their MD5 hashes.
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | \
xargs -I{} -n1 find -type f -size {}c -print0 | \
xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
find -not -empty -type f -printf "%s\n" enumerates all non‑empty regular files and prints their sizes, one per line. sort -rn sorts the sizes in descending numeric order. uniq -d keeps only sizes that appear more than once, i.e., potential duplicates. uniq -w32 --all-repeated=separate compares the first 32 characters of each line (the full length of an MD5 hash in hexadecimal) and groups identical entries.
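To see what the one‑liner reports, it can help to run it against a throwaway directory first. The following is a minimal sketch that assumes GNU findutils and coreutils; the directory and file names are made up for illustration, and -n1 is left out of the first xargs call on the assumption that -I already runs one command per input line.

```shell
# Scratch directory: two identical files, one file of the same size with
# different content, and one file of a different size (all names hypothetical).
demo_dir=$(mktemp -d)
printf 'hello world\n' > "$demo_dir/a.txt"
printf 'hello world\n' > "$demo_dir/b.txt"   # byte-for-byte copy of a.txt
printf 'HELLO WORLD\n' > "$demo_dir/c.txt"   # same size, different bytes
printf 'unique\n'      > "$demo_dir/d.txt"   # different size

# Same stages as the one-liner, run from inside the scratch directory.
dupes=$(cd "$demo_dir" &&
  find . -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |
  xargs -I{} find . -type f -size {}c -print0 |
  xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate)

# Prints one "hash  ./path" line per duplicate file.
printf '%s\n' "$dupes"
rm -rf "$demo_dir"
```

Here c.txt passes the size filter but is dropped at the hash stage, which is why the pipeline checks sizes first (cheap) and hashes second (expensive).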
Method 2: Using the dupeGuru tool
dupeGuru is a cross‑platform application (Linux, Windows, macOS) that can locate duplicate files based on size, MD5, or filename. On Ubuntu you can install it via a PPA:
sudo add-apt-repository ppa:hsoft/ppa
sudo apt-get update
sudo apt-get install dupeguru*
Method 3: Detailed find‑pipeline explanation
When you need to script duplicate‑file detection, the following expanded pipeline shows each stage and its purpose.
find -not -empty -type f -printf "%s\n" |
sort -rn |
uniq -d |
xargs -I{} -n1 find -type f -size {}c -print0 |
xargs -0 md5sum |
sort |
uniq -w32 --all-repeated=separate |
cut -b 36- > result.txt
Explanation of each command:
find -not -empty -type f -printf "%s\n" outputs the size (in bytes) of every non‑empty regular file, one per line.
sort -rn sorts those sizes numerically in reverse order.
uniq -d filters to sizes that occur more than once.
xargs -I{} -n1 find -type f -size {}c -print0 converts each repeated size into a separate find call that lists files of that exact size, using a null terminator to safely handle spaces in file names.
xargs -0 md5sum computes the MD5 hash for each listed file.
sort orders the output by hash so that identical hashes end up on adjacent lines, which uniq requires.
uniq -w32 --all-repeated=separate groups lines with identical first 32 characters (the full MD5 hash) and separates each group with a blank line.
cut -b 36- drops the first 35 bytes of each line: the 32‑character hash, the two separator characters that md5sum prints, and the leading "." of the path. Only the file path, starting with "/" relative to the search directory, is written to result.txt.
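For scripting, the pipeline can be wrapped in a small shell function. This is only a sketch under stated assumptions: find_dupes is a hypothetical helper name, GNU findutils and coreutils are assumed, the -r flag (GNU xargs, skip the command when there is no input) is an addition not present in the pipeline above, and -n1 is omitted on the assumption that -I already runs one command per input line.

```shell
# find_dupes DIR: print groups of duplicate files under DIR (default: current
# directory), one path per line, groups separated by a blank line.
# Hypothetical helper for illustration; not part of any package.
find_dupes() {
  dir=${1:-.}
  (
    cd "$dir" || exit 1
    find . -not -empty -type f -printf "%s\n" |   # size of every non-empty file
      sort -rn |                                  # sort sizes, largest first
      uniq -d |                                   # keep sizes seen more than once
      xargs -I{} find . -type f -size {}c -print0 |  # files with those sizes
      xargs -0 -r md5sum |                        # hash each candidate
      sort |                                      # bring equal hashes together
      uniq -w32 --all-repeated=separate |         # keep groups with equal hashes
      cut -b 36-                                  # strip hash, separator, and "."
  )
}
```

A call such as find_dupes ~/Downloads > result.txt would then write the grouped paths to a file. Running the pipeline in a subshell keeps the cd from affecting the caller's working directory.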
To make the result file Windows‑compatible (convert LF to CRLF), append a carriage return to the end of every line, for example with GNU sed: sed 's/$/\r/' result.txt > result-crlf.txt (the output file name is arbitrary).
This pipeline provides a concise, reproducible method for locating duplicate files across a directory tree without installing extra software.
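For the Windows‑compatible output mentioned above, one way to confirm that each line really ends in CRLF is to inspect the bytes. The sketch below performs the conversion with GNU sed (assumed here because it understands \r in the replacement text) and uses made‑up file names and sample paths.

```shell
# Sample result file with two LF-terminated lines (contents are illustrative).
work=$(mktemp -d)
printf '/a.txt\n/b.txt\n' > "$work/result.txt"

# Append a carriage return before each line feed (GNU sed).
sed 's/$/\r/' "$work/result.txt" > "$work/result-crlf.txt"

# od -c prints each byte; every line should now end with \r \n.
od -c "$work/result-crlf.txt"

# Count lines that end in a carriage return, and the total byte count.
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr\$" "$work/result-crlf.txt")
crlf_bytes=$(wc -c < "$work/result-crlf.txt")
rm -rf "$work"
```

The byte count grows by exactly one per line, which is a quick sanity check that no line was converted twice.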
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge: fundamentals, applications, tools, plus Git, databases, Raspberry Pi, and more.
