How to Prevent Duplicate File Uploads with Reliable Hash Checks in Java
This article explains why using only file name and size to detect duplicate uploads is unreliable, demonstrates how to compute reliable file checksums in Java, and shows through experiments that identical content yields identical hashes regardless of name or type, providing a robust deduplication solution.
When implementing a file upload feature, a common requirement is to prevent uploading files that have identical content. Relying solely on file name and size, as some developers do, is not reliable because different files can share those attributes.
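To make the failure mode concrete, here is a minimal sketch in which two uploads share a name and a size but differ in content. The class and method names are illustrative only and do not come from the original project. A name-and-size check treats the two payloads as duplicates, while their SHA-256 digests differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class; names are illustrative only.
final class NameSizeCollisionDemo {

    // Naive duplicate check: same file name and same byte length.
    static boolean sameNameAndSize(String nameA, byte[] a, String nameB, byte[] b) {
        return nameA.equals(nameB) && a.length == b.length;
    }

    // Content-based check: compare the SHA-256 digests of the two payloads.
    static boolean sameDigest(byte[] a, byte[] b) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digestA = md.digest(a); // digest() resets the instance for reuse
            byte[] digestB = md.digest(b);
            return MessageDigest.isEqual(digestA, digestB);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] first = "port: 8080".getBytes(StandardCharsets.UTF_8);
        byte[] second = "port: 9090".getBytes(StandardCharsets.UTF_8);
        // Same name, same length: the naive check reports a duplicate.
        System.out.println(sameNameAndSize("application.yml", first, "application.yml", second));
        // Different content: the digest check does not.
        System.out.println(sameDigest(first, second));
    }
}
```

Run as written, this prints `true` and then `false`: the naive check is fooled, the digest comparison is not.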
File Checksum Verification
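If two files have the same content, their checksums (hashes) are identical. Conversely, files with different content produce different checksums for all practical purposes when a collision-resistant algorithm such as SHA-256 is used. This property can be used to determine whether files are truly the same, independent of their names or extensions.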
Java Implementation of File Checksum
The following utility method extracts a file's checksum using a specified algorithm (e.g., MD5, SHA-1, SHA-256):
/**
 * Extracts a file's checksum.
 *
 * @param path      full file path
 * @param algorithm algorithm name, e.g., MD5, SHA-1, SHA-256
 * @return the checksum as a hexadecimal string
 * @throws NoSuchAlgorithmException if the requested algorithm is not available
 * @throws IOException              if the file cannot be read
 */
public static String extractChecksum(String path, String algorithm) throws NoSuchAlgorithmException, IOException {
    // Initialize digest based on algorithm name
    MessageDigest digest = MessageDigest.getInstance(algorithm);
    // Read all bytes of the file (this loads the whole file into memory)
    byte[] fileBytes = Files.readAllBytes(Paths.get(path));
    // Update digest
    digest.update(fileBytes);
    // Complete hash computation and return the value
    byte[] digested = digest.digest();
    // Convert to hexadecimal string
    return HexUtils.toHexString(digested);
}
Verifying Consistency When Content Remains Unchanged
Running the checksum extraction multiple times on the same file should always produce the same result:
String path = "C:\\Users\\s1\\IdeaProjects\\demo\\src\\main\\resources\\application.yml";
String checksum = extractChecksum(path, "SHA-1");
String expectedHash = "6bf4d6c101b4a7821226d3ec1f8d778a531bf265";
Assertions.assertEquals(expectedHash, checksum);
Changing the file name or extension (e.g., to application-dev.yml or application-dev.txt) does not affect the checksum as long as the content stays the same. Modifying the file content causes the checksum assertion to fail, confirming that the hash reflects content changes.
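The rename experiment can also be reproduced without depending on a particular local path. The following is a minimal sketch under a few assumptions: it works on a temporary directory, the file contents are made up for illustration, and a small hand-written hex loop stands in for the `HexUtils` helper used above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class; file names and contents are illustrative only.
final class RenameChecksumDemo {

    // SHA-1 of a file's content as a lowercase hex string.
    static String sha1Hex(Path file) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-1").digest(Files.readAllBytes(file));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns {checksum before rename, checksum after rename, checksum after edit}.
    static String[] run() {
        try {
            Path dir = Files.createTempDirectory("checksum-demo");
            Path original = dir.resolve("application.yml");
            Files.write(original, "server:\n  port: 8080\n".getBytes(StandardCharsets.UTF_8));
            String before = sha1Hex(original);

            // Rename and change the extension; the content is untouched.
            Path renamed = Files.move(original, dir.resolve("application-dev.txt"));
            String afterRename = sha1Hex(renamed);

            // Overwrite the content; the checksum should now differ.
            Files.write(renamed, "server:\n  port: 9090\n".getBytes(StandardCharsets.UTF_8));
            String afterEdit = sha1Hex(renamed);

            // Clean up the temporary file and directory.
            Files.delete(renamed);
            Files.delete(dir);
            return new String[] {before, afterRename, afterEdit};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] r = run();
        System.out.println("unchanged after rename: " + r[0].equals(r[1]));
        System.out.println("changed after edit: " + !r[0].equals(r[2]));
    }
}
```

Both printed values should be `true`: the rename leaves the checksum alone, and the edit changes it.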
File Copy Test
Copying a file to a different location with a different name and type, while keeping the content unchanged, yields identical checksums:
String path1 = "C:\\Users\\s1\\IdeaProjects\\demo\\src\\main\\resources\\application.yml";
String path2 = "C:\\Users\\s1\\IdeaProjects\\demo\\src\\main\\resources\\templates\\application-dev.txt";
String checksum1 = extractChecksum(path1, "SHA-1");
String checksum2 = extractChecksum(path2, "SHA-1");
String expectedHash = "6bf4d6c101b4a7821226d3ec1f8d778a531bf265";
Assertions.assertEquals(expectedHash, checksum1);
Assertions.assertEquals(expectedHash, checksum2);
If the content of either file is altered, the assertions fail, demonstrating that the hash is content-dependent.
Creating an Empty File
An empty file produces a fixed checksum value for a given algorithm. For SHA‑1, the checksum of an empty file is:
da39a3ee5e6b4b0d3255bfef95601890afd80709
Conclusion
Under the same algorithm, any newly created empty file has a constant checksum.
Any two files with identical content share the same checksum, regardless of path, name, or type.
The checksum changes whenever the file content changes.
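The empty-file observation is easy to check in isolation. As a minimal sketch (the class name is illustrative, and a hand-written hex loop again stands in for `HexUtils`), the following creates a fresh zero-byte temporary file and hashes it with SHA-1:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class; the name is illustrative only.
final class EmptyFileChecksumDemo {

    // Creates a new empty temporary file and returns its SHA-1 as lowercase hex.
    static String sha1HexOfNewEmptyFile() {
        try {
            Path empty = Files.createTempFile("empty", ".bin"); // zero bytes on creation
            try {
                byte[] hash = MessageDigest.getInstance("SHA-1").digest(Files.readAllBytes(empty));
                StringBuilder hex = new StringBuilder(hash.length * 2);
                for (byte b : hash) {
                    hex.append(String.format("%02x", b));
                }
                return hex.toString();
            } finally {
                Files.delete(empty); // remove the temporary file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two separately created empty files yield the same value.
        System.out.println(sha1HexOfNewEmptyFile());
        System.out.println(sha1HexOfNewEmptyFile());
    }
}
```

Each call prints `da39a3ee5e6b4b0d3255bfef95601890afd80709`, the SHA-1 digest of zero bytes of input.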
Applications of File Checksums
Based on these findings, storing each file's checksum alongside its path, and checking new uploads against the stored checksums, can effectively prevent duplicate uploads of identical content. When handling empty files, be aware that they all share the same fixed checksum and will therefore be treated as duplicates of one another. Java 12 also introduced Files.mismatch, which compares the content of two files directly and is worth exploring. Beyond deduplication and tamper detection, checksums have many other uses; feel free to discuss them.
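One way the checksum-to-path idea might look in code is sketched below. Everything here is an assumption made for illustration: the `UploadIndex` class, its `register` method, and the in-memory map are hypothetical, and a real service would more likely keep the mapping in a database with a unique constraint on the checksum column. The sketch also streams the upload through the digest in chunks rather than calling `Files.readAllBytes`, which avoids holding a large file in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory index from content checksum to stored path.
final class UploadIndex {
    private final Map<String, String> pathByChecksum = new ConcurrentHashMap<>();

    // Streams the content through SHA-256 in 8 KiB chunks and returns lowercase hex.
    static String sha256Hex(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Records the upload unless the same content is already known.
    // Returns the path of the earlier upload when the content is a duplicate.
    Optional<String> register(InputStream content, String storedPath) {
        String checksum = sha256Hex(content);
        return Optional.ofNullable(pathByChecksum.putIfAbsent(checksum, storedPath));
    }

    public static void main(String[] args) {
        UploadIndex index = new UploadIndex();
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        // First upload of this content: nothing to report.
        System.out.println(index.register(new ByteArrayInputStream(payload), "/uploads/a.txt"));
        // Same content under a different name and type: reports the earlier path.
        System.out.println(index.register(new ByteArrayInputStream(payload), "/uploads/b.md"));
    }
}
```

With this design, every empty upload maps to the same key, so a service that needs to accept multiple empty files would have to special-case zero-length content before consulting the index.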
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
