How to Build an Image Duplicate Detection System

This article explains how to construct an image duplicate and near‑duplicate detection system, compares five similarity methods (Euclidean distance, SSIM, image hashing, cosine similarity, and CNN‑based feature similarity), provides Python implementations, evaluates them on two datasets, and discusses speed, accuracy, and robustness results.

Code DAO

Objectives

Understand the difference between an image duplicate finder and a content‑based image retrieval (CBIR) system.

Learn five different methods for comparing similar images.

Implement each method in Python.

Determine how image transformations affect the overall performance of the algorithms.

Guide the selection of the best method for a given application based on speed and accuracy (including experiments).

Basic Architecture

A query image is the image supplied by the user. The similarity block searches the dataset for images that are similar to the query and computes a similarity score (see Figure 1).

Duplicate Finder vs. CBIR

The main difference is that a duplicate/near‑duplicate finder only detects identical or almost identical images (Figure 2), whereas a CBIR system searches for visually similar regions and returns images that best match those regions (Figure 3).

Five Common Methods for Comparing Similar Images

1. Euclidean Distance

Euclidean distance is the straight‑line distance between two points, also known as the L2 norm of their difference [8]. Images are flattened into vectors, and the distance between two image vectors x and y is computed using the standard formula.
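Written out, for flattened image vectors x and y of length n:

```latex
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} = \lVert x - y \rVert_2
```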

Implementation in Python (using SciPy):

import numpy as np
from scipy.spatial import distance
from PIL import Image

image1 = Image.open("path/to/image")
image2 = Image.open("path/to/image")
# Note: images have to be of equal size and mode
# Cast to float first: subtracting uint8 arrays would wrap around
vec1 = np.asarray(image1, dtype=np.float64).flatten()
vec2 = np.asarray(image2, dtype=np.float64).flatten()
value = distance.euclidean(vec1, vec2)

Implementation using NumPy's linalg.norm:

import numpy as np
from PIL import Image

image1 = Image.open("path/to/image")
image2 = Image.open("path/to/image")
# Note: images have to be of equal size and mode
# Cast to float first: subtracting uint8 arrays would wrap around
value = np.linalg.norm(np.asarray(image1, dtype=np.float64) - np.asarray(image2, dtype=np.float64))

2. Structural Similarity Index (SSIM)

The SSIM metric was introduced in the paper "Image Quality Assessment: From Error Visibility to Structural Similarity" (Wang et al., 2004) [1] and yields a score between -1 and 1 (in practice usually between 0 and 1), where 1 indicates that the two images are identical.

It is also used to assess compression quality and transmission loss [2]. The three factors that influence SSIM are luminance, contrast, and structure [3].
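The three comparison terms combine into the familiar closed form, where the μ and σ terms are local means, variances, and covariance of the two images, and C₁, C₂ are small stabilizing constants:

```latex
\mathrm{SSIM}(x, y) =
\frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}
     {(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
```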

Python implementation:

from SSIM_PIL import compare_ssim
from PIL import Image

image1 = Image.open("path/to/image")
image2 = Image.open("path/to/image")
# Note: images have to be of equal size
value = compare_ssim(image1, image2, GPU=False)  # 1.0 indicates perfect similarity

3. Image Hashing

Image hashing creates a digital fingerprint for each image. The average hash works by resizing the image (e.g., 8×8), converting to grayscale, computing the mean pixel value, and setting each pixel to 1 if above the mean, otherwise 0. The 64‑bit result can be compared using Hamming distance.
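The procedure above can be sketched directly with PIL and NumPy (a minimal illustration; the helper names are hypothetical, and the `imagehash` library used below is the practical choice):

```python
import numpy as np
from PIL import Image

def average_hash(image, hash_size=8):
    """Minimal average-hash sketch following the steps above (illustrative helper)."""
    # Shrink and convert to grayscale so only coarse structure remains
    small = image.convert("L").resize((hash_size, hash_size), Image.LANCZOS)
    pixels = np.asarray(small, dtype=np.float64)
    # One bit per pixel: 1 if the pixel is above the mean, 0 otherwise
    return (pixels > pixels.mean()).flatten()

def hamming_distance(h1, h2):
    # Number of differing bits; 0 means the hashes are identical
    return int(np.count_nonzero(h1 != h2))
```

Because the comparison happens between fixed-size bit strings, the two input images do not need to have the same dimensions.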

Example hash:

1011111101100001110001110000111101101111100001110000001100001001

Python implementation (using imagehash):

import imagehash
from PIL import Image

image1 = Image.open("path/to/image")
image2 = Image.open("path/to/image")
# Note: images DO NOT have to be of equal size
hash1 = imagehash.average_hash(image1)
hash2 = imagehash.average_hash(image2)
value = hash1 - hash2  # Hamming distance; 0 means identical

4. Cosine Similarity

Cosine similarity measures the angle between two vectors: the smaller the angle, the higher the similarity [9].
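For flattened image vectors x and y, the score is the cosine of the angle between them:

```latex
\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
```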

Python implementation with PyTorch:

from torch import nn
from PIL import Image
from torchvision import transforms

image1 = Image.open("path/to/image")
image2 = Image.open("path/to/image")
# Note: images have to be of equal size
image1_tensor = transforms.ToTensor()(image1).reshape(1, -1).squeeze()
image2_tensor = transforms.ToTensor()(image2).reshape(1, -1).squeeze()
cos = nn.CosineSimilarity(dim=0)
value = float(cos(image1_tensor, image2_tensor))  # 1.0 = identical

5. Feature Similarity Using CNNs

Convolutional Neural Networks extract high‑level features (edges, shapes, textures). Similarity can be measured by comparing these feature vectors.

EfficientNet‑b0 is used as the backbone. After extracting features, Euclidean distance or cosine similarity can be applied.

EfficientNet‑b0 + Euclidean distance:

from efficientnet_pytorch import EfficientNet
import numpy as np
from PIL import Image
from torchvision import transforms

model = EfficientNet.from_pretrained('efficientnet-b0')
model.eval()

image1 = Image.open("path/to/image")
image2 = Image.open("path/to/image")
# Note: images have to be of equal size
image1_tensor = transforms.ToTensor()(image1)
image2_tensor = transforms.ToTensor()(image2)
features1 = model.extract_features(image1_tensor.unsqueeze(0))
features2 = model.extract_features(image2_tensor.unsqueeze(0))
value = round(np.linalg.norm(np.array(features1.detach()) - np.array(features2.detach())), 4)

EfficientNet‑b0 + Cosine similarity:

from efficientnet_pytorch import EfficientNet
from PIL import Image
from torchvision import transforms
from torch import nn

model = EfficientNet.from_pretrained('efficientnet-b0')
model.eval()

image1 = Image.open("path/to/image")
image2 = Image.open("path/to/image")
# Note: images have to be of equal size
image1_tensor = transforms.ToTensor()(image1)
image2_tensor = transforms.ToTensor()(image2)
features1 = model.extract_features(image1_tensor.unsqueeze(0))
features2 = model.extract_features(image2_tensor.unsqueeze(0))
cos = nn.CosineSimilarity(dim=0)
value = round(float(cos(features1.reshape(1, -1).squeeze(), features2.reshape(1, -1).squeeze())), 4)

Datasets

A subset of the Fruits360 dataset (96 images of various sizes, licensed CC‑BY‑SA 4.0).

SFBench dataset (40 images of size 3024×4032, public domain).

Fruits360 provides many near‑duplicate images of fruit photographed from angles covering a full 360° rotation (Figure 11). SFBench is used to test robustness to transformations such as 3‑D projection and rotation.

Experiments

Experiment 1 – Speed and Accuracy

Steps:

Read images from the Fruits360 dataset.

Convert to RGB.

Resize to a fixed size.

Apply each of the five methods.

Retrieve the three most similar images for each query.

Measure average processing time per image pair (seconds).

Compute accuracy: a query counts as correct (100 %) if at least one of its three retrieved images is a duplicate or near‑duplicate.

Results (Table 1) show that cosine similarity dominates in accuracy, while image hashing offers the best speed‑accuracy trade‑off; CNN‑based feature similarity is ~250× slower than cosine similarity with comparable accuracy.

Experiment 2 – Robustness to Image Transformations

The same procedure is repeated on the SFBench dataset, focusing on how each method tolerates rotations, scaling, and 3‑D projections.

Feature‑based similarity (CNN) performs best because the network preserves spatial information, as summarized in Table 2.

Experiment 3 – SciPy distance.euclidean vs. NumPy linalg.norm Speed (Extra)

Approximately 2,300 repeated distance calculations were timed for both implementations. The results (Table 3) indicate comparable performance.
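A comparable micro-benchmark can be reproduced with `timeit`; the vector size is illustrative, and the repetition count roughly matches the experiment:

```python
import timeit

import numpy as np
from scipy.spatial import distance

# Two random vectors roughly the size of a flattened small image
x = np.random.rand(100_000)
y = np.random.rand(100_000)

n = 2_300  # about the number of repetitions used in the experiment
t_scipy = timeit.timeit(lambda: distance.euclidean(x, y), number=n)
t_numpy = timeit.timeit(lambda: np.linalg.norm(x - y), number=n)
print(f"scipy: {t_scipy:.3f}s  numpy: {t_numpy:.3f}s")
```

Both implementations compute the same quantity, so any difference is purely overhead.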

Conclusion

The article presented the concepts of Euclidean distance, SSIM, image hashing, cosine similarity, and CNN‑based feature similarity, and evaluated their sensitivity to image transformations. Cosine similarity offers the best overall accuracy, image hashing provides the fastest runtime, and CNN‑based methods excel when robustness to complex transformations is required.

References

[1] Wang et al., "Image Quality Assessment: From Error Visibility to Structural Similarity", 2004.

[2] Imatest LLC, SSIM: Structural Similarity Index, v.22.1.

[3] Datta, "All about Structural Similarity Index (SSIM): Theory + Code in PyTorch", 2020.

[4] Mathematics LibreTexts, "A Further Applications of Trigonometry: Vectors", 2021.

[5] Nagella, "Cosine Similarity Vs Euclidean Distance", 2019.

[6] The Content Blockchain Project, "Testing Different Image Hash Functions", 2019.

[7] Krawetz, "Looks Like It", 2011.

[8] Gohrani, "Different Types of Distance Metrics used in Machine Learning", 2019.

[9] Clay, "How to calculate Cosine Similarity (With code)", 2020.
