
Information Fingerprint and Simhash Algorithm for Large-Scale Duplicate Detection

This article explains the concept of information fingerprints, compares traditional set‑equality methods, introduces the Simhash algorithm for high‑dimensional text similarity reduction, and demonstrates how partitioned 64‑bit fingerprints enable efficient duplicate detection on massive web data.

360 Quality & Efficiency

An information fingerprint is a short pseudo‑random number that uniquely identifies a piece of data such as a text, video, or audio file, much like a human fingerprint identifies a person; it is irreversible, meaning the original content cannot be recovered from the fingerprint.

Common fingerprinting algorithms include MD5 and SHA‑1, which map variable‑length data to fixed‑length 128‑bit and 160‑bit binary numbers, respectively.
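As a concrete illustration, both digests are available in Python's standard `hashlib`; interpreting the digest bytes as an integer is one convenient way to treat it as a fingerprint:

```python
import hashlib

def fingerprint_md5(data: bytes) -> int:
    """128-bit MD5 fingerprint, represented as an integer."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")

def fingerprint_sha1(data: bytes) -> int:
    """160-bit SHA-1 fingerprint, represented as an integer."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

# The same input always yields the same fixed-width fingerprint,
# regardless of how long the input is.
fp = fingerprint_md5(b"https://example.com/some/very/long/url")
print(fp.bit_length() <= 128)  # True
```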

Web crawler example: with 5 × 10^11 URLs (each longer than 100 characters) requiring about 50 TB of storage, a traditional hash‑table approach roughly doubles that to ~100 TB, whereas mapping each URL to a 128‑bit fingerprint needs only 16 bytes per URL, about 8 TB in total.
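The storage arithmetic above checks out directly (the 100‑byte average URL length is the figure the example assumes):

```python
TB = 10**12

n_urls = 5 * 10**11      # number of URLs
url_bytes = 100          # ~100 bytes of text per URL
fp_bytes = 16            # 128-bit fingerprint = 16 bytes

raw = n_urls * url_bytes        # raw URL text
hashed = 2 * raw                # a hash table roughly doubles storage
as_fps = n_urls * fp_bytes      # fingerprints only

print(raw // TB, hashed // TB, as_fps // TB)  # 50 100 8 (in TB)
```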

Set equality methods: four approaches are compared – pairwise element comparison (O(N²)), sorting both sets and then comparing (O(N log N)), inserting one set into a hash table and probing with the other (O(N) time, O(N) extra space), and comparing set fingerprints (O(N) time, O(1) extra space).

The set fingerprint is computed as the sum of the element fingerprints (modulo 2^w for w‑bit fingerprints); because addition is commutative and associative, element order does not affect the result, so two sets with equal fingerprints are equal with high probability.
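A minimal sketch of this idea, assuming 64‑bit element fingerprints derived from truncated MD5 (an illustrative choice, not a hash the article prescribes):

```python
import hashlib

MASK64 = (1 << 64) - 1

def element_fp(item: str) -> int:
    """64-bit fingerprint of one element (first 8 bytes of MD5)."""
    return int.from_bytes(hashlib.md5(item.encode()).digest()[:8], "big")

def set_fingerprint(items) -> int:
    """Sum of element fingerprints modulo 2**64.

    Addition is commutative and associative, so the iteration
    order of the elements cannot change the result.
    """
    total = 0
    for item in items:
        total = (total + element_fp(item)) & MASK64
    return total

a = set_fingerprint(["x", "y", "z"])
b = set_fingerprint(["z", "x", "y"])
print(a == b)  # True: same elements, different order
```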

Simhash algorithm: Designed for text similarity on long documents, it maps high‑dimensional feature vectors to an f‑bit fingerprint and uses Hamming distance to measure similarity.

Traditional cosine similarity suffers from high dimensionality and computational cost; Simhash reduces dimensionality by converting features into a compact fingerprint.

Steps of Simhash: (1) Map the high‑dimensional vector to an f‑bit fingerprint; (2) Compare two fingerprints by calculating their Hamming distance.
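The two steps can be sketched as follows, with word tokens weighted by term frequency serving as the features (the tokenization and weighting scheme here are illustrative assumptions, not the only possible choices):

```python
import hashlib
from collections import Counter

F = 64  # fingerprint width in bits

def _feature_hash(token: str) -> int:
    """64-bit hash of one feature (truncated MD5, an illustrative choice)."""
    return int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")

def simhash(text: str) -> int:
    """Step 1: map a high-dimensional feature vector to an f-bit fingerprint.

    Each feature votes +weight or -weight on every bit position,
    according to the corresponding bit of its hash; the sign of the
    accumulated vote decides the final bit.
    """
    weights = Counter(text.split())   # term frequency as feature weight
    v = [0] * F
    for token, w in weights.items():
        h = _feature_hash(token)
        for i in range(F):
            v[i] += w if (h >> i) & 1 else -w
    fp = 0
    for i in range(F):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a: int, b: int) -> int:
    """Step 2: similarity is the number of differing bit positions."""
    return bin(a ^ b).count("1")

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox jumped over the lazy dog"
print(hamming(simhash(d1), simhash(d2)))  # small for near-duplicate texts
```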

Google’s web deduplication with Simhash: for 8 billion pages, a 64‑bit Simhash is used with a Hamming‑distance threshold of 3. Two practical methods are enumerating all fingerprints near a query (C(64,3) = 41 664 at distance exactly 3; about 43 700 in total for distances 0–3) or pre‑computing lookup tables keyed on each 16‑bit block.

By splitting the 64‑bit fingerprint into four 16‑bit blocks, the pigeonhole principle guarantees that two fingerprints within Hamming distance 3 agree exactly on at least one block, since 3 differing bits cannot fall into all four blocks. Indexing each fingerprint under all four of its blocks therefore shrinks the search from ~1.7 × 10^10 pairwise comparisons to roughly 10^6 candidate checks, dramatically improving efficiency.
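A sketch of this block‑partitioned lookup, using a hypothetical `NearDupIndex` class that keeps one hash table per 16‑bit block:

```python
from collections import defaultdict

BLOCK_BITS = 16
NUM_BLOCKS = 4
BLOCK_MASK = (1 << BLOCK_BITS) - 1

def blocks(fp: int):
    """Split a 64-bit fingerprint into four 16-bit blocks."""
    return [(fp >> (i * BLOCK_BITS)) & BLOCK_MASK for i in range(NUM_BLOCKS)]

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

class NearDupIndex:
    """One table per block position: by the pigeonhole principle, any
    fingerprint within Hamming distance 3 of a query shares at least
    one 16-bit block exactly, so only those buckets need scanning."""

    def __init__(self):
        self.tables = [defaultdict(list) for _ in range(NUM_BLOCKS)]

    def add(self, fp: int):
        for i, b in enumerate(blocks(fp)):
            self.tables[i][b].append(fp)

    def query(self, fp: int, k: int = 3):
        """Return indexed fingerprints within Hamming distance k of fp."""
        seen, hits = set(), []
        for i, b in enumerate(blocks(fp)):
            for cand in self.tables[i][b]:
                if cand not in seen:
                    seen.add(cand)
                    if hamming(cand, fp) <= k:
                        hits.append(cand)
        return hits
```

Only the buckets matching one of the query's four blocks are scanned, which is what cuts the candidate set from billions of fingerprints to on the order of a million.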

Big Data, Duplicate Detection, hash algorithms, Simhash, information fingerprint, set similarity
Written by

360 Quality & Efficiency

360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.
