
Information Fingerprint and Simhash Algorithm for Large-Scale Duplicate Detection

This article explains the concept of information fingerprints, compares traditional set‑equality methods, introduces the Simhash algorithm for high‑dimensional text similarity reduction, and demonstrates how partitioned 64‑bit fingerprints enable efficient duplicate detection on massive web data.

360 Quality & Efficiency

An information fingerprint is a short pseudo‑random number that uniquely identifies a piece of data such as a text, video, or audio file, much like a human fingerprint identifies a person; it is irreversible, meaning the original content cannot be recovered from the fingerprint.

Common fingerprinting algorithms include MD5 and SHA‑1, which map variable‑length data to fixed‑length 128‑bit and 160‑bit binary numbers, respectively.
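As a concrete illustration, both digests are available in Python's standard `hashlib`; interpreting the digest bytes as an integer is one convenient way to treat it as a fingerprint:

```python
import hashlib

def fingerprint_md5(data: bytes) -> int:
    """128-bit MD5 fingerprint, represented as an integer."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")

def fingerprint_sha1(data: bytes) -> int:
    """160-bit SHA-1 fingerprint, represented as an integer."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

# The same input always yields the same fixed-width fingerprint,
# regardless of how long the input is.
fp = fingerprint_md5(b"https://example.com/some/very/long/url")
print(fp.bit_length() <= 128)  # True
```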

Web crawler example: with 5 × 10^11 URLs (each longer than 100 characters) requiring about 50 TB of storage, a traditional hash‑table approach roughly doubles that to ~100 TB, whereas mapping each URL to a 128‑bit fingerprint needs only 16 bytes per URL, about 8 TB in total.
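The storage arithmetic above checks out directly (the 100‑byte average URL length is the figure the example assumes):

```python
TB = 10**12

n_urls = 5 * 10**11      # number of URLs
url_bytes = 100          # ~100 bytes of text per URL
fp_bytes = 16            # 128-bit fingerprint = 16 bytes

raw = n_urls * url_bytes        # raw URL text
hashed = 2 * raw                # a hash table roughly doubles storage
as_fps = n_urls * fp_bytes      # fingerprints only

print(raw // TB, hashed // TB, as_fps // TB)  # 50 100 8 (in TB)
```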

Set equality methods: four approaches are compared – pairwise element comparison (O(N²)), sorting both sets and then comparing (O(N log N)), inserting one set into a hash table and probing with the other (O(N) time, O(N) extra space), and comparing set fingerprints (O(N) time, O(1) extra space).

The set fingerprint is computed as the sum of the element fingerprints (modulo 2^w for w‑bit fingerprints); because addition is commutative and associative, element order does not affect the result, so two sets with equal fingerprints are equal with high probability.
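A minimal sketch of this idea, assuming 64‑bit element fingerprints derived from truncated MD5 (an illustrative choice, not a hash the article prescribes):

```python
import hashlib

MASK64 = (1 << 64) - 1

def element_fp(item: str) -> int:
    """64-bit fingerprint of one element (first 8 bytes of MD5)."""
    return int.from_bytes(hashlib.md5(item.encode()).digest()[:8], "big")

def set_fingerprint(items) -> int:
    """Sum of element fingerprints modulo 2**64.

    Addition is commutative and associative, so the iteration
    order of the elements cannot change the result.
    """
    total = 0
    for item in items:
        total = (total + element_fp(item)) & MASK64
    return total

a = set_fingerprint(["x", "y", "z"])
b = set_fingerprint(["z", "x", "y"])
print(a == b)  # True: same elements, different order
```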

Simhash algorithm: Designed for text similarity on long documents, it maps high‑dimensional feature vectors to an f‑bit fingerprint and uses Hamming distance to measure similarity.

Traditional cosine similarity suffers from high dimensionality and computational cost; Simhash reduces dimensionality by converting features into a compact fingerprint.

Steps of Simhash: (1) Map the high‑dimensional vector to an f‑bit fingerprint; (2) Compare two fingerprints by calculating their Hamming distance.
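The two steps can be sketched as follows, with word tokens weighted by term frequency serving as the features (the tokenization and weighting scheme here are illustrative assumptions, not the only possible choices):

```python
import hashlib
from collections import Counter

F = 64  # fingerprint width in bits

def _feature_hash(token: str) -> int:
    """64-bit hash of one feature (truncated MD5, an illustrative choice)."""
    return int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")

def simhash(text: str) -> int:
    """Step 1: map a high-dimensional feature vector to an f-bit fingerprint.

    Each feature votes +weight or -weight on every bit position,
    according to the corresponding bit of its hash; the sign of the
    accumulated vote decides the final bit.
    """
    weights = Counter(text.split())   # term frequency as feature weight
    v = [0] * F
    for token, w in weights.items():
        h = _feature_hash(token)
        for i in range(F):
            v[i] += w if (h >> i) & 1 else -w
    fp = 0
    for i in range(F):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a: int, b: int) -> int:
    """Step 2: similarity is the number of differing bit positions."""
    return bin(a ^ b).count("1")

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox jumped over the lazy dog"
print(hamming(simhash(d1), simhash(d2)))  # small for near-duplicate texts
```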

Google’s web deduplication with Simhash: for 8 billion pages, a 64‑bit Simhash is used with a Hamming‑distance threshold of 3. Two practical methods are enumerating all fingerprints near a query (C(64,3) = 41 664 at distance exactly 3; about 43 700 in total for distances 0–3) or pre‑computing lookup tables keyed on each 16‑bit block.

By splitting the 64‑bit fingerprint into four 16‑bit blocks, the pigeonhole principle guarantees that two fingerprints within Hamming distance 3 agree exactly on at least one block, since 3 differing bits cannot fall into all four blocks. Indexing each fingerprint under all four of its blocks therefore shrinks the search from ~1.7 × 10^10 pairwise comparisons to roughly 10^6 candidate checks, dramatically improving efficiency.
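A sketch of this block‑partitioned lookup, using a hypothetical `NearDupIndex` class that keeps one hash table per 16‑bit block:

```python
from collections import defaultdict

BLOCK_BITS = 16
NUM_BLOCKS = 4
BLOCK_MASK = (1 << BLOCK_BITS) - 1

def blocks(fp: int):
    """Split a 64-bit fingerprint into four 16-bit blocks."""
    return [(fp >> (i * BLOCK_BITS)) & BLOCK_MASK for i in range(NUM_BLOCKS)]

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

class NearDupIndex:
    """One table per block position: by the pigeonhole principle, any
    fingerprint within Hamming distance 3 of a query shares at least
    one 16-bit block exactly, so only those buckets need scanning."""

    def __init__(self):
        self.tables = [defaultdict(list) for _ in range(NUM_BLOCKS)]

    def add(self, fp: int):
        for i, b in enumerate(blocks(fp)):
            self.tables[i][b].append(fp)

    def query(self, fp: int, k: int = 3):
        """Return indexed fingerprints within Hamming distance k of fp."""
        seen, hits = set(), []
        for i, b in enumerate(blocks(fp)):
            for cand in self.tables[i][b]:
                if cand not in seen:
                    seen.add(cand)
                    if hamming(cand, fp) <= k:
                        hits.append(cand)
        return hits
```

Only the buckets matching one of the query's four blocks are scanned, which is what cuts the candidate set from billions of fingerprints to on the order of a million.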

Big Data, Duplicate Detection, hash algorithms, Simhash, information fingerprint, set similarity
Written by

360 Quality & Efficiency

360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.
