
How SimHash and Cosine Similarity Accelerate Large‑Scale Text Deduplication

This article explains why massive news feeds need efficient deduplication, compares cosine similarity and SimHash for measuring text similarity, walks through a step‑by‑step implementation with Java code, and shows how a space‑for‑time indexing strategy can reduce duplicate‑detection complexity from O(n²) to near O(1).

Sohu Tech Products

In personalized news‑feed scenarios, crawlers collect enormous numbers of articles, many of them near‑duplicates, which bloats storage and degrades the user experience. Naïve pairwise comparison works only for tiny datasets, so we need an algorithm that stays accurate while lowering the time complexity.

1. Cosine Similarity Basics

The core idea of cosine similarity is to treat each sentence as a vector and compute the cosine of the angle between the two vectors. The smaller the angle, the more similar the sentences.

Example sentences:

Sentence A: 我/喜欢/看/电视,不/喜欢/看/电影 ("I like watching TV, I don't like watching movies")

Sentence B: 我/不/喜欢/看/电视,也/不/喜欢/看/电影 ("I don't like watching TV, and I don't like watching movies either")

Steps:

Tokenize the sentences.

List all unique words: 我 (I), 喜欢 (like), 看 (watch), 电视 (TV), 电影 (movie), 不 (not), 也 (also).

Count term frequencies for each sentence.

Sentence A vector: [1,2,2,1,1,1,0]
Sentence B vector: [1,2,2,1,1,2,1]

Compute the cosine of the angle between the two vectors. A cosine of 1 (angle 0°) means identical direction (maximum similarity), while a cosine of 0 (angle 90°) means the vectors are completely unrelated.
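The steps above can be sketched in Java; the vectors are the term‑frequency vectors from step 3 (the class and method names are illustrative, not from the original article):

```java
public class CosineSimilarity {
    // Cosine similarity = dot(a, b) / (|a| * |b|)
    public static double cosine(int[] a, int[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 2, 1, 1, 1, 0}; // sentence A term frequencies
        int[] b = {1, 2, 2, 1, 1, 2, 1}; // sentence B term frequencies
        System.out.printf("%.3f%n", cosine(a, b)); // prints 0.938
    }
}
```

For the two example sentences the similarity is about 0.94, i.e., very similar, as expected.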

Cosine similarity yields precise results but becomes computationally expensive for massive text collections because each vector must be built and compared.

2. SimHash Algorithm Overview

SimHash, proposed by Google, maps a document to a 64‑bit binary string such that similar texts produce hashes with small Hamming distances. The Hamming distance between two hashes serves as a similarity metric.

Example strings:

t1 = "直击儿科急诊现状忙碌不止 儿科接诊进行时" ("A firsthand look at the pediatric ER, busy without pause; pediatric admissions in progress")

t2 = "儿科急诊现状直击不停忙碌 儿科接诊进行时" (the same words with the first clause reordered)

Although the two sentences differ only in word order and a few characters, a conventional hash (e.g., MD5) would produce completely different values, while their SimHash values differ in only three bit positions. A Hamming distance of 3 or less is typically used as the similarity threshold.

Advantages of SimHash:

Fast hash computation.

Similarity measured by simple bitwise operations.

Scales well to large corpora.
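Because similarity reduces to counting differing bits, the check itself is a single XOR plus a popcount. A minimal sketch (the class name and sample values are illustrative):

```java
public class HammingDistance {
    // XOR leaves a 1 in every bit position where the two 64-bit
    // SimHash values differ; bitCount counts those positions.
    public static int distance(long h1, long h2) {
        return Long.bitCount(h1 ^ h2);
    }

    public static void main(String[] args) {
        long h1 = 0b1011_0110L;
        long h2 = 0b1010_0111L;
        // differs in 2 bit positions, under the threshold of 3
        System.out.println(distance(h1, h2)); // prints 2
    }
}
```

On modern CPUs `Long.bitCount` compiles down to a single popcount instruction, which is why this comparison is so much cheaper than building and comparing term‑frequency vectors.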

3. Detailed SimHash Implementation Steps

Tokenization and weighting: Split the text into words, remove stop words, and assign each word a weight (e.g., its tf‑idf score); this example uses weight levels 1‑5.

Hash each token: Apply a 64‑bit hash function (e.g., FNV‑1a) to each word.

import java.math.BigInteger;

// 64-bit FNV-1a parameters
private static final BigInteger FNV_64_INIT = new BigInteger("cbf29ce484222325", 16); // offset basis
private static final BigInteger FNV_64_PRIME = new BigInteger("100000001b3", 16);
private static final BigInteger MASK_64 = BigInteger.ONE.shiftLeft(64).subtract(BigInteger.ONE);

public static BigInteger fnv1aHash64(String str) {
    BigInteger hash = FNV_64_INIT;
    int len = str.length();
    for (int i = 0; i < len; i++) {
        hash = hash.xor(BigInteger.valueOf(str.charAt(i))); // XOR in the next character
        hash = hash.multiply(FNV_64_PRIME);                 // multiply by the FNV prime
    }
    hash = hash.and(MASK_64); // keep only the low 64 bits
    return hash;
}

Weighted accumulation: For each bit position of a token’s hash, add the token’s weight if the bit is 1 and subtract it if the bit is 0, producing a 64‑dimensional feature vector for the document.
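A minimal sketch of this accumulation step, assuming each token already has a weight and a 64‑bit hash (the `FeatureAccumulator` class and its two input maps are illustrative, not the article's exact code):

```java
import java.math.BigInteger;
import java.util.Map;

public class FeatureAccumulator {
    static final int HASH_BITS = 64;

    // For every token: add its weight to positions where its hash bit is 1,
    // subtract it where the bit is 0. Positions are indexed from the most
    // significant bit, matching the signature-generation loop below.
    public static int[] accumulate(Map<String, Integer> tokenWeights,
                                   Map<String, BigInteger> tokenHashes) {
        int[] featureVector = new int[HASH_BITS];
        for (Map.Entry<String, Integer> e : tokenWeights.entrySet()) {
            BigInteger hash = tokenHashes.get(e.getKey());
            int weight = e.getValue();
            for (int i = 0; i < HASH_BITS; i++) {
                // testBit(0) is the least-significant bit, so flip the index
                boolean bitSet = hash.testBit(HASH_BITS - i - 1);
                featureVector[i] += bitSet ? weight : -weight;
            }
        }
        return featureVector;
    }
}
```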

Signature generation: Convert the feature vector to a binary signature: bits with non‑negative values become 1, the rest become 0.

// Map the feature vector onto the final 64-bit signature:
// a non-negative component becomes bit 1, a negative one bit 0.
for (int i = 0; i < FNVHash.HASH_BITS; i++) {
    if (featureVector[i] >= 0) {
        // set the corresponding bit, counting from the most significant end
        signature = signature.add(BigInteger.ONE.shiftLeft(FNVHash.HASH_BITS - i - 1));
        hashBuffer.append("1");
    } else {
        hashBuffer.append("0");
    }
}
this.hash = hashBuffer.toString(); // binary-string form, used for segment keys
this.signature = signature;        // numeric form, used for Hamming distance

Index construction: Split the 64‑bit SimHash into four 16‑bit fragments, store each fragment as a Redis key, and keep the document ID, title, full hash, URL, and timestamp as the value.

// key each 16-bit fragment so documents sharing that fragment can be found
String simHashFragment1 = simHash.substring(0, 16);
String redisKey1 = content_index_name + "_" + simHashFragment1;
NewRedisCrud.set2list(redisKey1, value, oid);
// repeat for fragments 2-4 (substrings 16-32, 32-48, 48-64)

4. Space‑for‑Time Trade‑off

With N documents, directly comparing every pair of hashes requires O(N²) time. Instead, each 64‑bit SimHash is partitioned into four 16‑bit segments, and an inverted index is built with each segment as a key. By the pigeonhole principle, two hashes within Hamming distance ≤ 3 must agree exactly on at least one of the four segments, so a lookup only needs to examine documents sharing at least one identical segment. This reduces the average lookup cost to near O(1), with a worst case of O(N) when a segment value is extremely common. The trade‑off is a memory increase roughly proportional to the number of segments.

Typical settings use four segments of 16 bits each, an acceptable space expansion in exchange for fast duplicate detection.
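As a sketch of this indexing scheme, the following replaces Redis with an in‑memory HashMap and keeps only document IDs as values; the class and method names are hypothetical, not the article's production code:

```java
import java.util.*;

public class SimHashIndex {
    static final int SEGMENTS = 4;
    static final int SEG_BITS = 16;

    // segment key -> ids of documents sharing that 16-bit fragment at that position
    private final Map<String, List<Long>> index = new HashMap<>();
    private final Map<Long, Long> hashes = new HashMap<>(); // doc id -> full SimHash

    private static String key(int seg, long hash) {
        long fragment = (hash >>> (seg * SEG_BITS)) & 0xFFFFL;
        return seg + "_" + fragment;
    }

    public void add(long docId, long simHash) {
        hashes.put(docId, simHash);
        for (int s = 0; s < SEGMENTS; s++) {
            index.computeIfAbsent(key(s, simHash), k -> new ArrayList<>()).add(docId);
        }
    }

    // By the pigeonhole principle, any hash within Hamming distance 3 shares
    // at least one 16-bit segment, so only those candidates need a full check.
    public List<Long> findDuplicates(long simHash, int maxDistance) {
        Set<Long> candidates = new LinkedHashSet<>();
        for (int s = 0; s < SEGMENTS; s++) {
            candidates.addAll(index.getOrDefault(key(s, simHash), List.of()));
        }
        List<Long> result = new ArrayList<>();
        for (long id : candidates) {
            if (Long.bitCount(hashes.get(id) ^ simHash) <= maxDistance) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SimHashIndex idx = new SimHashIndex();
        idx.add(42L, 0x1234_5678_9ABC_DEF0L);
        // a hash differing in two bits still shares three segments and is found
        System.out.println(idx.findDuplicates(0x1234_5678_9ABC_DEF0L ^ 0b11L, 3)); // prints [42]
    }
}
```

The same structure maps directly onto the Redis layout from section 3: each segment key becomes a Redis list key, and the candidate set is the union of the four list lookups.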

5. Conclusion

SimHash provides a significant speed advantage for large‑scale text deduplication compared with traditional cosine similarity, especially when dealing with billions of documents. While it excels in high‑throughput scenarios, its accuracy on very short texts can be lower, so the choice of algorithm should consider the specific business requirements.

algorithm · Big Data · cosine similarity · SimHash · Text Deduplication · Near-Duplicate Detection
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
