How SimHash and Cosine Similarity Accelerate Large-Scale Text Deduplication

This article explains why pairwise text comparison is impractical for massive news corpora, introduces cosine similarity and SimHash as deduplication techniques, walks through their mathematical foundations and implementation steps with code examples, and discusses trade-offs such as accuracy versus speed.

Sohu Smart Platform Tech Team

1. Cosine Similarity

Cosine similarity judges sentence similarity from the angle between two term-frequency vectors. After tokenizing two example sentences, a vocabulary list is built, term frequencies are counted, and each sentence is represented as a vector (e.g., A:[1,2,2,1,1,1,0], B:[1,2,2,1,1,2,1]). The cosine of the angle between these vectors indicates how similar the sentences are. While accurate, computing pairwise cosine similarity across a massive corpus is computationally expensive.
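As a minimal sketch, the cosine between the two count vectors above can be computed like this (the class and method names are illustrative, not part of the team's codebase):

```java
public class CosineSimilarity {
    // cos(theta) = A·B / (|A| * |B|)
    public static double cosine(int[] a, int[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];      // accumulate the dot product
            normA += (double) a[i] * a[i];    // squared length of A
            normB += (double) b[i] * b[i];    // squared length of B
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

For the vectors A and B above, the dot product is 13 and the norms are √12 and 4, giving a cosine of roughly 0.94 — the two sentences are highly similar.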

2. SimHash Algorithm

SimHash, proposed by Google, maps a document to a binary fingerprint. Similar texts produce fingerprints that differ in only a few bits. The Hamming distance between two fingerprints quantifies similarity, and a distance ≤ 3 is typically treated as a duplicate.
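Comparing two fingerprints thus reduces to counting differing bits, which takes only a few lines (a minimal sketch; the class name is illustrative):

```java
public class HammingDistance {
    // Number of bit positions where two 64-bit fingerprints differ:
    // XOR leaves a 1 in exactly those positions, then count the 1s.
    public static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```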

The figure shows two example strings and their SimHash fingerprints, highlighting the three differing bits.

3. SimHash Implementation Steps

1) Tokenization and weighting: Split the text with a tokenizer such as ansj, remove stop words, and assign each remaining term a weight (1–5) derived from TF-IDF.

// Tokenize with ansj; `forest` (user dictionary) and `stopWords`
// are initialized elsewhere.
public static List<String> splitWords(String str) {
    List<String> splitWords = new ArrayList<>(1000);
    Result terms = ToAnalysis.parse(str, forest);
    for (int i = 0; i < terms.size(); i++) {
        Term term = terms.get(i);
        String word = term.getName();
        // keep non-empty tokens that are not stop words
        if (!"".equals(word.trim()) && !stopWords.contains(word)) {
            splitWords.add(word);
        }
    }
    return splitWords;
}

// Map each remaining term to its TF-IDF-derived weight
public Map<String, Double> extract(String str) {
    List<String> words = WordsSegment.splitWords(str);
    // calculate tf and idf, build word map …
    return wordmap;
}

2) Hash each term: Compute a 64-bit hash (e.g., FNV-1a) for every term.

// Standard 64-bit FNV-1a constants
private static final BigInteger FNV_64_INIT = new BigInteger("cbf29ce484222325", 16);
private static final BigInteger FNV_64_PRIME = new BigInteger("100000001b3", 16);
private static final BigInteger MASK_64 = BigInteger.ONE.shiftLeft(64).subtract(BigInteger.ONE);

public static BigInteger fnv1aHash64(String str) {
    BigInteger hash = FNV_64_INIT;
    int len = str.length();
    for (int i = 0; i < len; i++) {
        hash = hash.xor(BigInteger.valueOf(str.charAt(i))); // XOR in the next code unit
        hash = hash.multiply(FNV_64_PRIME);                 // multiply by the FNV prime
    }
    hash = hash.and(MASK_64); // keep only the low 64 bits
    return hash;
}

3) Weighted accumulation: For each bit position of a term's hash, add the term's weight to that position if the bit is 1 and subtract it if the bit is 0, accumulating a 64-dimensional feature vector over all terms.

4) Dimensionality reduction: Convert the feature vector to the binary SimHash fingerprint by mapping positions with value ≥ 0 to 1 and positions with value < 0 to 0.
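Steps 3 and 4 can be sketched together as follows. This is an illustrative sketch, not the team's exact implementation: it uses a plain `long`-based FNV-1a for compactness, and the class and method names are assumptions.

```java
import java.util.Map;

public class SimHashSketch {
    // 64-bit FNV-1a over the term's UTF-16 code units (illustrative choice)
    public static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Steps 3-4: weighted accumulation per bit, then sign-based reduction
    public static long simhash(Map<String, Double> termWeights) {
        double[] v = new double[64];
        for (Map.Entry<String, Double> e : termWeights.entrySet()) {
            long h = fnv1a64(e.getKey());
            double w = e.getValue();
            for (int bit = 0; bit < 64; bit++) {
                // add the weight if the hash bit is 1, subtract if it is 0
                v[bit] += ((h >>> bit) & 1) == 1 ? w : -w;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (v[bit] >= 0) fingerprint |= 1L << bit; // positions >= 0 become 1
        }
        return fingerprint;
    }
}
```

Because only the sign of each accumulated position survives, a few changed terms flip only a few bits — which is exactly why similar documents end up within a small Hamming distance.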

4. Space‑Time Trade‑off for Faster Deduplication

Directly comparing all SimHash fingerprints has O(n²) complexity. To achieve near‑O(1) lookup, each 64‑bit fingerprint is split into four 16‑bit segments, which serve as keys in an inverted index (e.g., stored in Redis). Documents sharing any identical segment are placed in the same bucket, allowing parallel deduplication of candidate groups.

The chart shows accuracy and recall versus Hamming distance, indicating that a threshold of 3 balances both.

Index construction stores each 16‑bit segment as a Redis key; when a new document arrives, its four keys are queried to retrieve candidate duplicates.
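The segment split relies on the pigeonhole principle: if two 64-bit fingerprints differ in at most 3 bits, at least one of their four 16-bit segments must match exactly. A minimal sketch (the key format is a hypothetical example, not the team's actual Redis schema):

```java
public class SegmentIndexSketch {
    // Split a 64-bit fingerprint into four 16-bit segments; two fingerprints
    // within Hamming distance 3 agree exactly on at least one segment.
    public static int[] segments(long fingerprint) {
        int[] segs = new int[4];
        for (int i = 0; i < 4; i++) {
            segs[i] = (int) ((fingerprint >>> (i * 16)) & 0xFFFFL);
        }
        return segs;
    }

    // Hypothetical inverted-index key (e.g., a Redis set per key holding doc IDs)
    public static String key(int segIndex, int segValue) {
        return "simhash:" + segIndex + ":" + Integer.toHexString(segValue);
    }
}
```

Looking up a new document then means querying its four segment keys, unioning the candidate buckets, and running the exact Hamming-distance check only against those candidates.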

5. Conclusion

SimHash provides a fast, space‑efficient solution for large‑scale text deduplication, dramatically reducing computational cost compared to traditional cosine similarity. However, its accuracy on short texts can be lower, so the choice of algorithm should consider the specific data characteristics and business requirements.
