Artificial Intelligence 29 min read

How Simhash and Minhash Power LLM Data Deduplication: Theory and Spark Code

This article explains document‑level, paragraph‑level, and sentence‑level deduplication for large‑scale LLM pre‑training, introduces the Simhash and Minhash algorithms with step‑by‑step Python examples, and shows how to implement efficient LSH‑based deduplication using Spark.

Baobao Algorithm Notes

Oct 25, 2024

How Simhash and Minhash Power LLM Data Deduplication: Theory and Spark Code

Simhash Algorithm

Simhash starts by tokenizing a document with a BPE tokenizer (e.g., Qwen, LLaMA, or OpenAI's tokenizer). Each token is hashed using MD5, converted to a 128‑bit binary string, and mapped to a vector where bits equal to 1 become 1 and bits equal to 0 become -1. The document vector is obtained by summing all token vectors and then clipping the result to the range 0 – 1. Similarity between two documents is measured by the Hamming distance of their clipped vectors.

from hashlib import md5
import numpy as np

def hash(text):
    return md5(text.encode()).hexdigest()

words = ['不能','复现','的','软件','不算','开源软件']
for word in words:
    print(f"{word}:{hash(word)}")

def encode_word(word_hash):
    bin_hash = bin(int(word_hash, 16))[2:].zfill(128)
    return np.array([1 if bit == '1' else -1 for bit in bin_hash])

word_embeddings = [encode_word(hash(w)) for w in words]
print(np.clip(sum(word_embeddings), 0, 1))

The clipping step ensures that the final vector contains only 0 and 1 values, making Hamming distance a suitable similarity metric. Simhash works well because adding or removing a few tokens only slightly perturbs the summed vector, keeping the clipped result stable for near‑duplicate documents.

Minhash Algorithm

Minhash shares the first two steps with Simhash (tokenization and MD5 hashing). The third step differs: for each of N hash functions, the minimum hash value among all tokens becomes one dimension of the document vector. The implementation uses a base MD5 hash transformed by linear functions to generate many pseudo‑independent hash functions.

def hash_new(text, a, b, prime=4294967311):
    return (int(hash(text), 16) * a + b) % prime

import random
random.seed(3407)
N = 200
A = [random.randint(1, (1 << 32) - 1) for _ in range(N)]
B = [random.randint(1, (1 << 32) - 1) for _ in range(N)]

doc_embedding = []
for i in range(N):
    doc_embedding.append(min([hash_new(word, A[i], B[i]) for word in words]))
print(doc_embedding)

Similarity is computed as the Jaccard similarity between the sets of minima from two documents. Minhash is less sensitive to token frequency because it only records the smallest hash per function.

Locality‑Sensitive Hashing (LSH) for Scalable Deduplication

Pairwise similarity computation is O(n²) and infeasible for massive corpora. LSH reduces complexity by partitioning each document embedding into K+1 bands, hashing each band, and placing documents with identical band hashes into the same bucket. Documents that share at least one bucket are then compared directly.

The probability that two documents with similarity s share a bucket is 1 - (1 - s^r)^b, where r is the rows per band and b is the number of bands. Adjusting b balances recall and computational cost.

Spark‑Based Minhash Deduplication Pipeline

The following Spark pipeline demonstrates a practical large‑scale deduplication workflow:

from argparse import ArgumentParser
from pyspark.sql import SparkSession

parser = ArgumentParser()
parser.add_argument('data', nargs=1, type=str)
parser.add_argument('--num-gpus', type=int, default=800)
parser.add_argument('--threshold', type=float, default=0.9)
args = parser.parse_args()

input_path = ','.join(args.data)
output_path = f"{input_path.rstrip('/')}_result"
qs_temp_path = f"{output_path}_tmp_qs"
minhash_temp_path = f"{output_path}_tmp_minhash"
dd_temp_path = f"{output_path}_tmp_dd"

spark = SparkSession.builder.appName("LLM Deduplication").getOrCreate()

Quality scores (qs) are computed on GPU‑accelerated models and attached to each record to decide which duplicate to keep. The datasketch library provides Minhash and LSH utilities.

from transformers import AutoTokenizer
from datasketch import MinHash, MinHashLSH

minhash_tokenizer = AutoTokenizer.from_pretrained('Your tokenizer')
LSH = MinHashLSH(threshold=args.threshold, num_perm=200)

# Example N‑gram helper (3‑gram)
class Ngram:
    def __init__(self, n):
        self.n = n
    def __call__(self, tokens):
        return [''.join(tokens[i:i+self.n]) for i in range(len(tokens)-self.n+1)]

gram_3 = Ngram(3)

Each document is tokenized, converted to 3‑grams, and fed into a Minhash object:

def get_minhash(line):
    text = line['text']
    tokens = gram_3(minhash_tokenizer.tokenize(text))
    m = MinHash(num_perm=200)
    m.update_batch([t.encode() for t in tokens])
    line['minhash'] = list(m.digest())
    return line

After computing Minhash signatures, the pipeline splits each signature into bands, hashes each band, and groups documents by band hash to perform pairwise Jaccard similarity checks only within the same bucket.

def get_LSH_keys(hash_list, bands):
    rows = len(hash_list) // bands
    keys = []
    for i in range(bands):
        start = i * rows
        end = (i + 1) * rows
        segment = hash_list[start:end]
        keys.append('band:{}_'.format(i) + md5(','.join(map(str, segment))).hexdigest())
    return keys

Deduplication logic keeps the document with the higher quality score and discards the other when the Jaccard similarity exceeds the threshold.

def jaccard_similarity(sig1, sig2):
    assert len(sig1) == len(sig2)
    return sum(1 for a, b in zip(sig1, sig2) if a == b) / len(sig1)

def deduplicate(group):
    key, docs = group
    docs = sorted(docs, key=lambda d: d['hashindex'])
    docs.sort(key=lambda d: d.get('qs', 0), reverse=True)
    retained = []
    for doc in docs:
        if not retained:
            retained.append(doc)
            continue
        for r in retained:
            if jaccard_similarity(r['minhash'], doc['minhash']) >= args.threshold:
                doc['sim_doc'] = r['hashindex']
                break
        if 'sim_doc' not in doc:
            retained.append(doc)
        else:
            yield {'hashindex': doc['hashindex'], 'sim_doc': doc['sim_doc']}

Finally, the results are merged across all bands, broadcast as a dictionary, and used to filter the original RDD, producing a deduplicated dataset.

dedup_dict = sc.broadcast(sc.textFile(dd_temp_path)
    .map(lambda x: json.loads(x))
    .map(lambda x: (x['hashindex'], x['sim_doc']))
    .collectAsMap())

output_rdd = minhash_rdd.filter(lambda x: x['hashindex'] not in dedup_dict.value)
output_rdd.map(json.dumps).saveAsTextFile(output_path)

Conclusion

Data deduplication is essential for LLM pre‑training because redundant documents can cause over‑representation of certain facts while wasting compute resources. Simhash excels at capturing subtle token‑level changes, whereas Minhash is robust to token frequency and works well with Jaccard similarity. Combining either algorithm with LSH dramatically reduces the pairwise comparison cost, enabling scalable deduplication on billions of documents using Spark.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Python LLM Spark data deduplication SimHash Minhash

Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.