How Simhash and Minhash Power LLM Data Deduplication: Theory and Spark Code
This article explains document‑level, paragraph‑level, and sentence‑level deduplication for large‑scale LLM pre‑training, introduces the Simhash and Minhash algorithms with step‑by‑step Python examples, and shows how to implement efficient LSH‑based deduplication using Spark.
Simhash Algorithm
Simhash starts by tokenizing a document with a BPE tokenizer (e.g., Qwen, LLaMA, or OpenAI's tokenizer). Each token is hashed using MD5, converted to a 128‑bit binary string, and mapped to a vector where bits equal to 1 become 1 and bits equal to 0 become -1. The document vector is obtained by summing all token vectors and then clipping the result to the range 0 – 1. Similarity between two documents is measured by the Hamming distance of their clipped vectors.
from hashlib import md5
import numpy as np
def hash(text):
return md5(text.encode()).hexdigest()
words = ['不能','复现','的','软件','不算','开源软件']
for word in words:
print(f"{word}:{hash(word)}")
def encode_word(word_hash):
bin_hash = bin(int(word_hash, 16))[2:].zfill(128)
return np.array([1 if bit == '1' else -1 for bit in bin_hash])
word_embeddings = [encode_word(hash(w)) for w in words]
print(np.clip(sum(word_embeddings), 0, 1))The clipping step ensures that the final vector contains only 0 and 1 values, making Hamming distance a suitable similarity metric. Simhash works well because adding or removing a few tokens only slightly perturbs the summed vector, keeping the clipped result stable for near‑duplicate documents.
Minhash Algorithm
Minhash shares the first two steps with Simhash (tokenization and MD5 hashing). The third step differs: for each of N hash functions, the minimum hash value among all tokens becomes one dimension of the document vector. The implementation uses a base MD5 hash transformed by linear functions to generate many pseudo‑independent hash functions.
def hash_new(text, a, b, prime=4294967311):
return (int(hash(text), 16) * a + b) % prime
import random
random.seed(3407)
N = 200
A = [random.randint(1, (1 << 32) - 1) for _ in range(N)]
B = [random.randint(1, (1 << 32) - 1) for _ in range(N)]
doc_embedding = []
for i in range(N):
doc_embedding.append(min([hash_new(word, A[i], B[i]) for word in words]))
print(doc_embedding)Similarity is computed as the Jaccard similarity between the sets of minima from two documents. Minhash is less sensitive to token frequency because it only records the smallest hash per function.
Locality‑Sensitive Hashing (LSH) for Scalable Deduplication
Pairwise similarity computation is O(n²) and infeasible for massive corpora. LSH reduces complexity by partitioning each document embedding into K+1 bands, hashing each band, and placing documents with identical band hashes into the same bucket. Documents that share at least one bucket are then compared directly.
The probability that two documents with similarity s share a bucket is 1 - (1 - s^r)^b, where r is the rows per band and b is the number of bands. Adjusting b balances recall and computational cost.
Spark‑Based Minhash Deduplication Pipeline
The following Spark pipeline demonstrates a practical large‑scale deduplication workflow:
from argparse import ArgumentParser
from pyspark.sql import SparkSession
parser = ArgumentParser()
parser.add_argument('data', nargs=1, type=str)
parser.add_argument('--num-gpus', type=int, default=800)
parser.add_argument('--threshold', type=float, default=0.9)
args = parser.parse_args()
input_path = ','.join(args.data)
output_path = f"{input_path.rstrip('/')}_result"
qs_temp_path = f"{output_path}_tmp_qs"
minhash_temp_path = f"{output_path}_tmp_minhash"
dd_temp_path = f"{output_path}_tmp_dd"
spark = SparkSession.builder.appName("LLM Deduplication").getOrCreate()Quality scores (qs) are computed on GPU‑accelerated models and attached to each record to decide which duplicate to keep. The datasketch library provides Minhash and LSH utilities.
from transformers import AutoTokenizer
from datasketch import MinHash, MinHashLSH
minhash_tokenizer = AutoTokenizer.from_pretrained('Your tokenizer')
LSH = MinHashLSH(threshold=args.threshold, num_perm=200)
# Example N‑gram helper (3‑gram)
class Ngram:
def __init__(self, n):
self.n = n
def __call__(self, tokens):
return [''.join(tokens[i:i+self.n]) for i in range(len(tokens)-self.n+1)]
gram_3 = Ngram(3)Each document is tokenized, converted to 3‑grams, and fed into a Minhash object:
def get_minhash(line):
text = line['text']
tokens = gram_3(minhash_tokenizer.tokenize(text))
m = MinHash(num_perm=200)
m.update_batch([t.encode() for t in tokens])
line['minhash'] = list(m.digest())
return lineAfter computing Minhash signatures, the pipeline splits each signature into bands, hashes each band, and groups documents by band hash to perform pairwise Jaccard similarity checks only within the same bucket.
def get_LSH_keys(hash_list, bands):
rows = len(hash_list) // bands
keys = []
for i in range(bands):
start = i * rows
end = (i + 1) * rows
segment = hash_list[start:end]
keys.append('band:{}_'.format(i) + md5(','.join(map(str, segment))).hexdigest())
return keysDeduplication logic keeps the document with the higher quality score and discards the other when the Jaccard similarity exceeds the threshold.
def jaccard_similarity(sig1, sig2):
assert len(sig1) == len(sig2)
return sum(1 for a, b in zip(sig1, sig2) if a == b) / len(sig1)
def deduplicate(group):
key, docs = group
docs = sorted(docs, key=lambda d: d['hashindex'])
docs.sort(key=lambda d: d.get('qs', 0), reverse=True)
retained = []
for doc in docs:
if not retained:
retained.append(doc)
continue
for r in retained:
if jaccard_similarity(r['minhash'], doc['minhash']) >= args.threshold:
doc['sim_doc'] = r['hashindex']
break
if 'sim_doc' not in doc:
retained.append(doc)
else:
yield {'hashindex': doc['hashindex'], 'sim_doc': doc['sim_doc']}Finally, the results are merged across all bands, broadcast as a dictionary, and used to filter the original RDD, producing a deduplicated dataset.
dedup_dict = sc.broadcast(sc.textFile(dd_temp_path)
.map(lambda x: json.loads(x))
.map(lambda x: (x['hashindex'], x['sim_doc']))
.collectAsMap())
output_rdd = minhash_rdd.filter(lambda x: x['hashindex'] not in dedup_dict.value)
output_rdd.map(json.dumps).saveAsTextFile(output_path)Conclusion
Data deduplication is essential for LLM pre‑training because redundant documents can cause over‑representation of certain facts while wasting compute resources. Simhash excels at capturing subtle token‑level changes, whereas Minhash is robust to token frequency and works well with Jaccard similarity. Combining either algorithm with LSH dramatically reduces the pairwise comparison cost, enabling scalable deduplication on billions of documents using Spark.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
