Understanding Simhash: From Traditional Hash to Random Projection LSH
This article explains the principles and implementation of Simhash, covering the shortcomings of traditional hash functions, the use of cosine similarity, random projection for dimensionality reduction, locality‑sensitive hashing, and practical optimizations for large‑scale duplicate detection.
Preface
The WeChat public account platform owes much of its commercial value to strong original-content protection: if you try to claim someone else's article as your own original work, the platform rejects it with an error saying the originality check failed.
If you sniff the traffic, you will see an error response reporting exactly that failure.
Changing a few characters to evade detection does not work either. The detection relies on the Simhash algorithm, which Google used for large-scale web-page deduplication: Simhash generates a fingerprint for each article and compares fingerprints to assess similarity. This article explains how Simhash works.
The article’s structure is as follows:
Traditional Hash and Its Limitations
Cosine Theorem Implementation and Its Limitations
Dimensionality Reduction via Random Projection
Simhash Principle and Implementation
Traditional Hash and Its Limitations
To compare two articles for equality, a common approach is:
Hash the article with a function such as MD5 to obtain a fixed‑length string (e.g., 32 characters).
Compare the two fixed‑length strings for equality.
The first step maps a large space to a small one, producing a “fingerprint”. However, traditional hash functions are designed for high randomness: even a single character change yields a completely different hash, making them unsuitable for similarity measurement.
For example, the SHA1 digests of the strings "我是中国人" ("I am Chinese") and "我是中国人啊" differ almost completely even though the inputs differ by a single character; the digests therefore say nothing about similarity, and the only fallback would be an inefficient character-by-character comparison of the originals.
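This avalanche effect is easy to observe with Python's standard `hashlib` module; a minimal sketch hashing the two strings above:

```python
import hashlib

# SHA1 of two nearly identical strings; one extra character at the end
h1 = hashlib.sha1("我是中国人".encode("utf-8")).hexdigest()
h2 = hashlib.sha1("我是中国人啊".encode("utf-8")).hexdigest()

# Count hex positions where the two 40-character digests disagree;
# despite a one-character input change, most positions differ
differing = sum(a != b for a, b in zip(h1, h2))
print(h1)
print(h2)
print(f"{differing} of 40 hex characters differ")
```

The digests share essentially no structure, which is exactly what a cryptographic hash is designed to guarantee and exactly what makes it useless for similarity.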
We need a hash function that satisfies two conditions:
Supports local similarity.
Produces a result that is easy to compare.
To make comparison easy, the hash result can be converted to a fixed‑length binary string (0/1). By XOR‑ing two such strings, the number of 1s indicates the number of differing bits—this is the core idea of Simhash.
If the XOR of two fingerprints contains only two 1 bits, for example, the two fingerprints differ in exactly two positions.
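The XOR-and-count idea fits in a few lines of Python (the two 8-bit fingerprints here are made-up values for illustration):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where the two fingerprints differ."""
    return bin(a ^ b).count("1")

# Two hypothetical 8-bit fingerprints that differ in exactly two positions
x = 0b10110100
y = 0b10010110
print(hamming_distance(x, y))  # → 2
```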
Next, we examine how Simhash yields locally similar results. Its computation resembles cosine similarity, so we first review cosine similarity.
Cosine Theorem
A well-known exposition of this idea appears in Wu Jun's "The Beauty of Mathematics": represent each document as an n-dimensional vector; the cosine of the angle between two vectors measures their similarity. Consider two sentences:
Sentence A: 我喜欢看电视,不喜欢看电影。(I like watching TV and don't like watching movies.)
Sentence B: 我不喜欢看电视,也不喜欢看电影。(I don't like watching TV, nor do I like watching movies.)
Step 1: Tokenization.
Sentence A: 我/喜欢/看/电视,不/喜欢/看/电影。
Sentence B: 我/不/喜欢/看/电视,也/不/喜欢/看/电影。
Step 2: List all keywords.
我,喜欢,看,电视,电影,不,也。
Side note: stop words such as "的", "地", "得" carry little information and are typically filtered out or down-weighted (e.g., via TF-IDF).
Step 3: Compute term frequencies.
Sentence A: 我 1, 喜欢 2, 看 2, 电视 1, 电影 1, 不 1, 也 0。
Sentence B: 我 1, 喜欢 2, 看 2, 电视 1, 电影 1, 不 2, 也 1。
Step 4: Form frequency vectors.
Sentence A: [1, 2, 2, 1, 1, 1, 0]
Sentence B: [1, 2, 2, 1, 1, 2, 1]
Note: In practice TF-IDF weights are used instead of raw counts.
The similarity problem becomes computing the cosine of the angle between the two vectors; a cosine close to 1 means a small angle and high similarity.
The cosine formula is:

cos θ = (A · B) / (|A| |B|) = Σᵢ AᵢBᵢ / ( √(Σᵢ Aᵢ²) × √(Σᵢ Bᵢ²) )
Applying the formula: A · B = 13, |A| = √12, |B| = √16 = 4, so cos θ = 13 / (4√12) ≈ 0.938, i.e., 93.8% similarity, which matches intuition. However, cosine similarity requires many multiplications and square roots on high-dimensional vectors, making it unsuitable for massive web-scale deduplication.
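A minimal standard-library Python check of this computation, using the frequency vectors built above:

```python
import math

def cosine_similarity(a, b):
    """cos θ = (A · B) / (|A| |B|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

A = [1, 2, 2, 1, 1, 1, 0]  # Sentence A
B = [1, 2, 2, 1, 1, 2, 1]  # Sentence B
print(round(cosine_similarity(A, B), 3))  # → 0.938
```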
Dimensionality Reduction via Random Projection
Meaning of Vector Dot Product
Random projection relies on vector dot products. Understanding dot product is essential.
In a 2-D space, the dot product of two vectors a = (x₁, y₁) and b = (x₂, y₂) is defined as:

a · b = x₁x₂ + y₁y₂ = |a| |b| cos θ

Geometrically, it is the length of the projection of one vector onto the other, multiplied by the length of the vector being projected onto.
Consider projecting a vector onto a new axis: its coordinates collapse from 2-D to 1-D. In general, projecting onto K random Gaussian vectors maps an M-dimensional vector down to K dimensions while approximately preserving pairwise angles (the Johnson–Lindenstrauss lemma).
Random Projection Dimensionality Reduction & LSH
After random projection, the resulting coordinates are often floating‑point numbers, which are costly to store and compute. By discretizing each coordinate to 0 or 1 (negative values become 0, non‑negative become 1), we obtain a binary representation suitable for fast similarity checks.
This discretization is known as Locality‑Sensitive Hashing (LSH) based on random projection. The binary vectors retain proximity relationships while enabling rapid Hamming‑distance calculations.
Random Hyperplane Hash
Random hyperplane hashing is a variant of the above LSH. For an n‑dimensional vector v, to obtain an f‑bit signature (f ≪ n):
Randomly generate f n‑dimensional vectors r₁…r_f.
For each r_i, if the dot product v·r_i > 0, set the i‑th bit to 1; otherwise set it to 0.
This effectively places f random hyperplanes in space; each hyperplane splits the space into two halves, assigning a 1 or 0 depending on which side v lies.
In a 2-D illustration with three random lines, each line splits the plane in two, and points falling into different regions receive different 3-bit signatures such as 110.
The final f‑bit signature is the article’s fingerprint; comparing fingerprints via XOR reveals the number of differing bits, i.e., the Hamming distance.
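The two steps above can be sketched directly in Python; the Gaussian hyperplanes, the seed, and the 7-dimensional input vector are illustrative assumptions:

```python
import random

def hyperplane_signature(v, planes):
    """One bit per hyperplane: 1 if v lies on its positive side, else 0."""
    bits = []
    for r in planes:
        dot = sum(x * c for x, c in zip(v, r))
        bits.append("1" if dot > 0 else "0")
    return "".join(bits)

random.seed(0)
n, f = 7, 6  # n-dimensional vectors, f-bit signature (f << n in practice)
planes = [[random.gauss(0, 1) for _ in range(n)] for _ in range(f)]

a = [1, 2, 2, 1, 1, 1, 0]
scaled = [2 * x for x in a]  # scaling never moves a vector across a hyperplane
print(hyperplane_signature(a, planes) == hyperplane_signature(scaled, planes))  # → True
```

Because each bit depends only on which side of a hyperplane the vector lies, the signature is invariant to positive scaling and changes only gradually as the direction of the vector changes.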
Simhash Principle and Implementation
Simhash builds on random hyperplane hashing. Its generation process consists of five steps: tokenization → hashing → weighting → aggregation → dimensionality reduction.
Tokenization : Split the text into meaningful words and assign a weight (e.g., 1‑5) to each term.
Hash : Hash each term to a binary code (e.g., "美国" → 100101, "51区" → 101011). Convert 0 to –1 to spread vectors across quadrants.
Weighting : Multiply each bit of the term’s hash by its weight, producing weighted vectors.
Aggregation : Sum the weighted vectors element‑wise, yielding a single vector.
Dimensionality Reduction : Convert each element to 1 if positive, otherwise 0, forming the final Simhash signature.
Example (simplified to two terms, 美国 with weight 4 and 51区 with weight 5). The term hashes with 0 mapped to -1:
100101 → 1 -1 -1 1 -1 1
101011 → 1 -1 1 -1 1 1
The weighted vectors become:
美国: 4 -4 -4 4 -4 4
51区: 5 -5 5 -5 5 5
Aggregating yields (9 -9 1 -1 1 9); after dimensionality reduction the final 6-bit Simhash is 101011.
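The weighting → aggregation → reduction steps can be sketched in a few lines of Python, reproducing the toy example (the 6-bit term hashes and the weights 4 and 5 are the assumed values from above):

```python
def simhash(weighted_hashes, f=6):
    """weighted_hashes: list of (bit-string, weight) pairs for each term."""
    vec = [0] * f
    for bits, weight in weighted_hashes:
        for i, b in enumerate(bits):
            # a 1 bit contributes +weight, a 0 bit contributes -weight
            vec[i] += weight if b == "1" else -weight
    # dimensionality reduction: positive component → 1, otherwise → 0
    return "".join("1" if v > 0 else "0" for v in vec)

terms = [("100101", 4),   # 美国, weight 4
         ("101011", 5)]   # 51区, weight 5
print(simhash(terms))  # → 101011
```

A real implementation would tokenize the text, hash each term to 64 bits, and derive weights from term frequencies, but the aggregation logic is exactly this.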
In practice Simhash signatures are 64 bits; two documents are considered similar if their Hamming distance ≤ 3.
Simhash Query Optimization
Directly XOR‑comparing a 64‑bit signature against billions of entries is infeasible. Using the pigeonhole principle, the signature can be split into four 16‑bit blocks. If two signatures differ in ≤ 3 bits, at least one block must match exactly.
Side note: the relevant pigeonhole argument is that placing three apples into four drawers leaves at least one drawer empty; likewise, spreading at most three differing bits over four 16-bit blocks leaves at least one block completely unchanged.
We store each article as four key‑value pairs: each 16‑bit block serves as a key (K) and the remaining three blocks as the value (V). During a query we first look up exact matches on K (O(1) time). For each matching K we then compare the corresponding V’s, reducing the number of full‑signature comparisons dramatically.
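A minimal sketch of this index, assuming 64-bit integer signatures (the block layout and function names are illustrative, and full signatures are stored as values for simplicity rather than the remaining three blocks):

```python
def blocks(sig):
    """Split a 64-bit signature into four 16-bit blocks."""
    return [(sig >> (16 * i)) & 0xFFFF for i in range(4)]

def build_index(signatures):
    index = {}
    for sig in signatures:
        for block_no, key in enumerate(blocks(sig)):
            # key on (block position, 16-bit value); store the full signature
            index.setdefault((block_no, key), []).append(sig)
    return index

def near_duplicates(index, sig, max_dist=3):
    candidates = set()
    for block_no, key in enumerate(blocks(sig)):
        candidates.update(index.get((block_no, key), ()))
    # verify each candidate with a full Hamming-distance check
    return [c for c in candidates if bin(c ^ sig).count("1") <= max_dist]

db = [0x0123456789ABCDEF,          # original
      0x0123456789ABCDEF ^ 0b101,  # 2 bits flipped: a near-duplicate
      0xFFFF0000FFFF0000]          # unrelated signature
index = build_index(db)
print(len(near_duplicates(index, 0x0123456789ABCDEF)))  # → 2
```

Only signatures sharing at least one exact 16-bit block with the query are ever compared in full, which is what collapses a billion XORs down to a few thousand.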
Assuming a database of 2³⁰ (~1 billion) entries:
Without the pigeonhole trick: ~1 billion comparisons.
With the trick: about 4 × 2¹⁴ = 65,536 comparisons on average (each 16-bit key matches roughly 2³⁰ / 2¹⁶ = 2¹⁴ entries).
This space‑for‑time trade‑off reduces query time by several orders of magnitude at the cost of roughly four‑fold increased storage.
Simhash Drawbacks
Simhash works best for massive, long texts. For short texts the Hamming‑distance threshold of 3 often yields false negatives, which is why WeChat requires articles to exceed 300 characters for originality claims.
Conclusion
The key to mastering Simhash is understanding random hyperplane hashing, which enables high‑dimensional vectors to be projected into low‑dimensional binary fingerprints. Many existing tutorials skip the dimensionality‑reduction step, leaving readers confused. This article provides a step‑by‑step walkthrough and references for deeper study.
References
https://www.cnblogs.com/shaosks/p/9121774.html
A discussion of locality-sensitive hashing and its underlying ideas: https://my.oschina.net/u/4367429/blog/3261406
http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html
https://zhuanlan.zhihu.com/p/81026564
http://www.hanting.tech/2017/05/23/simhash.html
https://zhuanlan.zhihu.com/p/92155250
https://blog.csdn.net/sunny_ss12/article/details/46949315
https://cloud.tencent.com/developer/article/1189493
https://www.cnblogs.com/sddai/p/10088007.html
https://www.cnblogs.com/hxsyl/p/4518506.html
https://www.iyunying.org/seo/dataanalysis/152232.html
https://cloud.tencent.com/developer/article/1390215
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.