Understanding Simhash: From Traditional Hash to Random Projection and LSH
This article explains the principles behind Simhash, covering the shortcomings of traditional hash functions, the use of cosine similarity, random projection for dimensionality reduction, locality‑sensitive hashing, random hyperplane hashing, implementation steps, query optimization with the pigeonhole principle, and the algorithm's limitations in short‑text scenarios.
It is widely known that the WeChat public platform uses a strict originality verification mechanism; when an article fails the check, WeChat returns an error indicating that the content does not pass the original‑content validation. The error originates from the Simhash technique, which Google invented for large‑scale web deduplication and is now used in many large‑scale similarity‑detection systems.
Traditional hash functions (e.g., MD5, SHA1) map a document to a fixed‑length string, but they cannot express partial similarity because even a single character change produces a completely different hash. To compare documents, we need a hash that preserves local similarity and yields results that are easy to compare.
Simhash achieves this by converting the hash result into a binary vector of 0s and 1s; the Hamming distance between two vectors directly reflects document similarity.
The article then introduces cosine similarity: by representing each document as an n‑dimensional vector (using TF‑IDF or simple term frequencies) and computing the angle between vectors, we obtain a similarity score. However, cosine similarity is computationally expensive for massive corpora because it requires high‑dimensional vector operations.
Random projection offers a solution: it reduces an n‑dimensional vector to a lower‑dimensional space while approximately preserving angles, as guaranteed by the Johnson–Lindenstrauss lemma. The basic operation is the vector dot product, and random vectors drawn from a Gaussian distribution define the projection.
After projection, the resulting floating‑point coordinates are discretized to binary values (0 or 1) by assigning negative components to 0 (or –1) and non‑negative components to 1. This discretization yields a locality‑sensitive hash (LSH) that can be compared with simple XOR operations.
Random hyperplane hashing is a specific LSH: for each of f random hyperplanes, the sign of the dot product between the document vector and the hyperplane determines a bit of the signature. The final f‑bit signature is the Simhash fingerprint.
Simhash generation consists of five steps: tokenization, hashing each token, weighting, aggregation, and dimensionality reduction. The article provides a concrete example with two tokens (“USA” and “Area 51”), showing how token hashes are converted to ±1 vectors, weighted, summed, and finally binarized to produce the 6‑bit Simhash shown below.
100101 ----> 1-1-11-11
101011 ----> 1-11-111The resulting 64‑bit Simhash can be compared using Hamming distance; a distance of three bits or fewer typically indicates similarity. To avoid exhaustive pairwise XOR on billions of signatures, the pigeonhole principle is applied: the 64‑bit fingerprint is split into four 16‑bit blocks, and any matching block serves as a key (K) for a hash table, dramatically reducing the number of comparisons from ~10⁹ to ~6×10⁴ while increasing storage by a factor of four.
Despite its efficiency for massive long texts, Simhash performs poorly on short texts because the Hamming‑distance threshold of three bits is often too strict; this explains why WeChat requires articles to exceed 300 characters for originality claims.
In summary, understanding Simhash hinges on grasping random‑hyperplane LSH, which enables high‑dimensional vectors to be compressed into compact binary fingerprints suitable for large‑scale similarity detection.
References
https://www.cnblogs.com/shaosks/p/9121774.html
https://my.oschina.net/u/4367429/blog/3261406
http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html
https://zhuanlan.zhihu.com/p/81026564
http://www.hanting.tech/2017/05/23/simhash.html
https://zhuanlan.zhihu.com/p/92155250
https://blog.csdn.net/sunny_ss12/article/details/46949315
https://cloud.tencent.com/developer/article/1189493
https://www.cnblogs.com/sddai/p/10088007.html
https://www.cnblogs.com/hxsyl/p/4518506.html
https://www.iyunying.org/seo/dataanalysis/152232.html
https://cloud.tencent.com/developer/article/1390215
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.