Fundamentals 18 min read

Understanding Simhash: From Traditional Hash to Random Projection and LSH

This article explains the principles behind Simhash, covering the shortcomings of traditional hash functions, the use of cosine similarity, random projection for dimensionality reduction, locality‑sensitive hashing, random hyperplane hashing, implementation steps, query optimization with the pigeonhole principle, and the algorithm's limitations in short‑text scenarios.

Architect

Oct 18, 2021

Understanding Simhash: From Traditional Hash to Random Projection and LSH

It is widely known that the WeChat public platform uses a strict originality verification mechanism; when an article fails the check, WeChat returns an error indicating that the content does not pass the original‑content validation. The error originates from the Simhash technique, which Google invented for large‑scale web deduplication and is now used in many large‑scale similarity‑detection systems.

Traditional hash functions (e.g., MD5, SHA1) map a document to a fixed‑length string, but they cannot express partial similarity because even a single character change produces a completely different hash. To compare documents, we need a hash that preserves local similarity and yields results that are easy to compare.

Simhash achieves this by converting the hash result into a binary vector of 0s and 1s; the Hamming distance between two vectors directly reflects document similarity.

The article then introduces cosine similarity: by representing each document as an n‑dimensional vector (using TF‑IDF or simple term frequencies) and computing the angle between vectors, we obtain a similarity score. However, cosine similarity is computationally expensive for massive corpora because it requires high‑dimensional vector operations.

Random projection offers a solution: it reduces an n‑dimensional vector to a lower‑dimensional space while approximately preserving angles, as guaranteed by the Johnson–Lindenstrauss lemma. The basic operation is the vector dot product, and random vectors drawn from a Gaussian distribution define the projection.

After projection, the resulting floating‑point coordinates are discretized to binary values (0 or 1) by assigning negative components to 0 (or –1) and non‑negative components to 1. This discretization yields a locality‑sensitive hash (LSH) that can be compared with simple XOR operations.

Random hyperplane hashing is a specific LSH: for each of f random hyperplanes, the sign of the dot product between the document vector and the hyperplane determines a bit of the signature. The final f‑bit signature is the Simhash fingerprint.

Simhash generation consists of five steps: tokenization, hashing each token, weighting, aggregation, and dimensionality reduction. The article provides a concrete example with two tokens (“USA” and “Area 51”), showing how token hashes are converted to ±1 vectors, weighted, summed, and finally binarized to produce the 6‑bit Simhash shown below.

<code style="padding: 16px; color: black; display: -webkit-box; font-family: Operator Mono, Consolas, Monaco, Menlo, monospace; font-size: 12px"> 100101  ----> 1-1-11-11<br/> 101011  ----> 1-11-111<br/></code>

The resulting 64‑bit Simhash can be compared using Hamming distance; a distance of three bits or fewer typically indicates similarity. To avoid exhaustive pairwise XOR on billions of signatures, the pigeonhole principle is applied: the 64‑bit fingerprint is split into four 16‑bit blocks, and any matching block serves as a key (K) for a hash table, dramatically reducing the number of comparisons from ~10⁹ to ~6×10⁴ while increasing storage by a factor of four.

Despite its efficiency for massive long texts, Simhash performs poorly on short texts because the Hamming‑distance threshold of three bits is often too strict; this explains why WeChat requires articles to exceed 300 characters for originality claims.

In summary, understanding Simhash hinges on grasping random‑hyperplane LSH, which enables high‑dimensional vectors to be compressed into compact binary fingerprints suitable for large‑scale similarity detection.

References

https://www.cnblogs.com/shaosks/p/9121774.html

https://my.oschina.net/u/4367429/blog/3261406

http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html

https://zhuanlan.zhihu.com/p/81026564

http://www.hanting.tech/2017/05/23/simhash.html

https://zhuanlan.zhihu.com/p/92155250

https://blog.csdn.net/sunny_ss12/article/details/46949315

https://cloud.tencent.com/developer/article/1189493

https://www.cnblogs.com/sddai/p/10088007.html

https://www.cnblogs.com/hxsyl/p/4518506.html

https://www.iyunying.org/seo/dataanalysis/152232.html

https://cloud.tencent.com/developer/article/1390215

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.