
Spam Detection on Zhihu Using Text and Behavior Clustering with Jaccard and SimHash on Spark

This article describes how Zhihu combats large‑scale spam by applying text and behavior clustering techniques—using Jaccard similarity, SimHash fingerprinting, and Spark‑based graph partitioning—to efficiently identify and group similar spammy content and actions.

Architecture Digest

Authors: Zhou Aote, Sun Xian, Chen Lei. Source: Zhihu (https://zhuanlan.zhihu.com/p/23385044).

Zhihu’s spammers generate massive amounts of similar or densely clustered spam content and behaviors, prompting the use of clustering methods to discover and group them. Both content‑based and behavior‑based clustering are employed.

Clustering groups similar objects together; good clusters have high intra‑group similarity and low inter‑group similarity.

Similarity measurement is crucial; common algorithms include edit distance, cosine similarity, Jaccard similarity, and Pearson correlation. This work uses text‑based similarity algorithms, primarily Jaccard and SimHash.

Jaccard

Jaccard similarity is the size of the intersection divided by the size of the union of two sets.
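This definition translates directly into a few lines of set arithmetic. A minimal sketch (the empty-set convention of returning 1.0 is an assumption, not something the article specifies):

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are considered identical
    return len(a & b) / len(a | b)

# For text, the sets are typically token sets:
# jaccard(set("free spam offer".split()), set("free spam deal".split())) -> 0.5
```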

SimHash

SimHash, proposed by Charikar (2002) and later used at Google, creates an n‑bit fingerprint for a document; similar texts produce similar fingerprints. The similarity comparison involves computing the Hamming distance between fingerprints.

Tokenize the text and apply TF‑IDF weighting to reduce the impact of stop words.

Hash each token.

Weight the hash bits: for each bit position, add the token's weight if the bit is 1 and subtract it if the bit is 0.

Aggregate weighted hashes column‑wise to obtain a numeric sequence.

Dimensionality reduction: convert the numeric sequence to a binary string (positive numbers to 1, negative to 0).

Compare similarity by calculating the Hamming distance between two SimHash values.
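The steps above can be sketched as follows. This is a minimal illustration, not Zhihu's production code: MD5 is assumed as a stable per-token hash (any uniform hash works), and TF-IDF weights are taken as a precomputed input:

```python
import hashlib

BITS = 64  # fingerprint width, matching the article's 64-bit setup

def token_hash(token, bits=BITS):
    # Stable per-token hash; MD5 truncated to `bits` is an arbitrary choice.
    return int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)

def simhash(weighted_tokens, bits=BITS):
    # weighted_tokens: dict mapping token -> weight (e.g., TF-IDF).
    v = [0.0] * bits
    for token, weight in weighted_tokens.items():
        h = token_hash(token, bits)
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract where it is 0.
            v[i] += weight if (h >> i) & 1 else -weight
    # Dimensionality reduction: positive column sums become 1-bits.
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")
```

Two documents are then considered near-duplicates when the Hamming distance between their fingerprints falls within the chosen threshold k.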

Performance tests on 1,000,000 iterations showed:

Jaccard: 21,772 ms
SimHash: 9,981 ms

SimHash is at least twice as fast for short texts, with greater gains for longer texts, making it suitable for large‑scale similarity detection.

Experiments indicate that for a 64‑bit SimHash fingerprint, a Hamming distance threshold k=3 balances recall and precision (~75%). In practice, k=4 is used for higher recall, followed by a secondary Jaccard check to improve accuracy.

Given Zhihu’s daily web‑side write volume (≈10 million actions), a single‑process pairwise comparison is infeasible. Spark is adopted as the distributed computing framework, leveraging in‑memory processing, rich operators, and fault tolerance.

Content Clustering

Data preparation originally used HiveQL and Python scripts, which were inefficient at scale. Spark integrates seamlessly with Hive, converting HiveQL to Spark transformations and actions, achieving at least a ten‑fold speedup.

The clustering implementation builds a similarity graph G=(V,E) where vertices are documents and edge weights are derived from similarity scores (e.g., SimHash Hamming distance). Edges that pass the threshold (Hamming distance within k) are kept, and the connected components of the resulting subgraphs define the clusters.

Initially, GraphX’s connectedComponents API was used, but performance degraded on large datasets. Instead, Spark SQL was employed to find the minimal similar node for each document, effectively grouping nodes into clusters based on transitive similarity.
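The "minimal similar node" idea can be illustrated on a single machine: every document repeatedly adopts the smallest label seen among its similar neighbors until nothing changes, at which point each label names one cluster. This is a plain-Python sketch of the logic only; in production it runs as iterative Spark SQL self-joins over the pair table:

```python
def cluster_by_min_node(pairs):
    """Group documents into clusters by propagating the minimal id
    across similar pairs until a fixpoint is reached.

    pairs: iterable of (a, b) document ids judged similar.
    Returns: dict mapping each document id to its cluster label
             (the minimal id reachable through similar pairs).
    """
    label = {}
    for a, b in pairs:
        label.setdefault(a, a)
        label.setdefault(b, b)
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            m = min(label[a], label[b])
            if label[a] != m or label[b] != m:
                label[a] = label[b] = m
                changed = True
    return label
```

Documents 1, 2, 3 linked by pairs (1,2) and (2,3) all end up labeled 1, i.e., one cluster via transitive similarity, even though 1 and 3 were never directly compared.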

To reduce expensive Jaccard calculations, SimHash is used for a coarse grouping, followed by Jaccard refinement within each group. Spark’s broadcast variables cache similarity data in memory, minimizing communication overhead.
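The two-stage filter described here, a cheap SimHash screen followed by a Jaccard check only for surviving candidates, can be sketched as below. The k=4 and 0.7 thresholds are illustrative assumptions (the article specifies k=4 but not the Jaccard cutoff), and the pairwise loop stands in for the distributed comparison that Spark performs:

```python
def similar_pairs(docs, k=4, jaccard_threshold=0.7):
    """Two-stage similarity filter.

    docs: list of (doc_id, fingerprint, token_set) triples.
    Coarse stage: keep pairs whose SimHash Hamming distance is <= k.
    Fine stage: keep only pairs whose Jaccard similarity clears the threshold.
    """
    def hamming(a, b):
        return bin(a ^ b).count("1")

    def jaccard(s, t):
        return len(s & t) / len(s | t) if (s | t) else 0.0

    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            id_a, fp_a, tok_a = docs[i]
            id_b, fp_b, tok_b = docs[j]
            # Cheap bit-level screen first; expensive set math only on survivors.
            if hamming(fp_a, fp_b) <= k and jaccard(tok_a, tok_b) >= jaccard_threshold:
                pairs.append((id_a, id_b))
    return pairs
```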

Content clustering now completes within 1–3 minutes for private messages.

Behavior Clustering

Behavior paths are expressed as text strings combining request path, method, and time interval. The clustering pipeline mirrors content clustering; to handle the larger volume (≈300k+ key write actions daily), actions are clustered within each business type, which shrinks the comparison space.
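One way to serialize an action sequence into such a string is sketched below. The exact token format and the interval buckets are assumptions for illustration; the article only says that path, method, and time interval are combined into text that the same SimHash/Jaccard pipeline can cluster:

```python
def behavior_token(path, method, seconds_since_last):
    """Encode one action as a token; the interval is bucketed so that
    similar request rhythms map to the same token (bucket edges are
    illustrative, not Zhihu's actual values)."""
    if seconds_since_last < 1:
        bucket = "burst"
    elif seconds_since_last < 10:
        bucket = "fast"
    else:
        bucket = "slow"
    return f"{method}:{path}:{bucket}"

def behavior_text(actions):
    """Join a user's action sequence into one document for text clustering.

    actions: list of (path, method, seconds_since_last) triples.
    """
    return " ".join(behavior_token(p, m, dt) for p, m, dt in actions)
```

Two spam accounts replaying the same script at the same pace produce near-identical behavior texts, so they land in the same cluster exactly as near-duplicate spam content does.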

Conclusion

Clustering provides an effective, automated alternative to manual spam detection for both content and behavior. The solution runs offline using Spark, with future work aimed at streaming implementations for real‑time detection.
