Evolution of Zhihu’s Anti‑Cheat System “Wukong”: Architecture, Strategies, and Lessons Learned
This article chronicles the three‑generation evolution of Zhihu’s anti‑cheat platform Wukong, detailing its business context, spam taxonomy, multi‑layered control methods, architectural redesigns, strategy language improvements, graph‑based risk analysis, and the continuous integration of big‑data and machine‑learning techniques to combat content and behavior spam.
Cheating is a pervasive problem in online services, and effective anti‑cheat systems must both detect and handle malicious content; this article introduces the evolution of Zhihu’s anti‑cheat platform, code‑named “Wukong”, and examines its architectural design and lessons learned.
By May 2018, Zhihu had 160 million registered users, and Wukong's scope had expanded from pure content-spam control to a broader set of risks (including behavior spam and transaction fraud), covering ten business lines and roughly one hundred features.
Spam is categorized into content spam (traffic‑driving, brand‑promoting, fraudulent, and harassment posts) and behavior spam (likes, follows, thanks, shares, and view‑inflation). The governance approach emphasizes agile, continuous risk discovery and a multi‑dimensional, layered defense.
Three control methods are employed: strategy anti‑cheat (simple, fast rules for head‑line spam), product anti‑cheat (product‑level changes to reduce risk and align user‑spammer incentives), and model anti‑cheat (machine‑learning models with optional human review). Controls are applied in three stages: pre‑risk (education, policy blacklists, early interception), during‑risk (mid‑stage detection and mitigation), and post‑risk (offline analysis, feedback loops, and rule optimization).
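The three-stage layering can be sketched as a simple dispatcher that stops an event at the earliest stage whose checks flag it. Everything below (the class name, the example blacklist check) is illustrative, not Wukong's actual code:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the pre/mid/post staging: each stage holds a list of
# checks, and an event is stopped at the earliest stage that flags it.
@dataclass
class LayeredControl:
    pre: list = field(default_factory=list)    # synchronous, low-latency (blacklists, keywords)
    mid: list = field(default_factory=list)    # asynchronous detection during the risk window
    post: list = field(default_factory=list)   # offline analysis feeding back into rules

    def evaluate(self, event: dict) -> str:
        for stage_name, checks in (("pre", self.pre), ("mid", self.mid), ("post", self.post)):
            if any(check(event) for check in checks):
                return stage_name              # the stage that caught the event
        return "pass"

controls = LayeredControl(pre=[lambda e: e.get("ip") in {"1.2.3.4"}])
print(controls.evaluate({"ip": "1.2.3.4"}))    # -> pre
```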
Wukong V1 consisted of a pre‑module (synchronous, short‑latency checks using keyword and blacklist filters) and a mid‑module (asynchronous parsing and a series of checkers). The event model records user ID, timestamp, environment (UserAgent, IP, DeviceID, Referrer), target ID, and action type. The strategy engine is horizontally extensible (adding new dimensions such as device or IP) and vertically extensible (time‑based back‑tracking). An example policy is:
action == "ANSWER_CREATE" and len(user(same_register_ip(register_time(same_topics(same_type_events(10)), register_interval=3600)))) >= 3

This rule flags answers created within ten minutes on the same topic when at least three users share the same registration IP and were registered within one hour.
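The event model and the example rule can be restated in code. The field names and the helper's signature below are assumptions for illustration, not Wukong's actual schema:

```python
from dataclasses import dataclass

# Hypothetical reconstruction of the V1 event model described above.
@dataclass
class Event:
    user_id: str
    timestamp: int          # epoch seconds
    action: str             # e.g. "ANSWER_CREATE"
    target_id: str          # the object acted on (answer/question/topic)
    user_agent: str = ""
    ip: str = ""
    device_id: str = ""
    referrer: str = ""

def rule_hits(event, recent_events, registrations, window=600, register_interval=3600):
    """Flag an ANSWER_CREATE event when >= 3 distinct users, registered from the
    same IP within `register_interval` seconds of each other, hit the same
    target within `window` seconds. `registrations` maps user_id to
    (register_ip, register_time)."""
    if event.action != "ANSWER_CREATE":
        return False
    same = [e for e in recent_events
            if e.action == event.action
            and e.target_id == event.target_id
            and event.timestamp - e.timestamp <= window]
    reg_ip, reg_time = registrations[event.user_id]
    users = {e.user_id for e in same
             if registrations[e.user_id][0] == reg_ip
             and abs(registrations[e.user_id][1] - reg_time) <= register_interval}
    return len(users) >= 3
```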
MongoDB was chosen for the event store (simple schema, read‑heavy workload) and Redis for caching high‑frequency RPC data.
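The caching side of this follows the standard cache-aside pattern; in the sketch below a TTL-stamped dict stands in for Redis, and the key and fetch function are hypothetical. In production the same get/set pattern maps onto Redis GET/SETEX:

```python
import time

# Cache-aside sketch for high-frequency RPC lookups. A dict with per-key TTLs
# stands in for Redis here.
class TTLCache:
    def __init__(self, ttl=60):
        self.ttl = ttl
        self.store = {}  # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        now = time.time()
        hit = self.store.get(key)
        if hit and hit[0] > now:
            return hit[1]                       # fresh cache hit
        value = fetch(key)                      # the expensive RPC call
        self.store[key] = (now + self.ttl, value)
        return value

calls = []
cache = TTLCache(ttl=60)
fetch = lambda k: calls.append(k) or f"profile:{k}"
cache.get_or_fetch("user42", fetch)
cache.get_or_fetch("user42", fetch)             # served from cache; RPC called once
```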
Wukong V2 addressed steep policy‑authoring learning curves and long deployment cycles by introducing self‑service configuration and a functional‑style policy language with operators such as filter, mapper, and reducer. The optimized policy example reads:
action == "ANSWER_CREATE" and same_type_events(10).filter('same_register_ip').filter('register_time', register_interval=3600).filter('same_topics', mtype='AcceptID').mapper('userid').reducer('diff').length >= 3

The new language improves readability and extensibility, and aligns with both rule‑based and ML‑based detection. A visual architecture diagram (see image) illustrates the revamped pipeline.
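A toy interpretation of the chain makes the operator semantics concrete. Only the operator names (filter/mapper/reducer) come from the article; the predicate registry, field names, and the EventSet/Column classes are invented for illustration:

```python
# Toy interpretation of the V2 functional policy chain.
class Column:
    def __init__(self, values):
        self.values = values

    def reducer(self, name):
        if name == "diff":                     # treat "diff" as "distinct values"
            return Column(list(set(self.values)))
        raise ValueError(f"unknown reducer: {name}")

    @property
    def length(self):
        return len(self.values)

class EventSet:
    def __init__(self, events, anchor):
        self.events = events                   # candidate events (dicts)
        self.anchor = anchor                   # the event under evaluation

    def filter(self, name, **kw):
        preds = {
            "same_register_ip": lambda e: e["register_ip"] == self.anchor["register_ip"],
            "register_time": lambda e: abs(e["register_time"] - self.anchor["register_time"])
                                       <= kw.get("register_interval", 3600),
            "same_topics": lambda e: e["topic"] == self.anchor["topic"],
        }
        return EventSet([e for e in self.events if preds[name](e)], self.anchor)

    def mapper(self, field):
        return Column([e[field] for e in self.events])

anchor = {"userid": "u1", "register_ip": "9.9.9.9", "register_time": 1000, "topic": "t"}
events = [anchor,
          {"userid": "u2", "register_ip": "9.9.9.9", "register_time": 1500, "topic": "t"},
          {"userid": "u3", "register_ip": "9.9.9.9", "register_time": 2000, "topic": "t"}]
hit = (EventSet(events, anchor)
       .filter("same_register_ip")
       .filter("register_time", register_interval=3600)
       .filter("same_topics")
       .mapper("userid")
       .reducer("diff")
       .length >= 3)                            # mirrors the example policy
```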
Strategy deployment was streamlined into five steps: create → test → trial‑run → launch → monitor, with trial‑runs replaying recent data in an isolated environment and monitoring providing metrics such as execution time, error count, and hit volume.
Wukong V3 introduced a Gateway component that intercepts risky requests at the Nginx layer, stores state in Redis, and exposes RPC interfaces for user, IP, and device status updates. Parallelization was enhanced by moving from per‑event to per‑policy dispatch, employing a three‑level queue (event queue → strategy worker → processing queue) to improve throughput.
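The per-policy dispatch might look like the sketch below, with in-process queues and made-up policies standing in for the real workers; fanning each event out to one task per policy keeps a slow policy from serializing the whole event:

```python
import queue

# Sketch of per-policy dispatch across the three queue levels described above.
# The policies themselves are stand-ins.
policies = {
    "ip_burst": lambda e: e["ip_count"] > 100,
    "new_account": lambda e: e["account_age"] < 3600,
}

event_queue = queue.Queue()       # level 1: raw events
strategy_queue = queue.Queue()    # level 2: (policy, event) tasks
processing_queue = queue.Queue()  # level 3: hits awaiting handling

event_queue.put({"ip_count": 500, "account_age": 86400})

while not event_queue.empty():            # strategy worker: fan out per policy
    ev = event_queue.get()
    for name in policies:
        strategy_queue.put((name, ev))

while not strategy_queue.empty():         # evaluate each (policy, event) pair
    name, ev = strategy_queue.get()
    if policies[name](ev):
        processing_queue.put((name, ev))

hits = []
while not processing_queue.empty():
    hits.append(processing_queue.get()[0])
# hits == ["ip_burst"]
```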
Cache layers were expanded (MongoDB → Redis → local cache) and image‑based spam detection (OCR, ad, porn, illegal, political) was added via third‑party models. Risk data accumulation now spans content, account, IP, and device dimensions, sourced from strategies, third‑party APIs, and manual labeling.
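The expanded cache layers form a read-through hierarchy; in this sketch plain dicts stand in for the local cache, Redis, and MongoDB, and a hit at a lower layer is promoted upward:

```python
# Multi-level read path sketch: local in-process cache, then a shared cache
# (Redis in the article), then the backing event store (MongoDB).
class TieredStore:
    def __init__(self, backing):
        self.local = {}
        self.shared = {}
        self.backing = backing                     # stands in for MongoDB

    def get(self, key):
        if key in self.local:
            return self.local[key]                 # fastest layer
        if key in self.shared:
            self.local[key] = self.shared[key]     # promote to local
            return self.shared[key]
        value = self.backing.get(key)              # slowest layer
        if value is not None:
            self.shared[key] = value               # populate both caches
            self.local[key] = value
        return value

store = TieredStore({"e1": {"action": "ANSWER_CREATE"}})
store.get("e1")                                    # first read fills both caches
```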
Back‑tracking capabilities were strengthened using cosine‑similarity and Jaccard metrics on Redis‑cached keywords, tags, and community signals, enabling rapid grouping of similar spam.
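Both similarity measures are standard; minimal versions over keyword sets (Jaccard) and term-frequency vectors (cosine) look like this:

```python
import math
from collections import Counter

# Minimal versions of the two similarity measures used for back-tracking.
def jaccard(a: set, b: set) -> float:
    """Intersection over union of two keyword sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

jaccard({"loan", "wechat", "fast"}, {"loan", "wechat", "cheap"})  # -> 0.5
```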
The “Zhihu Network Analysis Platform (ZNAP)” leverages a graph model (TinkerPop on HBase) to represent users, devices, IPs, and objects as nodes and interactions as edges. It supports community detection (modularity + fast‑unfolding) and community classification (logistic regression) to surface suspicious clusters.
Additional techniques include Spark‑based text similarity clustering (Jaccard, SimHash), unknown‑word detection via left/right entropy and mutual information for brand‑spam, flow‑word detection using a BiLSTM‑CRF model (≈97% accuracy), and a deep‑learning spam classifier built with dilated convolutions and attention (≈95.6% accuracy).
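Of the text-similarity techniques named above, SimHash is easy to sketch end to end. The version below uses 64 bits, MD5 as the per-token hash, and uniform token weights, all simplifying assumptions; near-duplicate texts typically land within a small Hamming distance of each other:

```python
import hashlib

# Minimal 64-bit SimHash: each token votes +1/-1 per bit position, and the
# fingerprint keeps the sign of each bit's total.
def simhash(tokens, bits=64):
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

a = simhash("fast loan add my wechat now".split())
b = simhash("fast loan add my wechat today".split())
c = simhash("how do transformers use attention".split())
# near-duplicates like a and b usually sit closer than unrelated texts like a and c
```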
The final architecture diagram (see image) shows the end‑to‑end pipeline: Gateway → Business Layer → Data Ingestion (RPC/Kafka) → Strategy Decision (pre, mid, post) → Data Storage (event store, risk store, HDFS) → Data Computation (offline ML) → Data Services (external data, model APIs). The system forms a closed loop of model‑strategy‑decision‑control‑evaluation‑improvement, with ongoing enhancements planned.
References include works from Facebook, Twitter, and industry case studies, as well as Zhihu’s own technical blog posts on entropy‑based new‑word mining, cache optimization, and Spark clustering.
High Availability Architecture
Official account for High Availability Architecture.