Design and Implementation of the Cleaner Anti‑Crawler System for Real‑Time Bot Detection
This article presents the design of the Cleaner anti‑crawler system: its background, the current challenges, related work, and a system architecture made up of a Flink‑based data processing center, a strategy‑driven ban center, and a lightweight ban store. It also evaluates the system's effectiveness at real‑time bot mitigation.
Background
One day the data team reported a surprising surge in DAU. G. suspected a recent media campaign and was wondering whether the growth was genuine when a voice asked whether the "worms" (the bots) had come back.
G. straightened up, recognizing the signs of a bot attack, and declared that the platform was under assault by malicious crawlers.
Current Situation
G. explained that the system constantly faces crawler intrusion. Sometimes the defenses block the crawlers; more often, the crawlers simply pause on their own.
"They generally have two characteristics"
Regularity
High frequency
"They generally have two attack modes"
Direct assault: massive request volumes with crude disguises, focused on a single API to overwhelm it.
Spy operation: fewer but more regular requests that mimic real users to harvest core information.
"Their harms are manifested in four aspects"
Data security: stolen product and user data can lead to loss, leakage, and fraud.
Big‑data anomalies: polluted logs distort DAU, PV, and UV statistics.
Service stability: flood‑like requests increase server load and may cause crashes.
Personalization failure: polluted search queries degrade recommendation quality.
The existing anti‑crawler measures are basic, consisting of legality checks, hourly risk control, and manual bans, which are insufficient.
Related Work
To counter the identified crawler traits, a system is needed that can differentiate normal users from bots in real time without heavy reliance on the front end.
"Common backend anti‑crawler methods"
Login restriction: forces authentication, which raises the cost for bots but hurts user experience.
Cookie verification: validates server‑generated tokens; ineffective if bots steal or forge cookies.
Frequency verification: limits request rates per IP or other dimension, at the risk of false positives.
CAPTCHA verification: challenges the user once a threshold is crossed; slider CAPTCHAs are the most effective but also the least user‑friendly.
Data encryption: encrypts request payloads, but encryption logic shipped in JS can still be reverse‑engineered.
Each method has pros and cons; simply stitching them together cannot meet both accuracy and user‑experience goals.
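As one concrete illustration of the methods listed above, cookie verification can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the scheme used by Cleaner: the token layout, the `SECRET_KEY` value, and the function names `issue_token` and `verify_token` are all hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional

# Hypothetical key; a real deployment would load it from secure configuration.
SECRET_KEY = b"demo-secret"


def _sign(payload: str) -> str:
    """Return the hex HMAC-SHA256 signature of the payload."""
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(user_id: str, now: Optional[float] = None) -> str:
    """Build a cookie value of the form 'user_id:timestamp:signature'."""
    ts = int(time.time() if now is None else now)
    payload = f"{user_id}:{ts}"
    return f"{payload}:{_sign(payload)}"


def verify_token(token: str, max_age_s: int = 3600,
                 now: Optional[float] = None) -> bool:
    """Accept the token only if its signature matches and it is not too old."""
    try:
        user_id, ts, sig = token.rsplit(":", 2)
    except ValueError:
        return False  # fewer than three fields
    expected = _sign(f"{user_id}:{ts}")
    # Constant-time comparison on bytes avoids timing leaks.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    current = time.time() if now is None else now
    return 0 <= current - int(ts) <= max_age_s
```

A forged or expired token is rejected, but a token copied from a real session still verifies until it expires, which is the weakness noted in the list above.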
Cleaner Anti‑Crawler System
The envisioned system should be accurate and real‑time.
"Basic characteristics of the anti‑crawler system"
Accuracy: detect bots while avoiding false bans.
Real‑time: respond within seconds.
"Methods to counter attack modes"
Direct assault: legality checks and frequency control.
Spy operation: user behavior analysis.
The prototype, named Cleaner, consists of three modules.
4.1 System Model
The Cleaner system includes:
Data Processing Center
Ban Center
Ban Store
The modules communicate through MQ and are built on user logs, Redis, and a local cache.
4.1.1 Data Processing Center
The Data Processing Center leverages Flink for real‑time log aggregation and forwards enriched events to the Ban Center through MQ.
Each request is transformed into an EntryRequest object containing interface name, user ID, IP, version, timestamp, etc.
A RequestCollector gathers these objects into a bounded blocking queue to prevent backlog.
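The two objects described so far can be sketched as follows. This is a minimal Python illustration, not the original implementation: the field names, the default capacity, and the drop‑on‑overflow policy are assumptions, since the article only states that the queue is bounded.

```python
import queue
import time
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class EntryRequest:
    """One incoming request, reduced to the fields used for bot detection."""
    interface: str
    user_id: str
    ip: str
    version: str
    timestamp: float = field(default_factory=time.time)


class RequestCollector:
    """Buffers EntryRequest objects in a bounded queue.

    Assumption: when the queue is full the new entry is dropped, so the
    request path is never blocked by a slow consumer.
    """

    def __init__(self, capacity: int = 10_000) -> None:
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0  # number of entries discarded because the queue was full

    def offer(self, entry: EntryRequest) -> bool:
        """Try to enqueue without blocking; return False if the entry was dropped."""
        try:
            self._queue.put_nowait(entry)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_items: int) -> List[EntryRequest]:
        """Remove and return up to max_items entries, oldest first."""
        items: List[EntryRequest] = []
        while len(items) < max_items:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return items
```

Dropping on overflow trades completeness for latency; a blocking `put` with a timeout would be the alternative if losing entries is unacceptable.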
The SecurityMQPuzzler takes entries from the queue, groups them in a local cache (up to 100 k keys, one‑hour TTL), and pushes the resulting batches to MQ.
Cached keys represent dimensions to be evaluated (e.g., user ID or IP); values are ordered lists acting as sliding windows. Once a list reaches a threshold, it is sent for further analysis.
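The per‑dimension cache described above might look roughly like the sketch below. It is written in Python under several assumptions: the class name `WindowCache`, the default threshold of 50, eviction of the least recently updated key, and expiry measured from the creation of a key's window are all illustrative choices, not details taken from the article.

```python
import time
from collections import OrderedDict
from typing import Callable, List, Optional


class WindowCache:
    """Maps a dimension key (e.g. 'ip:1.2.3.4') to an ordered list of timestamps."""

    def __init__(self, threshold: int = 50, max_keys: int = 100_000,
                 ttl_s: float = 3600.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._data = OrderedDict()  # key -> (created_at, [timestamps])
        self._threshold = threshold
        self._max_keys = max_keys
        self._ttl_s = ttl_s
        self._clock = clock

    def __len__(self) -> int:
        return len(self._data)

    def add(self, key: str, timestamp: float) -> Optional[List[float]]:
        """Record one request; return the full window once it reaches the threshold."""
        now = self._clock()
        entry = self._data.get(key)
        if entry is None or now - entry[0] > self._ttl_s:
            entry = (now, [])  # new key, or the old window has expired
        created, window = entry
        window.append(timestamp)
        self._data[key] = (created, window)
        self._data.move_to_end(key)
        if len(self._data) > self._max_keys:
            self._data.popitem(last=False)  # evict the least recently updated key
        if len(window) >= self._threshold:
            del self._data[key]  # the window is handed off and restarts empty
            return window
        return None
```

A caller would invoke `add` for every dimension of every entry and publish any returned window to MQ for analysis.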
"Summary"
The Data Processing Center ensures real‑time performance by offloading heavy analytics to Flink and decoupling downstream processing via MQ.
4.1.2 Ban Center
The Ban Center focuses on accuracy, applying configurable strategies and scoring rules to incoming MQ events.
"Main strategies"
User ID legality: validates identifiers against their generation rules to block obvious bots.
Frequency: caps request rates per interface to catch high‑frequency attacks.
EL expressions: analyzes request sequences for abnormal patterns indicative of stealthy bots.
Ban records: stores banned dimensions for future reference.
Black/white lists: manual overrides for critical cases.
Integration with other ban stores: enriches decisions with external data.
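The first three strategies can be illustrated with the Python sketch below. None of it is taken from Cleaner itself: the user ID pattern is hypothetical, the frequency limits are parameters, and the behavior check uses the coefficient of variation of request gaps as a simple stand‑in for the EL‑expression analysis, whose actual rules the article does not describe.

```python
import re
import statistics
from typing import Sequence

# Hypothetical ID format; the real generation rules are not given in the article.
USER_ID_PATTERN = re.compile(r"[0-9]{6,12}")


def is_legal_user_id(user_id: str) -> bool:
    """Legality check: the ID must match the expected format exactly."""
    return USER_ID_PATTERN.fullmatch(user_id) is not None


def exceeds_frequency(timestamps: Sequence[float], window_s: float,
                      limit: int) -> bool:
    """Frequency check: more than `limit` requests in the last `window_s` seconds."""
    if not timestamps:
        return False
    latest = max(timestamps)
    recent = sum(1 for t in timestamps if latest - t <= window_s)
    return recent > limit


def looks_scripted(timestamps: Sequence[float], max_cv: float = 0.1) -> bool:
    """Behavior check: gaps between requests are nearly constant."""
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if len(gaps) < 5:
        return False  # too little evidence to judge
    mean_gap = statistics.fmean(gaps)
    if mean_gap <= 0:
        return True  # every request carries the same timestamp
    # Coefficient of variation: low values mean machine-like regularity.
    return statistics.pstdev(gaps) / mean_gap <= max_cv
```

The regularity check targets the "spy operation" mode: a crawler that stays under the frequency cap can still give itself away by requesting at fixed intervals.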
"Scoring standards"
Each strategy contributes a score; when the aggregate exceeds a dynamic threshold, the user is banned.
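A minimal sketch of this aggregation in Python is shown below. The strategy names, the per‑strategy weights, and the `Verdict` type are assumptions for illustration; the article does not say how scores are weighted or how the dynamic threshold is computed, so the threshold is simply passed in by the caller.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Verdict:
    score: float
    threshold: float
    banned: bool


def evaluate(strategy_scores: Dict[str, float],
             weights: Dict[str, float],
             threshold: float) -> Verdict:
    """Sum the weighted strategy scores and ban when the total exceeds the threshold.

    Strategies missing from `weights` default to a weight of 1.0 (an assumption).
    """
    total = sum(weights.get(name, 1.0) * score
                for name, score in strategy_scores.items())
    return Verdict(score=total, threshold=threshold, banned=total > threshold)
```

For example, `evaluate({"frequency": 40, "behavior": 30}, {"behavior": 2.0}, threshold=80)` yields a score of 100 and a ban, while the frequency score alone would stay below the threshold.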
"Summary"
The Ban Center continuously evolves its strategies to keep pace with evolving crawler tactics.
4.1.3 Ban Store
Banned dimensions are stored in Redis and synchronized to a local cache for ultra‑low‑latency checks at the gateway.
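The two‑tier lookup can be sketched as below. This Python example uses a plain dictionary as a stand‑in for Redis so that it runs without external services; the periodic pull‑based refresh, the five‑second interval, and the per‑key ban expiry are assumptions, since the article does not describe the synchronization mechanism.

```python
import time
from typing import Callable, Dict


class BanStore:
    """Answers ban checks from a local copy that is refreshed from a shared store."""

    def __init__(self, shared: Dict[str, float], refresh_s: float = 5.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._shared = shared  # stand-in for Redis: key -> ban expiry time
        self._local: Dict[str, float] = {}
        self._refresh_s = refresh_s
        self._clock = clock
        self._last_sync = float("-inf")  # forces a sync on first use

    def _sync_if_stale(self) -> None:
        """Copy unexpired bans from the shared store if the local copy is old."""
        now = self._clock()
        if now - self._last_sync >= self._refresh_s:
            self._local = {k: exp for k, exp in self._shared.items() if exp > now}
            self._last_sync = now

    def is_banned(self, key: str) -> bool:
        """Gateway-side check; touches only local memory between refreshes."""
        self._sync_if_stale()
        expiry = self._local.get(key)
        return expiry is not None and expiry > self._clock()
```

The cost of this design is a delay of up to one refresh interval before a new ban takes effect at a given gateway instance.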
4.2 Results
After deployment, multiple cleaning cycles and parameter adjustments were performed.
Metrics show that after Cleaner was enabled, the number of high‑frequency malicious IPs (more than 200 requests) dropped to zero while lower‑frequency traffic increased, which indicates that high‑frequency crawlers were being intercepted.
Conclusion
The study examined prior anti‑crawler research, identified crawler traits, and proposed a practical system featuring real‑time data processing, accurate banning strategies, and dynamic scoring.
"Anti‑crawler system"
Core traits: real‑time, accurate.
Core functions: legality checks, frequency control, behavior analysis.
Core modules: big‑data processing center, ban strategy center.
Core strategies: legality, frequency, EL expression, blacklist/whitelist.
Dynamic aspects: strategy refinement, scoring adjustment.
Balance: ongoing cat‑and‑mouse between crawlers and defenders.
The Cleaner system can ban suspicious users within 10 seconds, blocking millions of high‑risk requests daily.
Author: Gao Mengning, Backend Developer, ZhaiZhai Platform.