Design and Implementation of the Cleaner Anti‑Crawler System for Real‑Time Threat Mitigation
This article presents the design of Cleaner, an anti‑crawler system. It covers the background, the current challenges, common anti‑crawler methods, the system architecture (a data processing center, a ban center, and a ban store built on Flink, MQ, and Redis), and an evaluation of its effectiveness. It closes with lessons on combining real‑time response with accuracy when protecting platform data.
1. Background
One day the data team reported a sudden surge in DAU and suspected that a recent media campaign was the cause. After inspecting the data, Mr. G agreed that the campaign could explain the growth, but he also saw signs of malicious bots.
2. Current Situation
Mr. G explained that the platform is under constant crawler attack. Existing defenses sometimes block the crawlers outright, but more often the crawlers are merely paused for a while.
"They generally have two characteristics"
Regularity
High frequency
"They generally have two attack modes"
Direct attack : massive requests, rough disguises, focusing on a single interface.
Spy operation : fewer but more regular requests that mimic real users to obtain core interface information.
"Their harms are reflected in four aspects"
Data security : stolen product and user information may lead to loss, leakage, or fraud.
Big‑data anomalies : polluted logs degrade the reliability of DAU, PV, UV statistics.
Service stability : massive requests can overload servers and cause system collapse.
Personalization failure : polluted search queries degrade recommendation accuracy.
The existing anti‑crawler measures (basic legality checks, hourly risk control, manual bans) are insufficient.
3. Prior Research
To counter the identified crawler characteristics, a system is needed that can distinguish normal users from crawlers in real time without heavy reliance on the front end.
"Common backend anti‑crawler methods"
Login restriction : forces login, raising crawler cost but hurting user experience.
Cookie verification : validates identity data in cookies, but can be bypassed if crawlers steal or forge cookies.
Frequency verification : limits request rates per IP or other dimensions; may cause false positives.
CAPTCHA verification : triggers visual or sliding CAPTCHAs after thresholds; effective but requires front‑end cooperation.
Data encryption : encrypts request payloads on the client; still vulnerable if encryption logic is exposed.
Each method has strengths and drawbacks; a single or naïvely combined approach cannot meet both accuracy and user‑experience requirements.
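To make the frequency‑verification idea concrete, here is a minimal sketch of a per‑key sliding‑window limiter. The class name `SlidingWindowLimiter` and the limit and window values are illustrative assumptions, not something the article specifies.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key in the last `window_s` seconds.

    Hypothetical helper for illustration; the key can be an IP, a user ID,
    or any other dimension.
    """

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: reject without recording the hit
        hits.append(now)
        return True
```

The false‑positive risk mentioned above shows up directly here: if the key is an IP address, many real users behind one shared address count against the same limit.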
4. Cleaner Anti‑Crawler System
"Basic characteristics of the anti‑crawler system"
Accuracy : detect crawlers while avoiding false bans.
Real‑time : respond within seconds before crawlers harvest data.
"Methods aligned with attack modes"
Direct attack : legality checks, frequency control.
Spy operation : user‑behavior analysis.
The prototype, named Cleaner, consists of three modules.
4.1 System Model
The system includes a Data Processing Center, a Ban Center, and a Ban Store, connected via MQ. User logs, Redis, and a local cache support the functionality.
4.1.1 Data Processing Center
Real‑time processing of massive logs is achieved with Flink. Logs are transformed into EntryRequest objects containing interface name, user ID, IP, version, timestamp, etc.
A bounded blocking queue collects these objects; a SecurityMQPuzzler extracts them, aggregates them in a local cache (max 100k keys, 1‑hour TTL), and pushes aggregated batches to MQ when a threshold is reached.
This design ensures real‑time responsiveness while preventing data back‑pressure.
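The queue‑then‑aggregate flow can be sketched as follows. This is a simplified illustration, not the actual SecurityMQPuzzler: the `Aggregator` and `drain` names, the `publish` callback standing in for the MQ producer, the choice of user ID as the cache key, and the batch size are all assumptions. Only the cache limits (100k keys, 1‑hour TTL) come from the text.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class EntryRequest:
    interface: str
    user_id: str
    ip: str
    timestamp: float


class Aggregator:
    """Groups requests per user and publishes a batch once it is large enough."""

    def __init__(self, publish, batch_size=3, max_keys=100_000, ttl_s=3600.0):
        self.publish = publish        # stand-in for the MQ producer (assumption)
        self.batch_size = batch_size  # flush threshold (assumed value)
        self.max_keys = max_keys
        self.ttl_s = ttl_s
        self._cache = {}              # user_id -> (first_seen, [EntryRequest, ...])

    def _evict(self, now):
        # Dicts keep insertion order, so the first key is the oldest one.
        # Assumes timestamps arrive roughly in order.
        while self._cache:
            oldest = next(iter(self._cache))
            first_seen, _ = self._cache[oldest]
            if now - first_seen < self.ttl_s and len(self._cache) < self.max_keys:
                break
            del self._cache[oldest]

    def add(self, req):
        if req.user_id not in self._cache:
            self._evict(req.timestamp)  # make room before adding a new key
            self._cache[req.user_id] = (req.timestamp, [])
        _, batch = self._cache[req.user_id]
        batch.append(req)
        if len(batch) >= self.batch_size:
            del self._cache[req.user_id]
            self.publish(req.user_id, batch)


def drain(q, agg):
    """Move everything currently in the bounded queue into the aggregator."""
    while True:
        try:
            req = q.get_nowait()
        except queue.Empty:
            return
        agg.add(req)
```

In this sketch a `queue.Queue(maxsize=...)` plays the role of the bounded blocking queue: producers block, or fail fast with `put_nowait`, when the consumer falls behind, so memory use stays capped.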
4.1.2 Ban Center
The Ban Center focuses on accuracy. It consumes messages from MQ, checks a switch that allows an emergency shutdown, and evaluates a set of strategies, each of which assigns a score. If the total score exceeds the penalty threshold, the user is banned.
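A minimal sketch of this score‑and‑threshold decision is shown below. The two strategy functions, their scores, the `stats` dictionary layout, and the threshold of 80 are hypothetical; the article does not publish the real rules or values.

```python
from dataclasses import dataclass
from typing import Callable, List

# A strategy inspects aggregated stats for one user and returns a penalty score.
Strategy = Callable[[dict], int]


def malformed_user_id(stats: dict) -> int:
    # Hypothetical legality rule: valid user IDs are assumed to be all digits.
    return 0 if str(stats["user_id"]).isdigit() else 60


def high_frequency(stats: dict) -> int:
    # Hypothetical frequency rule: more than 100 requests in the batch is suspicious.
    return 50 if stats["request_count"] > 100 else 0


@dataclass
class BanCenter:
    strategies: List[Strategy]
    threshold: int
    enabled: bool = True  # emergency switch: when off, nobody is banned

    def score(self, stats: dict) -> int:
        return sum(strategy(stats) for strategy in self.strategies)

    def should_ban(self, stats: dict) -> bool:
        if not self.enabled:
            return False
        return self.score(stats) > self.threshold
```

Summing scores means no single weak signal triggers a ban on its own, which is one way to trade a little recall for fewer false bans.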
"Key strategies"
User‑ID legality : validates generation rules to block obvious bots.
Frequency : limits request counts per interface to catch high‑frequency attacks.
EL expression : analyzes request sequences for abnormal patterns typical of spy bots.
Ban records : stores banned dimensions for future reference.
Black/white list : manual overrides for critical cases.
Integration with other ban stores : enriches decision making.
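The article does not show how the sequence rules are written. As one hedged illustration of the kind of signal a behavior rule might look for, the sketch below flags request streams whose timing is suspiciously even, using the coefficient of variation of the gaps between requests. This is a swapped‑in technique, not Cleaner's EL expression strategy; the function names and the `max_cv` cutoff are assumptions.

```python
from statistics import mean, pstdev


def interval_cv(timestamps):
    """Coefficient of variation of the gaps between consecutive requests.

    Returns None when there are too few requests to judge.
    """
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2:
        return None
    avg = mean(gaps)
    if avg <= 0:
        return None
    return pstdev(gaps) / avg


def looks_scripted(timestamps, max_cv=0.1):
    # Near-constant spacing gives a CV close to zero; human traffic is burstier.
    cv = interval_cv(timestamps)
    return cv is not None and cv < max_cv
```

A check like this would only contribute a score, since some legitimate clients (for example, polling pages) also produce evenly spaced requests.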
"Scoring standard"
Scores are tuned based on operational experience, and the ban threshold is adjusted iteratively toward its best value.
4.1.3 Ban Store
Banned user dimensions are stored in Redis and mirrored to a local cache for ultra‑low‑latency checks at the gateway.
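The two‑tier lookup can be sketched as below. To keep the example self‑contained, a plain Python set stands in for Redis, and the local copy is refreshed by a short TTL. How Cleaner actually keeps the local cache in sync is not described, so the refresh policy, the `BanStore` name, and the key format are assumptions.

```python
import time


class BanStore:
    """Ban lookups served from a local cache, backed by a shared remote store."""

    def __init__(self, remote: set, local_ttl_s: float = 5.0, clock=time.monotonic):
        self.remote = remote            # stand-in for Redis (assumption)
        self.local_ttl_s = local_ttl_s  # how long a local answer is trusted
        self.clock = clock
        self._local = {}                # key -> (banned, cached_at)

    def ban(self, key: str) -> None:
        self.remote.add(key)
        self._local[key] = (True, self.clock())

    def is_banned(self, key: str) -> bool:
        now = self.clock()
        cached = self._local.get(key)
        if cached is not None and now - cached[1] < self.local_ttl_s:
            return cached[0]            # fast path: no remote round trip
        banned = key in self.remote     # slow path: ask the shared store
        self._local[key] = (banned, now)
        return banned
```

With a TTL‑based refresh, a ban written by another node becomes visible locally after at most `local_ttl_s` seconds, so the TTL bounds how stale a gateway's view can be.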
4.2 Effectiveness
After multiple deployment cycles and parameter adjustments, the system showed clear impact. The chart below compares IP request distributions before and after enabling Cleaner.
When Cleaner was off, many IPs made >200 requests per day. After activation, high‑frequency IPs dropped to zero, while lower‑frequency IPs increased, indicating successful early interception.
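A distribution like the one described can be computed from one day of access logs by counting requests per IP and grouping the counts into buckets. The sketch below is one way to do that; the bucket edges other than 200 and the function names are assumptions.

```python
from bisect import bisect_left
from collections import Counter


def bucket_label(count, edges):
    """Name the bucket a per-IP daily request count falls into."""
    i = bisect_left(edges, count)  # number of edges strictly below count
    if i == 0:
        return f"1-{edges[0]}"
    if i == len(edges):
        return f">{edges[-1]}"
    return f"{edges[i - 1] + 1}-{edges[i]}"


def ip_request_histogram(ips, edges=(50, 100, 200)):
    """Map each bucket label to the number of IPs whose daily count is in it.

    `ips` holds one IP string per logged request.
    """
    per_ip = Counter(ips)
    return Counter(bucket_label(c, edges) for c in per_ip.values())
```

Running this on logs from before and after enabling the system gives two histograms that can be compared bucket by bucket.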
5. Summary
The study examined prior anti‑crawler research, identified crawler traits, and designed a practical system with real‑time accuracy. Core features include legality checks, frequency control, and behavior analysis, implemented via a Flink‑based data hub, MQ‑decoupled ban center, and lightweight Redis‑backed ban store.
"Anti‑crawler system"
Basic traits : real‑time, accurate.
Basic functions : legality verification, frequency control, user‑behavior analysis.
Core modules : big‑data processing center, ban strategy center.
Core strategies : legality, frequency, EL expression, black/white list.
Dynamics : evolving strategies and scoring.
Balance : ongoing cat‑and‑mouse game between crawlers and defenders.
Cleaner can ban suspicious users within 10 seconds, blocking millions of high‑risk requests daily and continuously safeguarding platform data.
Author : Gao Mengning, Backend Development, Zhuanzhuan Platform.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.