
Design and Features of an Anti‑Crawling Platform for Large‑Scale Services

The article describes the goals, architecture, core functions, and key characteristics of a comprehensive anti‑crawling platform that systematizes strategy management, data cleaning, monitoring, and rapid response to protect APIs and improve data reliability for large‑scale online services.

Qunar Tech Salon

Jul 27, 2018


About the author: Hongmin Pan joined Qunar in July 2015 and now works on the front‑end team of the Large Accommodation Division. Responsibilities include analyzing crawler types, developing anti‑crawling strategies, building and maintaining the platform, and analyzing and cleaning the data the platform produces.

Since anti‑crawling is a long‑term effort, systematizing and engineering the anti‑crawling process is essential; a mature platform is needed to manage strategies, match rules, and apply punishments without repeating manual steps.

Systematizing the anti‑crawling platform is a prerequisite for winning the prolonged battle against crawlers.

Anti‑Crawling Platform Goals

Before building the platform, we must clarify its objectives:

· Switch strategies on and off quickly.

· Let each business line configure which strategies it uses.

· Support customizable matching between strategies and punishment rules.

· Supply pools of strategies, rules, and punishments for selection.

· Deliver a stable anti‑crawling system.

· Include comprehensive monitoring.

· Generate valuable identification logs.

· Ensure data protection.

· Enable rapid response to crawlers.
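
The goals above imply a small data model: a pool of strategies, a rule table that matches a firing strategy to a punishment, and a switch on each strategy. The following is a minimal Python sketch under assumptions, not the platform's actual design. The class names, the request shape (a plain dict of header‑like fields), and the example strategy and punishment names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical request shape: a flat dict of header-like fields.
Request = Dict[str, str]


@dataclass
class Strategy:
    name: str
    detect: Callable[[Request], bool]  # True when the request looks like a crawler
    enabled: bool = True               # the fast on/off switch


@dataclass
class StrategyPool:
    strategies: Dict[str, Strategy] = field(default_factory=dict)

    def register(self, strategy: Strategy) -> None:
        self.strategies[strategy.name] = strategy

    def set_enabled(self, name: str, enabled: bool) -> None:
        self.strategies[name].enabled = enabled

    def first_hit(self, request: Request) -> Optional[str]:
        """Return the name of the first enabled strategy that fires, if any."""
        for strategy in self.strategies.values():
            if strategy.enabled and strategy.detect(request):
                return strategy.name
        return None


def decide(pool: StrategyPool, rules: Dict[str, str], request: Request) -> str:
    """Match the firing strategy to a punishment; unmatched hits are only logged."""
    hit = pool.first_hit(request)
    if hit is None:
        return "pass"
    return rules.get(hit, "log_only")


# Illustrative pool and rule table (all names are assumptions).
pool = StrategyPool()
pool.register(Strategy("missing_user_agent", lambda r: not r.get("user-agent")))
pool.register(Strategy("missing_token", lambda r: "token" not in r))
rules = {"missing_user_agent": "block"}

print(decide(pool, rules, {"token": "abc"}))   # the user-agent strategy fires
pool.set_enabled("missing_user_agent", False)  # switch the strategy off
print(decide(pool, rules, {"token": "abc"}))   # nothing fires any more
```

Keeping the switch on the strategy object, rather than removing the strategy from the pool, means a strategy can be turned back on without re‑registering it.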

Overall System Architecture

The platform provides detailed anti‑crawling identification logs, which serve two purposes: first, cleaning the logs to obtain reliable data that truly reflects user behavior; second, using the logs as training data for machine‑learning models to achieve automated anti‑crawling.

The platform consists of four components: a business‑line integration system, the anti‑crawling system, a management system, and a configuration‑management system. The integration system connects business lines to the platform and performs environment detection. The anti‑crawling system controls the overall workflow and handles proxy forwarding. The management system covers all administrative tasks: business‑line information, strategies, punishment rules, and punishment plans. The configuration‑management system maintains the versioned texts of strategies and punishments.
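
To make "versioned texts" concrete, here is a minimal in‑memory sketch of a store that keeps every published revision of a strategy or punishment text. It is an illustration only: the class `VersionedTextStore`, its method names, and the 1‑based version numbering are assumptions, and a real configuration‑management system would persist revisions rather than hold them in memory.

```python
from typing import Dict, List, Optional


class VersionedTextStore:
    """Keeps every published revision of a named text (strategy or punishment)."""

    def __init__(self) -> None:
        self._history: Dict[str, List[str]] = {}

    def publish(self, name: str, text: str) -> int:
        """Append a new revision and return its 1-based version number."""
        revisions = self._history.setdefault(name, [])
        revisions.append(text)
        return len(revisions)

    def get(self, name: str, version: Optional[int] = None) -> str:
        """Return a specific revision, or the latest when version is None."""
        revisions = self._history[name]
        if version is None:
            return revisions[-1]
        if not 1 <= version <= len(revisions):
            raise KeyError(f"{name} has no version {version}")
        return revisions[version - 1]

    def rollback(self, name: str, version: int) -> int:
        """Republish an older revision as the newest one, keeping history intact."""
        return self.publish(name, self.get(name, version))
```

Rolling back by republishing, instead of deleting newer revisions, keeps a full record of which text was live at each point.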

Anti‑crawling approaches are usually classified by where they run (front end, middle layer, or back end) and by when they act (real time or not). The anti‑crawling system in this platform is a middle‑layer, real‑time system: it intercepts requests between the client and the server, so identified crawler traffic is stopped before it adds load to the back‑end servers.
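
A middle‑layer interceptor can be sketched as a wrapper around the back‑end handler. This is a simplified illustration under assumptions: the request and response shapes, the `make_gateway` helper, the 403 response, and the empty‑User‑Agent check are hypothetical stand‑ins for the platform's real strategies and punishments.

```python
from typing import Callable, Dict, List, Tuple

Request = Dict[str, str]
Response = Tuple[int, str]  # (status code, body)


def make_gateway(
    is_crawler: Callable[[Request], bool],
    backend: Callable[[Request], Response],
) -> Callable[[Request], Response]:
    """Build a handler that screens each request before it reaches the backend."""

    def handle(request: Request) -> Response:
        if is_crawler(request):
            # Rejected in the middle layer: the backend never sees this request.
            return (403, "blocked")
        return backend(request)

    return handle


# Stand-in backend that records every request it actually receives.
backend_calls: List[Request] = []


def fake_backend(request: Request) -> Response:
    backend_calls.append(request)
    return (200, "hotel list")


# Hypothetical detection rule: treat an empty User-Agent as a crawler.
gateway = make_gateway(lambda r: r.get("user-agent", "") == "", fake_backend)
```

Because the check runs before `backend` is called, every blocked request is one the back end does not have to serve.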

If the platform is likened to an anti‑missile system, the configuration‑management system manages the missiles, the anti‑crawling system acts as the launch system, the management system installs the missiles, and the integration system defines the protected area and performs external reconnaissance. Together they form a complete anti‑crawling solution.

Platform Functions

The platform protects interfaces with the following functions:

1. Crawler Identification: The basic function of any anti‑crawling system. Strategies stored in a strategy pool enable targeted detection of different kinds of crawlers.

2. Proxy Forwarding: Requests that pass the anti‑crawling strategies are forwarded to the real backend servers.

3. Data Protection: Returned data is obfuscated and encrypted, preventing crawlers from directly accessing raw interface data.

4. Rapid Strategy Updates: Strategies can be quickly replaced when compromised, providing a modular defensive wall that can be swiftly reconfigured.

5. Beta‑Environment Customization: Supports multiple beta environments with customizable proxy targets to meet diverse business line testing needs.

6. Online Maintenance of Basic Information: The management system maintains the basic interface information for each business line.

7. Data Cleaning: Uses anti‑crawling logs to clean business logs, distinguishing normal users, benign crawlers, and malicious crawlers.
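
The data‑cleaning function in item 7 amounts to joining business logs with the anti‑crawling identification logs. The sketch below assumes, hypothetically, that both logs share a `request_id` field and that the identification log has already been reduced to one verdict per request (`normal`, `benign_crawler`, or `malicious_crawler`). The article does not describe the real log schema.

```python
from typing import Dict, List

LogEntry = Dict[str, str]


def label_business_logs(
    business_logs: List[LogEntry],
    verdicts: Dict[str, str],
) -> List[LogEntry]:
    """Attach the anti-crawling verdict to each business log entry.

    Requests that never appear in the identification log are marked
    "unlabeled" instead of being assumed to be normal users.
    """
    labeled = []
    for entry in business_logs:
        label = verdicts.get(entry["request_id"], "unlabeled")
        labeled.append({**entry, "label": label})
    return labeled


def keep_normal_users(labeled_logs: List[LogEntry]) -> List[LogEntry]:
    """Keep only the entries that reflect real user behavior."""
    return [entry for entry in labeled_logs if entry["label"] == "normal"]


# Illustrative data (all values are made up).
business_logs = [
    {"request_id": "r1", "path": "/hotel/list"},
    {"request_id": "r2", "path": "/hotel/detail"},
    {"request_id": "r3", "path": "/hotel/list"},
]
verdicts = {"r1": "normal", "r2": "malicious_crawler"}

labeled = label_business_logs(business_logs, verdicts)
clean = keep_normal_users(labeled)
```

Labeling first and filtering second keeps the crawler entries available, for example as training data, while the filtered view serves business analysis.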

Platform Characteristics

Key characteristics include:

1. Low‑Cost Business Integration: Designed for ease of use, minimizing integration effort for business teams.

2. Customizable Anti‑Crawling Solutions: Allows business lines to define private anti‑crawling strategies alongside public ones.

3. Fast Strategy On/Off: Provides a framework for rapid replacement of ineffective strategies.

4. Test‑Environment Customization: Offers a UI‑driven, simple, and quick way to customize testing environments without adding burden.
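
The customization points above (private strategies alongside public ones, and per‑environment proxy targets) can be pictured as a small per‑business‑line configuration. This is a sketch under assumptions: the `BusinessLineConfig` fields, the `"prod"` environment name, and the `.example` host names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BusinessLineConfig:
    name: str
    production_target: str
    private_strategies: List[str] = field(default_factory=list)
    beta_targets: Dict[str, str] = field(default_factory=dict)  # env name -> proxy target


def effective_strategies(public: List[str], config: BusinessLineConfig) -> List[str]:
    """Public strategies first, then the line's private ones, without duplicates."""
    seen = set()
    result = []
    for name in public + config.private_strategies:
        if name not in seen:
            seen.add(name)
            result.append(name)
    return result


def resolve_target(config: BusinessLineConfig, environment: str) -> str:
    """Pick the proxy target for an environment; unknown beta names raise KeyError."""
    if environment == "prod":
        return config.production_target
    return config.beta_targets[environment]


# Illustrative configuration (all names and hosts are made up).
hotel = BusinessLineConfig(
    name="hotel",
    production_target="http://hotel.example",
    private_strategies=["hotel_price_probe", "missing_token"],
    beta_targets={"beta-a": "http://hotel-beta-a.example"},
)
public_strategies = ["missing_user_agent", "missing_token"]
```

Raising an error for an unknown beta environment, rather than falling back to production, avoids sending test traffic to live servers by accident.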

Conclusion

The platform currently supports PC and H5 page integration, with future plans for client‑side pages and WeChat mini‑programs. It not only blocks crawlers but also cleans user data, delivering reliable, authentic data to business lines.

As the saying goes, "plan before action": winning the anti‑crawling war requires a well‑built platform. Anticraw is a complete, feature‑rich anti‑crawling platform that supports multiple business lines, reduces manual effort, reacts quickly to new or updated crawlers, enables fast strategy updates, and applies multiple strategies at the same time so that no single strategy becomes a single point of failure.
