
Overview of Web Crawling, Anti‑Crawling Techniques, and 58 Anti‑Crawling System

This article introduces the fundamentals of web crawlers, typical crawling methods, and a comprehensive set of anti‑crawling strategies—including IP control, browser and device simulation, CAPTCHA cracking, and traffic analysis—while detailing the architecture and capabilities of the 58 anti‑crawling platform.

58 Tech

0x00 Introduction

Web crawlers, also known as spiders or bots, are programs that simulate network protocol interactions to retrieve target data at scale and over long periods. A general-purpose crawler starts from a single seed link, continuously collecting pages and expanding to newly discovered URLs, while a focused crawler targets specific content structures.
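The seed-and-expand loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the `fetch` callable is an injectable stand-in for a real HTTP layer (e.g., `urllib` or `requests`), and the link extraction only looks at anchor tags.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl: start from one seed link and keep
    expanding to newly discovered URLs until max_pages is reached.
    `fetch` maps a URL to its HTML body, keeping the network layer
    pluggable (and stubbable for testing)."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

A focused crawler would replace `LinkExtractor` with selectors for the specific content structures it is after, rather than following every link.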

Crawlers increase server load and can expose sensitive resources such as real‑estate listings, recruitment data, or used‑car information. Exploiting business‑logic or system vulnerabilities, crawlers may also harvest user, merchant, or platform data, leading to information‑leakage incidents and legal issues.

0x01 Search Engines

Major search engines (Google, Baidu, 360, Bing) use crawlers that identify themselves via User-Agent strings, e.g., the Baidu PC UA: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html). Since UA strings can be forged, relying on them alone is insufficient; host verification and behavioral analysis are also needed.
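The host-verification step mentioned above can be sketched with a reverse DNS lookup: the PTR record of a genuine Baidu crawler IP resolves to a host under baidu.com or baidu.jp. The `resolve` parameter is injectable so the lookup can be stubbed; a production check should also forward-resolve the returned hostname and confirm it maps back to the original IP.

```python
import socket

def is_genuine_baiduspider(ip, resolve=socket.gethostbyaddr):
    """Verify a claimed Baiduspider via reverse DNS: a forged
    User-Agent cannot forge the PTR record of its source IP."""
    try:
        hostname, _, _ = resolve(ip)
    except OSError:
        return False  # no PTR record at all -> treat as not genuine
    return hostname.endswith((".baidu.com", ".baidu.jp"))
```

The same pattern works for Googlebot (hosts under googlebot.com) and other self-identifying search-engine crawlers.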

Robots Protocol

The Robots Exclusion Protocol (robots.txt) tells crawlers which pages may be accessed and which must be avoided. Although legitimate crawlers respect it, the protocol is voluntary and cannot enforce compliance.
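A well-behaved crawler can consult robots.txt with the standard library's `urllib.robotparser`. The rule set below is an assumed example; a real crawler would call `set_url()` and `read()` against the target site instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Assumed example rules; normally fetched from http://<host>/robots.txt
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("MyBot/1.0", "http://example.com/index.html"))   # True
print(rp.can_fetch("MyBot/1.0", "http://example.com/private/data"))  # False
print(rp.crawl_delay("MyBot/1.0"))  # 5
```

Note that the parser only reports what the site requests; nothing stops a crawler from ignoring the answer, which is exactly why the protocol cannot enforce compliance.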

0x02 Typical Crawling Techniques

Crawlers are produced by various actors: students and hobbyists, data‑service companies, commercial competitors, and uncontrolled bots running on compromised servers. Python is the most common language, with libraries such as Scrapy, BeautifulSoup, pyquery, and Mechanize.

Data‑service companies offer custom data sets and crawling services. Competitors may scrape each other’s platforms for competitive analysis. Uncontrolled bots may reside on cloud servers or infected machines, operating without supervision.

Setting Request Frequency

Limiting crawl frequency reduces server load, but sophisticated crawlers randomize sleep intervals to evade simple rate‑limiting.
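The evasion technique is simple to express: instead of sleeping a fixed interval between requests, a crawler adds random jitter so its cadence does not match any fixed-rate signature. A minimal sketch:

```python
import random
import time

def polite_delay(base=1.0, jitter=2.0):
    """Sleep for base plus a random jitter, so consecutive requests
    do not arrive at the fixed cadence a simple rate-limit rule
    could fingerprint. Returns the delay actually used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

This is why server-side defenses (covered in 0x03) look at windowed counts and behavioral patterns rather than exact inter-request spacing.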

Proxy IPs

Crawlers often use multi‑threaded, distributed approaches with rotating proxy IPs—free or paid—to bypass IP‑based blocks and to overcome CAPTCHA challenges by changing IP addresses.
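The rotation itself is usually a round-robin over a proxy pool. The addresses below are hypothetical placeholders; real crawlers refresh the pool from free lists or paid providers and evict dead entries.

```python
import itertools

# Hypothetical proxy pool (host:port); a real pool is refreshed
# continuously and health-checked.
PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
_rotation = itertools.cycle(PROXIES)

def next_proxy():
    """Round-robin through the pool so consecutive requests leave
    from different IPs, sidestepping per-IP blocks and resetting
    CAPTCHA triggers tied to a single address."""
    proxy = next(_rotation)
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}
```

With the `requests` library, each call would then pass `proxies=next_proxy()`; distributed crawlers shard the pool across worker threads or machines.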

Browser Spoofing

By randomizing User‑Agent strings or using full browser automation (e.g., headless Chrome, PhantomJS), crawlers can evade UA‑based detection. Some tools (Octoparse, Firefly) embed real browser engines to pass advanced checks.
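The simplest form of UA spoofing is choosing a fresh User-Agent per request from a pool of real browser strings. The strings below are illustrative examples, not an exhaustive or current list:

```python
import random

# Illustrative desktop UA strings; real pools are larger and kept
# current with actual browser releases.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def random_headers():
    """Pick a fresh User-Agent per request so UA-based blocklists
    see no stable identity."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

This only defeats UA string checks; passing deeper browser-property checks (plugins, WebGL, JS execution) is what pushes crawlers toward full browser automation such as headless Chrome.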

Device Simulation

Device fingerprints (JS‑generated or SDK‑based) uniquely identify browsers or apps. Anti‑crawling can combine IP and fingerprint data, but simulated fingerprints can also be generated to bypass checks.

CAPTCHA Cracking

CAPTCHAs are a primary barrier; attackers use manual solving, machine‑learning recognition, or third‑party solving services to bypass them.

Network Parameter Forgery

Advanced crawlers may set or forge cookies, Referer headers, and other HTTP parameters to mimic genuine traffic.

0x03 Common Anti‑Crawling Countermeasures

IP Controls

Rate‑limit per IP, with granularity for time windows, regions, page types, and protocol variations.

Browser Detection

Inspect User‑Agent, plugin list, language, WebGL, and other browser‑specific properties to differentiate real browsers from bots.

Network Parameter Checks

Validate cookies, Referer, and other headers; distinguish between WEB, APP, and mobile clients.
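A cheap first-pass header check might look like the sketch below (the heuristics and the `expected_host` parameter are illustrative assumptions). Failing requests should be routed to stricter verification such as a CAPTCHA rather than blocked outright, since legitimate first visits carry no Referer either.

```python
def headers_look_consistent(headers, expected_host):
    """Cheap sanity checks: an established browser session normally
    carries a cookie, and an in-site navigation carries an on-site
    Referer. Empty Referer is tolerated (direct visits)."""
    referer = headers.get("Referer", "")
    has_cookie = bool(headers.get("Cookie"))
    on_site_referer = referer == "" or expected_host in referer
    return has_cookie and on_site_referer
```

Distinguishing WEB, APP, and mobile clients works the same way: each channel has an expected header shape, and requests that mix channels (an app token with desktop-browser headers, say) are flagged.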

CAPTCHA Enforcement

Deploy image, sliding‑puzzle, click, SMS, or voice CAPTCHAs, possibly combined with behavioral biometrics.

Device Fingerprinting

Collect SDK‑based or JS‑based fingerprints to detect emulators, rooted devices, or repeated identifiers.

Web‑Side Techniques

Use JS obfuscation, encrypted scripts, asynchronous Ajax/Fetch calls, hidden or dummy links, CSS tricks, IFRAME loading, and dynamic HTML changes to hinder scraping.

Behavioral Analysis

Compare access patterns such as localStorage usage, request bursts, and parameter traversal to distinguish bots from human users.
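One concrete behavioral signal is the regularity of inter-request gaps: humans browse in irregular bursts, while naive bots fire at a near-constant cadence. A minimal detector over a stream of request timestamps (thresholds are illustrative assumptions, and crawlers that randomize delays defeat this particular check):

```python
from statistics import pstdev

def looks_machine_like(timestamps, min_requests=10, jitter_threshold=0.05):
    """Flag request streams whose inter-arrival times are suspiciously
    regular (population std-dev of gaps below jitter_threshold seconds).
    Short streams are never flagged."""
    if len(timestamps) < min_requests:
        return False
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return pstdev(gaps) < jitter_threshold
```

Real behavioral analysis combines many such signals (parameter traversal order, localStorage presence, burst shape) into a score rather than relying on any single one.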

API Rate Limiting

Set per‑IP or per‑fingerprint thresholds, encrypt API payloads, and embed data‑level monitoring.

Account‑Based Controls

Enforce login requirements, limit per‑account request frequency, device count, and geographic access.

Security Portraits

58’s security portrait service combines big‑data threat intelligence with risk‑control to provide pre‑alert, real‑time detection, and post‑incident forensics, integrating multiple risk tags (IP, device, account, phone).

0x04 58 Anti‑Crawling System Overview

The 58 Anti‑Crawling SCF service offers low‑cost, rapid integration, handling nearly 1 billion requests daily with a baseline throughput of ~10 k RPS and an average latency of 0.5 ms. It covers real‑estate, recruitment, classifieds, and related business lines.

Clients connect via the SCF gateway; a strategy management system configures rule sets; an analysis engine executes strategies and forwards hits to a decision engine; real‑time monitoring and a big‑data platform provide analytics.

The strategy management system enables batch automation of generic policies, while the real‑time monitoring module alerts on abnormal traffic.

Risk penalties consider dimensions such as UID, cookie, IP, and device fingerprint.

Interception methods include various CAPTCHAs, fake data responses, and interrupt pages.

0x05 Anti‑Crawling Traffic Analysis Platform

Traffic analysis based on Nginx logs identifies malicious crawlers, bots, and simulators across PC, mobile, and app channels, providing alerts, target identification, and trend monitoring. It offers heat‑maps for domains, interfaces, and business lines.
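The first cut of such log-based analysis is simply ranking sources by volume. The sketch below parses the default Nginx "combined" log format (an assumption; adjust the pattern if `log_format` was customized) and counts requests per IP; the same `Counter` approach yields the UA and URL rankings mentioned below.

```python
import re
from collections import Counter

# Default Nginx "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def top_talkers(lines, n=3):
    """Rank source IPs by request volume; lines that do not match
    the expected format are skipped."""
    ips = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            ips[m.group("ip")] += 1
    return ips.most_common(n)
```

At 58's scale this runs on the big-data platform over streamed logs rather than in a single process, but the extracted fields (IP, UA, path, status) are the same ones the heat-maps and rankings are built from.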

Further analyses include domain feature extraction, IP/UA/URL ranking, and future extensions for finer‑grained statistics and risk output.

0x06 Conclusion

This document covered crawler basics, common crawling techniques, anti‑crawling countermeasures, and an overview of 58’s anti‑crawling capabilities. Continuous innovation and close alignment with business scenarios are essential for staying ahead of evolving crawling threats.

Tags: information security, bot detection, traffic analysis, anti-crawling, web crawling
Written by 58 Tech

Official tech channel of 58, a platform for tech innovation, sharing, and communication.