Understanding Web Crawlers: Definitions, Types, Traffic, and Harm
This article introduces web crawlers, classifies them by technology and intent, presents statistics on crawler traffic across industries and regions, and analyzes the various harms they cause, laying the groundwork for future discussions on anti‑crawling strategies.
Personal introduction: Pan Hongmin joined Qunar in July 2015 and now works in the large‑accommodation division’s front‑end team, responsible for crawler analysis, anti‑crawling strategy development, platform construction and maintenance, as well as data analysis and cleaning.
"Let the data speak" is a common refrain from management, but the article questions whether the data actually reflects reality: as crawler traffic grows, it pollutes the statistics, which has prompted many companies to adopt anti‑crawling measures.
The author notes that existing online discussions about crawlers and anti‑crawlers are fragmented, so this three‑part series will first systematically introduce crawlers, then discuss anti‑crawling and its current state, and finally present their anti‑crawling platform.
Crawler Definition
A crawler is a program that sends requests to a website, obtains resources (pages or interface responses), and extracts useful data from them; anti‑crawling refers to the techniques a site uses to identify crawlers by their distinguishing characteristics.
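The definition above can be illustrated with a minimal sketch of the "extract useful data" step, using only the standard library. A real crawler would first fetch the HTML over HTTP (e.g., with `urllib.request`); here a canned snippet with hypothetical `class="price"` markup stands in for the response body.

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collects text inside elements whose class is 'price' (hypothetical markup)."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Enter "capture" mode when a price element opens.
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())

# Stand-in for a fetched page body; a real crawler would download this.
html = '<div><span class="price">¥388</span><span class="price">¥459</span></div>'
parser = PriceExtractor()
parser.feed(html)
print(parser.prices)  # → ['¥388', '¥459']
```

This is the whole lifecycle in miniature: request a resource, parse it, keep only the fields of interest.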
Crawler Classification and Characteristics
Different client types (PC web, mobile touch site, native app, WeChat mini‑program, React Native, etc.) give rise to different crawler types. This article focuses on PC and touch‑site crawlers; app and mini‑program crawlers warrant separate discussion.
Before AJAX, crawlers simply fetched page content; after AJAX, crawlers need JavaScript execution and page rendering capabilities, or they must locate key APIs to request data directly. AJAX thus marks a watershed in crawler evolution.
Based on whether a crawler generates a rendered page tree, three main categories are identified: pure API crawlers, page crawlers, and hybrid crawlers.
Pure API crawlers access key data interfaces or execute simple JS scripts to obtain parameters and fetch real data without rendering.
Page crawlers simulate a full browser environment, perform the interactions needed to obtain a rendered page, and then analyze it. They may use headless browsers (e.g., PhantomJS) or drive real browsers through the WebDriver protocol.
Hybrid crawlers combine the two approaches: they first use page crawling to discover parameters, then hand them to a pure API crawler for data extraction, reverting to page crawling when parameters become invalid.
Pure API crawlers are fast and cheap to run but demand strong JavaScript and anti‑crawling expertise to locate and call the right interfaces; page crawlers are slower and more expensive but find it easier to slip past anti‑crawling measures; hybrid crawlers aim to balance speed, cost, and effectiveness.
Crawlers are also classified by intent: benign crawlers (e.g., search engine bots) that increase exposure and revenue, and malicious crawlers that scrape data for competitive advantage, causing losses.
The discussion focuses on malicious crawlers, while benign crawlers are handled separately during data cleaning and identification.
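One practical marker separating benign from malicious crawlers is robots.txt compliance: search-engine bots generally honor it, while scrapers typically ignore it. The sketch below parses a canned robots.txt with hypothetical rules (no network access) using the standard library's `urllib.robotparser` and checks which paths a polite bot may fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for an example site.
rules = [
    "User-agent: *",
    "Disallow: /api/",
    "Allow: /hotels/",
]

rp = RobotFileParser()
rp.parse(rules)  # parse() accepts the file's lines directly

allowed = rp.can_fetch("ExampleBot", "https://example.com/hotels/123")
blocked = rp.can_fetch("ExampleBot", "https://example.com/api/price")
print(allowed, blocked)  # → True False
```

A benign crawler runs this check before every fetch; a malicious one simply skips it, which is one reason the two must be distinguished during data cleaning rather than trusted to self-identify.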
Crawler Traffic
Observations show sudden spikes in website traffic that exceed server capacity, indicating crawler activity. Distil Networks reported that in 2017 over 40% of internet traffic originated from crawlers, with higher percentages in gambling, aviation, and travel sectors.
Industry‑specific crawler traffic varies; the top five industries targeted by crawlers are gambling, aviation, finance, healthcare, and ticketing, while composite industry rankings list e‑commerce, healthcare, aviation, travel, and ticketing.
Geographically, the United States, China, France, Canada, and Germany lead in crawler traffic, though the figures for China may be under‑reported because of its relatively closed network environment.
Harms of Crawlers
High‑volume crawlers cause more than data pollution; their harms include:
1. Price/content competition – crawlers enable price comparison, eroding product competitiveness and causing customer loss.
2. Malicious clicks – automated clicks on ads generate direct financial loss.
3. Resource waste – excessive crawler requests force companies to over‑provision backend servers.
4. Bandwidth consumption – massive crawler traffic can degrade response speed or cause denial of service.
5. Data pollution – crawlers generate large amounts of invalid data, obscuring genuine user signals and hampering analysis.
In summary, this article provides a concise classification of crawlers, examines their characteristics, presents traffic statistics from Distil Networks, and analyzes their detrimental impacts, setting the stage for the next piece on anti‑crawling classifications and current practices.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.