Understanding Web Crawlers: Definitions, Types, Traffic, and Harm
This article introduces web crawlers, classifies them by technology and intent, presents statistics on crawler traffic across industries and regions, and analyzes the various harms they cause, laying the groundwork for future discussions on anti‑crawling strategies.
Personal introduction: Pan Hongmin joined Qunar in July 2015 and now works in the large‑accommodation division’s front‑end team, responsible for crawler analysis, anti‑crawling strategy development, platform construction and maintenance, as well as data analysis and cleaning.
"Let the data speak" is a common refrain from management, but the article questions whether the data actually reflects reality: as crawler traffic grows, it pollutes the statistics, which has prompted many companies to adopt anti‑crawling measures.
The author notes that existing online discussions about crawlers and anti‑crawlers are fragmented, so this three‑part series will first systematically introduce crawlers, then discuss anti‑crawling and its current state, and finally present their anti‑crawling platform.
Crawler Definition
A crawler is a program that sends requests to a website, obtains resources (pages or interface responses), and extracts useful data from them; anti‑crawling refers to the techniques a site uses to identify crawlers by their distinguishing characteristics.
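The definition above can be illustrated with a minimal sketch of the "extract useful data" step, using only the standard library. A real crawler would first fetch the HTML over HTTP (e.g., with `urllib.request`); here a canned snippet with hypothetical `class="price"` markup stands in for the response body.

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collects text inside elements whose class is 'price' (hypothetical markup)."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Enter "capture" mode when a price element opens.
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())

# Stand-in for a fetched page body; a real crawler would download this.
html = '<div><span class="price">¥388</span><span class="price">¥459</span></div>'
parser = PriceExtractor()
parser.feed(html)
print(parser.prices)  # → ['¥388', '¥459']
```

This is the whole lifecycle in miniature: request a resource, parse it, keep only the fields of interest.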
Crawler Classification and Characteristics
Different client types (PC web, mobile touch site, native app, WeChat mini‑program, React Native, etc.) give rise to different crawler types. This article focuses on PC and touch‑site crawlers; app and mini‑program crawlers warrant separate discussion.
Before AJAX, crawlers simply fetched page content; after AJAX, crawlers need JavaScript execution and page rendering capabilities, or they must locate key APIs to request data directly. AJAX thus marks a watershed in crawler evolution.
Based on whether a crawler generates a rendered page tree, three main categories are identified: pure API crawlers, page crawlers, and hybrid crawlers.
Pure API crawlers access key data interfaces or execute simple JS scripts to obtain parameters and fetch real data without rendering.
Page crawlers simulate a full browser environment, perform the interactions needed to obtain a rendered page, and then analyze it. They may use headless browsers (e.g., PhantomJS) or drive real browsers through the WebDriver protocol.
Hybrid crawlers combine the two approaches: they first use page crawling to discover parameters, then hand them to a pure API crawler for data extraction, reverting to page crawling when parameters become invalid.
Pure API crawlers are fast and cheap to run but demand strong JavaScript and anti‑crawling expertise to locate and call the right interfaces; page crawlers are slower and more expensive but find it easier to slip past anti‑crawling measures; hybrid crawlers aim to balance speed, cost, and effectiveness.
Crawlers are also classified by intent: benign crawlers (e.g., search engine bots) that increase exposure and revenue, and malicious crawlers that scrape data for competitive advantage, causing losses.
The discussion focuses on malicious crawlers, while benign crawlers are handled separately during data cleaning and identification.
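One practical marker separating benign from malicious crawlers is robots.txt compliance: search-engine bots generally honor it, while scrapers typically ignore it. The sketch below parses a canned robots.txt with hypothetical rules (no network access) using the standard library's `urllib.robotparser` and checks which paths a polite bot may fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for an example site.
rules = [
    "User-agent: *",
    "Disallow: /api/",
    "Allow: /hotels/",
]

rp = RobotFileParser()
rp.parse(rules)  # parse() accepts the file's lines directly

allowed = rp.can_fetch("ExampleBot", "https://example.com/hotels/123")
blocked = rp.can_fetch("ExampleBot", "https://example.com/api/price")
print(allowed, blocked)  # → True False
```

A benign crawler runs this check before every fetch; a malicious one simply skips it, which is one reason the two must be distinguished during data cleaning rather than trusted to self-identify.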
Crawler Traffic
Observations show sudden spikes in website traffic that exceed server capacity, indicating crawler activity. Distil Networks reported that in 2017 over 40% of internet traffic originated from crawlers, with higher percentages in gambling, aviation, and travel sectors.
Industry‑specific crawler traffic varies; the top five industries targeted by crawlers are gambling, aviation, finance, healthcare, and ticketing, while composite industry rankings list e‑commerce, healthcare, aviation, travel, and ticketing.
Geographically, the United States, China, France, Canada, and Germany lead in crawler traffic, though the figures for China may be under‑reported because of its relatively closed network environment.
Harms of Crawlers
High‑volume crawlers cause more than data pollution; their harms include:
1. Price/content competition – crawlers enable price comparison, eroding product competitiveness and causing customer loss.
2. Malicious clicks – automated clicks on ads generate direct financial loss.
3. Resource waste – excessive crawler requests force companies to over‑provision backend servers.
4. Bandwidth consumption – massive crawler traffic can degrade response speed or cause denial of service.
5. Data pollution – crawlers generate large amounts of invalid data, obscuring genuine user signals and hampering analysis.
In summary, this article provides a concise classification of crawlers, examines their characteristics, presents traffic statistics from Distil Networks, and analyzes their detrimental impacts, setting the stage for the next piece on anti‑crawling classifications and current practices.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.