Why 95% of Web Traffic Is Bots: Inside the Crawling Arms Race

The article explores the hidden, high‑traffic world of web crawlers and anti‑crawling measures, revealing why most online requests are bots, how companies decide to crawl or block, the technical and organizational challenges involved, and what the future may hold for this perpetual cat‑and‑mouse game.

21CTO

Jun 24, 2017

Preface

Crawling and anti‑crawling are often described as an unsavory, underground industry where companies rarely admit to having crawler teams, and engineers struggle to turn their experience into impressive resumes.

Despite its reputation, the industry thrives because businesses need data, whether for price comparison, market analysis, or other strategic purposes.

1. Current State of Crawling and Anti‑Crawling

In e‑commerce, crawlers originally served price‑comparison tools, but the practice quickly turned hostile as competitors used crawlers to steal pricing data, prompting a rapid escalation of anti‑crawling defenses.

Statistics suggest that over 50% of internet traffic is generated by bots, and on heavily targeted pages the share is far higher: a typical page with 12,000 requests may have only about 500 from genuine users, which puts the bot share at roughly 96% (11,500 of 12,000 requests).
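The arithmetic behind that figure can be checked with a minimal sketch. The function name `bot_share` is an illustrative assumption, and the inputs are the 12,000 requests and 500 genuine users quoted above.

```python
def bot_share(total_requests: int, human_requests: int) -> float:
    """Return the fraction of requests that did not come from genuine users."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= human_requests <= total_requests:
        raise ValueError("human_requests must be between 0 and total_requests")
    return (total_requests - human_requests) / total_requests


# Figures taken from the paragraph above.
share = bot_share(12_000, 500)
print(f"bot share: {share:.1%}")  # 11,500 of 12,000 requests

# The same numbers expressed as a ratio of bot requests to human requests.
print(f"bot requests per human request: {(12_000 - 500) / 500:.0f}")
```

Note that the share of traffic (about 96%) and the ratio of bot to human requests (23 to 1) are different quantities. The percentage in the text is the former.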

These massive crawling volumes often stem from poor decision‑making rather than genuine business need.

2. Technical Status

Python is the dominant language for crawler scripts, but it is less suited for anti‑crawling logic, which frequently relies on JavaScript. Nevertheless, Python’s glue‑language nature makes it useful for rapid rewrites when anti‑crawling measures change.

Crawler engineers frequently switch tools (e.g., Selenium or other headless‑browser setups) to bypass blocks, leading to a constant arms race that consumes developer time and hampers career growth.

Anti‑crawling teams face high false‑positive rates when blocking IPs, as IPs are shared, proxied, or dynamically reassigned, making IP bans ineffective and often harmful to legitimate users.
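To illustrate why IP‑based blocking misfires, here is a minimal sketch of a per‑IP sliding‑window limiter. The class `IpRateLimiter`, its thresholds, and the example address are all assumptions made for illustration, not a description of any particular company's system.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Reject requests from an IP once it exceeds max_requests per window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = IpRateLimiter(max_requests=5, window_seconds=60)

# Ten different people behind one shared (NAT or proxy) address each load
# a single page within the same minute. The address is from a range
# reserved for documentation and is purely illustrative.
shared_ip = "203.0.113.7"
results = [limiter.allow(shared_ip, now=float(t)) for t in range(10)]
blocked_humans = results.count(False)
print(f"legitimate requests rejected: {blocked_humans}")
```

None of the ten visitors behaved like a crawler, yet half of them are rejected because the limiter can only see the shared address. A crawler rotating through proxies, by contrast, would stay under the threshold on every address.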

Common tactics include rendering critical data as images, but OCR and machine‑learning advances have rendered this approach less effective.

Frequent releases are used to stay ahead of attackers, but each release raises the risk of bugs and accidental “mis‑hits.”

3. Tactics and Evolution

Front‑end engineers become the de‑facto defenders when back‑end solutions fail, leveraging JavaScript quirks, Node.js features, and complex browser‑specific behaviors to increase the difficulty of automated scraping.

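Advanced anti‑scraping methods such as canvas fingerprinting are discussed, but their effectiveness is limited in homogeneous hardware environments: devices that render a canvas identically produce identical fingerprints, so real users and crawlers on the same kind of device cannot be told apart.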
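The collision problem can be sketched as a simulation. Real canvas fingerprinting runs in the browser and hashes the pixels a device actually draws. The Python below only stands in for that step by hashing a description of the rendering stack, and every name and value in it (`simulated_canvas_fingerprint`, the device tuples) is hypothetical.

```python
import hashlib
from collections import Counter


def simulated_canvas_fingerprint(gpu: str, driver: str, os_name: str, browser: str) -> str:
    """Stand-in for hashing canvas pixels: identical stacks give identical hashes."""
    stack = "|".join((gpu, driver, os_name, browser))
    return hashlib.sha256(stack.encode("utf-8")).hexdigest()[:16]


# Hypothetical visitors: three devices of the same model plus one different machine.
visitors = [
    ("gpu-a", "1.0", "mobile-os-12", "browser-90"),
    ("gpu-a", "1.0", "mobile-os-12", "browser-90"),
    ("gpu-a", "1.0", "mobile-os-12", "browser-90"),
    ("gpu-b", "4.2", "desktop-os-10", "browser-90"),
]

counts = Counter(simulated_canvas_fingerprint(*v) for v in visitors)
distinct = len(counts)
largest_group = max(counts.values())
print(f"{len(visitors)} visitors, {distinct} distinct fingerprints, "
      f"largest shared group: {largest_group}")
```

When a fingerprint is shared by many visitors, blocking it blocks all of them, which is the same false‑positive problem that affects IP bans.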

Legal recourse against crawlers is possible but rarely practical because most scraped data is used internally and not publicly disclosed.

The ongoing “war” between crawler and anti‑crawler teams leads to a culture of constant teasing, flag‑setting, and occasional cooperation after prolonged conflict.

4. Future Outlook

Post‑conflict periods see a shift toward whitelisting trusted partners and reducing aggressive blocking to avoid harming business relationships.
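One way such a whitelist might look is sketched below, assuming partners are identified by network range. The article does not say how partners are recognized, so the range‑based approach, the names `PARTNER_NETWORKS` and `is_trusted_partner`, and the addresses (taken from ranges reserved for documentation) are all assumptions.

```python
import ipaddress

# Hypothetical partner ranges, using networks reserved for documentation.
PARTNER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_trusted_partner(ip: str) -> bool:
    """Return True if the address falls inside any allowlisted partner network."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        return False  # malformed addresses are never trusted
    return any(address in network for network in PARTNER_NETWORKS)


def should_apply_anti_crawling(ip: str) -> bool:
    """Partners skip the aggressive checks; everyone else is still screened."""
    return not is_trusted_partner(ip)


for candidate in ("192.0.2.10", "198.51.100.200", "not-an-ip"):
    print(candidate, "trusted" if is_trusted_partner(candidate) else "screened")
```

In practice an address check alone is weak, for the reasons given earlier about shared and reassigned IPs, so a real deployment would likely pair it with some form of credential agreed between the two companies.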

Nevertheless, new competitors will inevitably re‑ignite the arms race as profit motives drive renewed crawling activity.

The cycle creates more specialized roles, raising the market value of both crawler and anti‑crawler engineers.

Author: Cui Guangyu, Development Manager at Ctrip Hotel R&D, former anti‑crawling colleague at Qunar, “non‑famous” humorist at Ctrip Tech Center.

