
From Early Crawlers to ByteDance: A History of Web Scraping

This article traces the evolution of web crawlers—from early Perl scripts to modern ByteDance agents—explaining their role in search engines, business models, anti‑crawling measures, and the impact on content creation and competition.

21CTO

Search engines fundamentally consist of crawling, cleaning, indexing, and classifying content, with the crawler being the core component. A crawler is a program that fetches web pages or related data.
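The fetch step described above can be sketched in a few lines. This is an illustrative example only (the function name, bot name, and timeout are assumptions, not any real engine's code); a production crawler adds a URL frontier, politeness delays, retries, and HTML parsing.

```python
# Minimal sketch of a crawler's fetch step: request one page with a
# declared User-Agent and return the body as decoded text.
from urllib.request import Request, urlopen

def fetch(url: str, user_agent: str = "ExampleBot/1.0") -> str:
    """Download one page and return its body as decoded text."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

A crawler then extracts links from each fetched page and feeds them back into its queue, which is how a few seed URLs grow into a full index.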

Many internet companies started by building crawlers; by scraping existing sites they could quickly assemble content‑rich platforms, such as aggregators of WeChat public accounts.

Crawlers are also used to collect contact information for spam calls, clone websites, track competitor pricing, and automate various tasks, effectively replacing manual work.

Crawlers were first written in Perl, and later in PHP, Java, and Python. In the early days there were few web pages to fetch; today, crawling at scale requires clusters of machines.

Google benefited from its early work on crawling, and China's Baidu later adopted similar techniques.

The business model involves automatically gathering data into a database, deduplicating, scoring, and returning the most relevant results to users.
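The deduplication step in that pipeline can be sketched as follows, assuming page bodies are plain strings (real pipelines normalize content before hashing and use near-duplicate detection, not exact matching):

```python
# Sketch of a crawl pipeline's dedup step: hash each page body and
# keep only the first copy of each distinct document.
import hashlib

def dedupe(pages):
    """Keep only the first copy of each distinct page body."""
    seen, unique = set(), []
    for body in pages:
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(body)
    return unique

print(dedupe(["page A", "page B", "page A"]))  # → ['page A', 'page B']
```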

From an operational perspective, a site can avoid producing original content altogether: crawled material surfaces through its search box, shaping user habits and making the site an entry point for information.

With the rise of apps, information became fragmented into “islands,” especially on high‑quality WeChat public accounts, turning content into private traffic pools.

Baidu did not build a content platform of its own, which pushed users toward content apps and super-apps and weakened its ad-bidding model.

Today’s Toutiao functions as a search engine, delivering scraped content directly via its search box. In the first half of 2019, Toutiao Search was officially launched, providing users with direct query results without generating new articles.

Below is the User‑Agent used by Toutiao’s crawler:

Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.9857.1381 Mobile Safari/537.36; Bytespider

The agent mimics an Android phone and appends the token “Bytespider”. Many site owners report that Toutiao's crawler can generate as many as 460,000 requests in a single morning, enough to bring a site down.
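A server can identify this crawler by looking for the “Bytespider” token in the User-Agent header. The check below is a hypothetical sketch (the function name is an assumption); note that, as discussed later in the article, the header is trivially forgeable, so this alone cannot prove a request really came from ByteDance.

```python
# Hypothetical server-side check for the "Bytespider" token in the
# User-Agent string quoted above. A forged header passes this test too.
BYTESPIDER_UA = (
    "Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/60.0.9857.1381 Mobile Safari/537.36; Bytespider"
)

def is_bytespider(user_agent: str) -> bool:
    """True if the User-Agent carries the Bytespider token."""
    return "bytespider" in user_agent.lower()

print(is_bytespider(BYTESPIDER_UA))  # → True
```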

Bytespider often ignores robots.txt, the standard file through which sites declare their crawling rules, leading to aggressive and sometimes disruptive scraping.
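For contrast, this is how a polite crawler consults robots.txt before fetching, using Python's standard-library parser. The rules shown are an illustrative example, not any real site's file:

```python
# A well-behaved crawler checks robots.txt rules before each fetch.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # in practice: rp.set_url(...); rp.read()

print(rp.can_fetch("Bytespider", "https://example.com/private/page"))  # → False
print(rp.can_fetch("Bytespider", "https://example.com/public/page"))   # → True
```

A crawler that skips this check, as site owners report Bytespider doing, fetches disallowed paths and ignores the site's stated limits.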

When ByteDance’s servers detect high‑frequency HTTP requests, they trigger alerts, trace the source, and scan the content to protect their interests.
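High-frequency detection of this kind is commonly built on a sliding time window. The sketch below is a generic illustration (the class name, threshold, and window size are assumptions, not ByteDance's actual values):

```python
# Sliding-window rate detection: count requests in the last `window`
# seconds and flag a source once it exceeds the limit.
from collections import deque

class RateMonitor:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.hits = deque()

    def allow(self, now: float) -> bool:
        """Record a request at time `now`; False once over the limit."""
        while self.hits and now - self.hits[0] >= self.window:
            self.hits.popleft()  # drop requests outside the window
        self.hits.append(now)
        return len(self.hits) <= self.limit

mon = RateMonitor(limit=3, window_seconds=1.0)
print([mon.allow(t) for t in (0.0, 0.1, 0.2, 0.3, 1.5)])
# → [True, True, True, False, True]
```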

Some developers spoof the Bytespider User-Agent in their own crawlers to shift blame onto Toutiao, which is improper.

From an information perspective, content providers seek exposure, while search engines need sources; mutual benefit provides the best value to users.

Tags: search engine · web crawling · data scraping · content aggregation
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
