Mastering Web Crawlers: From a 3‑Line Script to Scalable Distributed Scrapers

This article explains what a web crawler is, shows a minimal three‑line Python example, expands it into a functional crawler, identifies common shortcomings, and presents practical solutions such as parallelism, priority queues, DNS caching, Bloom‑filter deduplication, storage choices, and inter‑process communication for building robust distributed scrapers.

21CTO

A web crawler is a program or script that automatically fetches information from the World Wide Web according to defined rules; it is a crucial component of search engine systems, responsible for gathering pages, building indexes, and directly influencing the richness and timeliness of search results.

1. The Simplest Crawler – a Three‑Line Poem

The most basic crawler can be written in Python with just three lines:

import requests
url = "http://www.cricode.com"
r = requests.get(url)

These three lines are as concise as a three‑line poem.

2. A Normal Crawler

The previous snippet is incomplete; a functional crawler typically performs the following steps:

Fetch the seed URLs.

Parse each fetched page to extract links and add them to a collection of URLs to be crawled.

Repeat the two steps above for each newly discovered URL until a termination condition is met.

A more complete example (still under 20 lines) looks like this:

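import requests                       # fetch pages
from bs4 import BeautifulSoup         # parse pages
from urllib.parse import urljoin      # resolve relative links
seeds = ["http://www.hao123.com", "http://www.csdn.net", "http://www.cricode.com"]
count = 0
while count < min(len(seeds), 10000):  # stop after 10000 pages or when no URLs remain
    url = seeds[count]
    count += 1
    try:
        r = requests.get(url, timeout=10)
    except requests.RequestException:
        continue                      # skip pages that fail to load
    # do_save_action(r)
    soup = BeautifulSoup(r.content, "html.parser")
    for a in soup.find_all("a", href=True):  # parse links
        link = urljoin(url, a["href"])       # make relative links absolute
        if link.startswith("http"):          # skip mailto:, javascript:, etc.
            seeds.append(link)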

3. Spotting the Problems

The above crawler has many shortcomings:

It runs single‑threaded, making it too slow for large‑scale crawling.

All URLs are stored in a simple list; a queue or priority queue would be more appropriate.

All sites are treated equally; a “big‑site first” strategy is often desirable.

Each request opens a new connection and resolves the hostname again; caching DNS results can save time.

No deduplication is performed, leading to repeated fetching of the same URLs.

…and many other inefficiencies.

4. Solutions to the Identified Issues

1) Parallel Crawling

Parallelism can be achieved with multithreading, thread pools, or by deploying multiple crawler instances across machines. Distributed architectures such as master‑slave, peer‑to‑peer, or hybrid models enable load balancing and fault tolerance, often using consistent hashing for task assignment.
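As a rough illustration of the thread-pool option, the sketch below fetches a batch of URLs concurrently with the standard library's ThreadPoolExecutor. The function name crawl_parallel, the fetch parameter, and the default of 8 workers are assumptions made for this example, not part of the original crawler.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_parallel(urls, fetch, max_workers=8):
    """Fetch every URL in `urls` concurrently and return {url: result}.

    `fetch` is any callable that takes a URL and returns the page body,
    for example `lambda u: requests.get(u, timeout=10).text`.
    A failed download is stored as the exception that was raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per URL and remember which future belongs to which URL.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one page fails
                results[url] = exc
    return results
```

Passing the fetch function in as a parameter keeps the pool logic independent of the HTTP library, which also makes it easy to exercise without network access.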

2) URL Queue Management

Using a priority queue (or multi‑level feedback queue) allows important pages to be crawled first, similar to operating‑system process scheduling.
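One way to sketch such a queue is with the standard library's heapq module. The class name URLFrontier and the convention that a lower number means higher priority are assumptions chosen for this example.

```python
import heapq
import itertools


class URLFrontier:
    """A priority queue of URLs; lower priority numbers are crawled first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, url, priority):
        # The counter ensures URLs with equal priority come out first-in, first-out.
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        priority, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


frontier = URLFrontier()
frontier.push("http://www.cricode.com", priority=2)
frontier.push("http://www.hao123.com", priority=0)   # treated as the "big site" here
frontier.push("http://www.csdn.net", priority=2)
first = frontier.pop()   # the priority-0 URL comes out first
```

How priorities are assigned (site size, link depth, freshness) is left open here; the queue only guarantees the ordering.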

3) DNS Caching

Implement a hash‑table‑based DNS cache to store domain‑to‑IP mappings and avoid repeated lookups.
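A minimal sketch of that idea follows, using a Python dict as the hash table and socket.gethostbyname (IPv4 only) as the default resolver. The class name DNSCache, the 300-second time-to-live, and the injectable resolver parameter are assumptions for illustration; the requests library will not consult this cache on its own.

```python
import socket
import time


class DNSCache:
    """Hash-table DNS cache mapping hostname -> (ip, expiry time)."""

    def __init__(self, ttl=300, resolver=socket.gethostbyname):
        self._ttl = ttl              # seconds an entry stays valid (assumed value)
        self._resolver = resolver    # injectable so the cache can be tested offline
        self._table = {}

    def resolve(self, hostname):
        entry = self._table.get(hostname)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return entry[0]                # cache hit: no lookup needed
        ip = self._resolver(hostname)      # cache miss or expired entry
        self._table[hostname] = (ip, now + self._ttl)
        return ip
```

Entries expire after the time-to-live so that a site that changes its IP address is eventually resolved again.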

4) Page Deduplication

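A Bloom filter provides an efficient probabilistic method to detect previously seen URLs, reducing redundant requests. It never misses a URL it has already recorded, but it can occasionally report a new URL as already seen (a false positive), so a small fraction of pages may be skipped in exchange for very low memory use.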
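The following is a small sketch of a Bloom filter for URL deduplication, not a tuned implementation. The bit-array size, the number of hash functions, and the choice to derive bit positions from a single SHA-256 digest (double hashing) are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Set-like structure with no false negatives and a small false-positive rate."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self._size = size_bits
        self._num_hashes = num_hashes
        self._bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self._num_hashes):
            yield (h1 + i * h2) % self._size

    def add(self, item):
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every bit for this item is set.
        return all(
            self._bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("http://www.cricode.com")
```

In a crawl loop, a URL would be fetched only when `url not in seen`, and added to the filter right after it is queued.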

5) Data Storage

Choose an appropriate storage backend—relational databases, NoSQL stores, or custom file formats—based on scalability and query requirements.
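As one example of the relational option, the sketch below stores fetched pages in SQLite through Python's built-in sqlite3 module. The table name pages, its columns, and the helper names open_store, save_page, and load_page are assumptions for illustration; a large crawl would likely need a different backend.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the page store; ':memory:' keeps it in RAM for testing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " status INTEGER,"
        " body BLOB)"
    )
    return conn


def save_page(conn, url, status, body):
    # INSERT OR REPLACE keeps one row per URL, so a re-crawl overwrites the old copy.
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, status, body) VALUES (?, ?, ?)",
        (url, status, body),
    )
    conn.commit()


def load_page(conn, url):
    """Return (status, body) for a stored URL, or None if it was never saved."""
    return conn.execute(
        "SELECT status, body FROM pages WHERE url = ?", (url,)
    ).fetchone()
```

A function like save_page could stand in for the commented-out do_save_action(r) call in the crawler above, for instance as save_page(conn, url, r.status_code, r.content).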

6) Inter‑Process Communication

Distributed crawlers need a defined data format for exchanging information between processes, enabling coordination and result aggregation.
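The article does not specify a format, so the sketch below simply assumes one: each crawl task is sent as a single line of UTF-8 JSON. The CrawlTask fields (url, priority, depth) and the encode and decode helpers are hypothetical names chosen for this example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CrawlTask:
    url: str
    priority: int = 0   # lower means more urgent (assumed convention)
    depth: int = 0      # number of links followed from a seed URL


def encode(task):
    """Serialize a task to one line of UTF-8 JSON, ready to send over a socket or queue."""
    return (json.dumps(asdict(task), sort_keys=True) + "\n").encode("utf-8")


def decode(line):
    """Rebuild a CrawlTask from a line produced by encode()."""
    return CrawlTask(**json.loads(line.decode("utf-8")))


message = encode(CrawlTask(url="http://www.cricode.com", priority=1, depth=2))
task = decode(message)
```

A newline-delimited text format is easy to inspect and works across languages; a binary format may be preferable when message volume is high.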

Implementing these techniques transforms a naïve script into a robust, high‑performance web crawler capable of handling massive workloads.
