
Designing a Scalable, Configurable Distributed Web Crawler

This article outlines the motivation, requirements, modular decomposition, and architecture of a distributed web crawling platform that emphasizes reusability, lightweight modules, real‑time monitoring, and easy configuration for diverse data‑collection tasks.


1. Origin

While developing crawlers at companies across domains such as real estate, e‑commerce, and advertising, the author repeatedly faced the same problems: how to make crawler projects reusable, how to satisfy new crawling needs at minimal cost, and how to turn a distributed crawling application into a configurable, tool‑like system that is easy to maintain.

2. Project Requirements

Distributed Crawling: Large‑scale crawling (hundreds of thousands of pages) requires a distributed system.

Modular & Lightweight: The system is split into four roles – application layer, service layer, business‑processing layer, and scheduling layer.

Manageable & Monitored: Configuration should be manageable, and runtime monitoring (statistics, error rates, etc.) should be visible through a UI.

General & Extensible: The platform must support varied business needs (e.g., image crawling for real‑estate listings, content extraction for news) and allow extensions without code changes.

3. Module Decomposition

Application Layer

Provides two modules for administrators: a system‑configuration module (site management, online testing) and an operations‑management module (real‑time statistics, error analysis). Users can adjust configurations via the UI and see immediate effects.

Service Layer

Acts as the central data bus, exposing HTTP/Thrift interfaces to read configurations from the database and write crawl results. It also supplies real‑time reporting for the application layer.
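As a rough illustration of this data‑bus role, the sketch below serves per‑site configuration over HTTP using only the JDK's built‑in `com.sun.net.httpserver`. The class name `ConfigService`, the `/config` route, and the `buildConfigJson` helper are hypothetical, not the article's actual API; in the real system the JSON would come from the configuration database.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: the service layer exposes per-site configuration
// over HTTP so crawler nodes read rules without touching the database.
public class ConfigService {

    // Stand-in for a database query; returns a canned JSON document.
    static String buildConfigJson(String site) {
        return "{\"site\":\"" + site + "\",\"fetchIntervalSec\":3600}";
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // GET /config?site=<name> returns the crawl rules for one site
        server.createContext("/config", exchange -> {
            String query = exchange.getRequestURI().getQuery(); // e.g. "site=example"
            String site = query == null ? "default" : query.replaceFirst("^site=", "");
            byte[] body = buildConfigJson(site).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```

A Thrift interface would expose the same read/write operations over RPC instead of HTTP; the key point is that crawler nodes depend only on this layer, never on the database directly.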

Business‑Processing Layer

The core of the crawler, handling URL discovery and content processing. URL discovery is modeled as a configurable “discovery system” that mimics human navigation through steps (root pages, sub‑pages, link extraction, pattern matching, recursion) until the final URLs are obtained.
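The root‑page/sub‑page/pattern‑matching recursion described above can be sketched as follows. This is a minimal illustration, not the article's implementation: the `Discovery` class, the `pageLoader` function (standing in for the real fetcher), and the `keepPattern` and `maxDepth` parameters (which would come from per‑site configuration) are all assumed names.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the "discovery system": starting from a root
// page, extract links, collect those matching a configured pattern as
// final URLs, and recurse into the rest up to a configured depth.
public class Discovery {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    public static Set<String> discover(String rootUrl,
                                       Function<String, String> pageLoader,
                                       Pattern keepPattern, int maxDepth) {
        Set<String> found = new LinkedHashSet<>();
        walk(rootUrl, pageLoader, keepPattern, maxDepth, 0, new HashSet<>(), found);
        return found;
    }

    private static void walk(String url, Function<String, String> pageLoader,
                             Pattern keep, int maxDepth, int depth,
                             Set<String> visited, Set<String> found) {
        if (depth > maxDepth || !visited.add(url)) return; // depth limit + de-dup
        Matcher m = HREF.matcher(pageLoader.apply(url));
        while (m.find()) {
            String link = m.group(1);
            if (keep.matcher(link).matches()) {
                found.add(link);                 // final detail-page URL
            } else {
                walk(link, pageLoader, keep, maxDepth, depth + 1, visited, found); // sub-page
            }
        }
    }
}
```

Because the root URLs, link pattern, and depth are plain data, a new site can be onboarded by editing configuration rather than code, which is exactly the reusability goal stated earlier.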

After URLs are discovered, processing follows a pipeline (similar to Netty’s pipeline) where each stage (fetch, JavaScript execution, generic parsing) operates on a shared context. Parsing rules define how to extract key‑value pairs, apply prefixes/suffixes, and enforce required fields.
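The Netty‑style pipeline over a shared context can be sketched like this. The `CrawlPipeline` class and its stage/context shapes are illustrative assumptions; each stage reads what earlier stages wrote (fetch puts the HTML in, parsing takes key‑value pairs out).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch of the processing pipeline: stages run in order
// over a shared context map, mirroring the Netty-style handler chain
// (fetch -> JavaScript execution -> generic parsing) described above.
public class CrawlPipeline {
    private final List<Consumer<Map<String, Object>>> stages = new ArrayList<>();

    public CrawlPipeline addStage(Consumer<Map<String, Object>> stage) {
        stages.add(stage);
        return this;
    }

    public Map<String, Object> run(String url) {
        Map<String, Object> ctx = new HashMap<>();
        ctx.put("url", url);
        for (Consumer<Map<String, Object>> stage : stages) {
            stage.accept(ctx); // each stage reads/writes the shared context
        }
        return ctx;
    }
}
```

Parsing rules (extraction expressions, prefixes/suffixes, required fields) then become configuration consumed by the parsing stage, so new extraction logic slots in as another stage rather than a code rewrite.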

Scheduling Layer

Manages task queues (normal and priority).

Controls discovery frequency (incremental vs. full).

Handles checkpoint‑and‑resume (restarting interrupted crawls from where they stopped) and other operational concerns.
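The two‑queue policy above can be sketched in a few lines; the `TaskScheduler` class and method names are hypothetical. Priority tasks (e.g., an online configuration test triggered from the UI) are always drained before routine work.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the scheduling layer's two-queue policy:
// the priority queue is always drained before the normal queue.
public class TaskScheduler {
    private final Queue<String> priority = new ArrayDeque<>();
    private final Queue<String> normal = new ArrayDeque<>();

    public void submit(String task, boolean urgent) {
        (urgent ? priority : normal).add(task);
    }

    // Returns the next task to run, or null if both queues are empty.
    public String next() {
        String task = priority.poll();
        return task != null ? task : normal.poll();
    }
}
```

In a distributed deployment these queues would live in shared middleware (the "queue" dependency mentioned in the architecture section) rather than in process memory, but the dequeue policy is the same.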

4. System Architecture Design

The architecture is viewed from several perspectives:

Business Modules: Application, Service, Business‑Processing, Scheduling.

Functional Systems: Discovery, Crawling, Configuration, Monitoring.

Extensibility: Customizable responsibility chains and attribute extraction.

Real‑time: Real‑time crawling, configuration, monitoring, and testing.

Overall Architecture: Distributed design with a master‑slave service layer and lightweight dependencies (queue, database, Java).

5. Diagrams

The original post includes diagrams covering monitoring, the backend architecture, configuration, the processing pipeline, and the distributed crawling topology.
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.
