Tagged articles

distributed scraping

2 articles

Oct 21, 2018 · Backend Development

Mastering Web Crawlers: Core Modules, HTTP Strategies, and Scaling Tips

This article explains the fundamentals of web crawlers: their three main modules, how an HTTP request is composed, flow‑control techniques for large‑scale scraping, and content extraction methods for static and dynamic pages. It also covers current challenges such as interaction hurdles, JavaScript parsing, and IP restrictions.

Content Extraction · HTTP requests · distributed scraping

13 min read


21CTO

Jun 9, 2016 · Backend Development

Mastering Web Crawlers: From a 3‑Line Script to Scalable Distributed Scrapers

This article explains what a web crawler is, shows a minimal three‑line Python example, expands it into a functional crawler, identifies common shortcomings, and presents practical solutions such as parallelism, priority queues, DNS caching, Bloom‑filter deduplication, storage choices, and inter‑process communication for building robust distributed scrapers.

Deduplication · distributed scraping · dns cache

9 min read

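The summary above names Bloom‑filter deduplication as one of the fixes for a naive crawler. The listing does not show the article's own code, so the following is only a minimal sketch of the general technique, not the author's implementation. The class name BloomFilter, the helper dedupe, the default sizes, and the example URLs are all hypothetical choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter for URL deduplication (illustrative sketch)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        # Sizes are assumed defaults, not tuned for any real crawl.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive num_hashes bit positions from two
        # 64-bit slices of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def dedupe(urls, seen: BloomFilter):
    """Yield each URL the filter has not (probably) seen, recording it."""
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url


if __name__ == "__main__":
    seen = BloomFilter()
    frontier = ["http://a.example/", "http://b.example/", "http://a.example/"]
    for url in dedupe(frontier, seen):
        print(url)
```

A Bloom filter trades a small false‑positive rate for constant memory: a URL reported as seen may occasionally be new and get skipped, but a URL reported as unseen is always new. That trade is usually acceptable for a crawler frontier, where an in‑memory set of every visited URL would grow without bound.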