Tagged articles
2 articles
MaGe Linux Operations
Oct 21, 2018 · Backend Development

Mastering Web Crawlers: Core Modules, HTTP Strategies, and Scaling Tips

This article explains the fundamentals of web crawlers: their three main modules, HTTP request composition, flow-control techniques for large-scale scraping, content extraction from static and dynamic pages, and current challenges such as interaction hurdles, JavaScript parsing, and IP restrictions.

Content Extraction · HTTP requests · distributed scraping
13 min read
21CTO
Jun 9, 2016 · Backend Development

Mastering Web Crawlers: From a 3‑Line Script to Scalable Distributed Scrapers

This article explains what a web crawler is, shows a minimal three‑line Python example, expands it into a functional crawler, identifies common shortcomings, and presents practical solutions such as parallelism, priority queues, DNS caching, Bloom‑filter deduplication, storage choices, and inter‑process communication for building robust distributed scrapers.

Parallelism · Web Crawling · deduplication
9 min read