Tagged articles

web crawling

108 articles · Page 2 of 2
21CTO
Nov 9, 2016 · Backend Development

Unlocking the Power of Web Crawlers: How to Harvest Data Efficiently

This article explains what web crawlers are, why they’re essential for content recommendation systems, the technical approaches across languages, practical use‑cases like price monitoring and news aggregation, and best practices for building efficient, ethical crawlers.

Backend Development · data extraction · web crawling
5 min read
21CTO
Jun 9, 2016 · Backend Development

Mastering Web Crawlers: From a 3‑Line Script to Scalable Distributed Scrapers

This article explains what a web crawler is, shows a minimal three‑line Python example, expands it into a functional crawler, identifies common shortcomings, and presents practical solutions such as parallelism, priority queues, DNS caching, Bloom‑filter deduplication, storage choices, and inter‑process communication for building robust distributed scrapers.

Deduplication · distributed scraping · dns cache
9 min read
ITPUB
May 6, 2016 · Backend Development

Scrapy vs. Gevent: Choosing the Right Python Web‑Crawling Framework

This guide compares Scrapy (especially version 0.16) with gevent‑based crawling solutions, outlines their strengths, weaknesses, and common pitfalls, and provides practical tips, resource links, and deployment advice for building efficient Python web scrapers.

Python · Scraping · Scrapy
11 min read
21CTO
Dec 22, 2015 · Big Data

How to Build a Scalable Distributed Web Crawler for Massive Data Harvesting

This article explains how to design and implement a distributed web‑crawling framework in Java that can collect, structure, and store massive amounts of data while handling anti‑scraping measures, duplicate detection, and real‑time monitoring.

Big Data · Java · data extraction
11 min read
Qunar Tech Salon
Nov 30, 2015 · Backend Development

Choosing a Web Crawler: Nutch, Crawler4j, WebMagic, WebCollector, Scrapy, or Others

This article compares distributed, Java‑based, and non‑Java web crawlers—examining Nutch, Crawler4j, WebMagic, WebCollector, Scrapy and alternatives—highlighting their strengths, limitations, and suitability for tasks such as data extraction, multi‑threading, AJAX handling, and search‑engine construction.

Nutch · Scrapy · crawler frameworks
11 min read
21CTO
Oct 21, 2015 · Fundamentals

How Graph Traversal Powers Web Crawlers: From BFS to Internet Indexing

This article explains how graph traversal algorithms such as BFS and DFS underpin web crawlers, illustrating the concepts with examples from China's road network and tracing the history from Euler's Seven Bridges of Königsberg problem to modern internet indexing.

BFS · DFS · Search Engine
6 min read