Mastering Web Crawlers: Strategies, Tools, and Practical Code Samples
This article explores the fundamentals and advanced techniques of building web crawlers, covering crawler types, essential features, RSS/ATOM harvesting, custom scraping methods, PHP header manipulation, regex extraction, and concurrency, providing actionable code examples for backend developers.
Building a web crawler is more complex than it appears, but like any craft it starts with simple tools: the equivalent of a hammer, an axe, and a utility knife.
Crawlers can be categorized by scope: large‑scale crawlers that index entire sites (e.g., StormCrawler, typically paired with search and indexing engines such as Elasticsearch, Lucene, or Sphinx) and small‑scale crawlers targeting specific content.
Companies like Baidu and Google operate massive distributed storage clusters and load‑balance requests across hundreds of thousands of servers to handle search queries.
Key Crawler Architecture Traits
Robustness: Avoid spider traps, handle loops, pause when a site is unavailable, and resume promptly.
Politeness: Respect robots.txt, privacy, and copyright constraints.
Quality: Capture fresh, timely content.
Visualization: Real‑time monitoring of crawling progress and post‑processing data visualization.
Scalability: Support scheduled tasks, distributed deployment, proxy usage, user‑login simulation, and custom macros.
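The politeness trait can be illustrated with a small robots.txt check. This is a minimal sketch, not a complete parser: it only reads Disallow rules in the "User-agent: *" group and ignores Allow rules and wildcards. The function names disallowed_prefixes() and is_allowed() are hypothetical helpers introduced here for illustration.

```php
<?php
/**
 * Collect the Disallow path prefixes that apply to every crawler
 * ("User-agent: *") from the text of a robots.txt file.
 * Simplified: ignores Allow rules, wildcards, and bot-specific groups.
 */
function disallowed_prefixes(string $robotsTxt): array
{
    $prefixes = [];
    $applies = false;       // are we inside a group that names "*"?
    $lastWasAgent = false;  // consecutive User-agent lines share one group
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            $matches = ($value === '*');
            $applies = $lastWasAgent ? ($applies || $matches) : $matches;
            $lastWasAgent = true;
            continue;
        }
        $lastWasAgent = false;
        if ($field === 'disallow' && $applies && $value !== '') {
            $prefixes[] = $value;
        }
    }
    return $prefixes;
}

/** Return false when $path starts with any disallowed prefix. */
function is_allowed(string $path, array $prefixes): bool
{
    foreach ($prefixes as $prefix) {
        if (strncmp($path, $prefix, strlen($prefix)) === 0) {
            return false;
        }
    }
    return true;
}
```

A crawler would fetch /robots.txt once per host, cache the prefixes, and call is_allowed() before requesting each path.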
For news recommendation, we only need to fetch news articles, not the entire site, so a targeted crawler suffices.
RSS and ATOM Sources
RSS, which dates back to the Netscape era, and the later Atom format are widely used by news sites to publish updates. A feed is an XML file listing recent articles with titles, excerpts, timestamps, and often image URLs. Example: https://techcrunch.com/feed/.
When a site offers no feed, or its feed updates too infrequently, custom scraping is required: simulate a browser and tailor the extraction logic to the site.
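Reading such a feed can be sketched with PHP's SimpleXML extension. The example below assumes a plain RSS 2.0 layout (channel, item, title, link, description, pubDate). Atom feeds and namespaced extensions such as media thumbnails are not handled, and parse_rss_items() is a hypothetical helper name.

```php
<?php
/**
 * Parse RSS 2.0 XML and return one array per <item>.
 * Returns an empty array when the XML cannot be parsed.
 */
function parse_rss_items(string $xml): array
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $feed = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($feed === false || !isset($feed->channel->item)) {
        return [];
    }

    $items = [];
    foreach ($feed->channel->item as $item) {
        $timestamp = strtotime((string) $item->pubDate);
        $items[] = [
            'title'     => trim((string) $item->title),
            'link'      => trim((string) $item->link),
            'summary'   => trim(strip_tags((string) $item->description)),
            'published' => $timestamp === false ? null : $timestamp, // Unix time
        ];
    }
    return $items;
}
```

In practice the XML string would come from an HTTP request to the feed URL; the parsing step stays the same.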
Data Sources and Customized Crawling
Before analyzing a target site, identify its technology stack using get_headers() or curl to retrieve HTTP headers (e.g., status codes, server information). Some sites employ Ajax or JavaScript to generate HTML, complicating scraping.
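As a rough illustration of the header inspection step, the sketch below summarizes header lines in the shape that get_headers($url) returns (a list of raw lines, starting with the status line). The helper name summarize_headers() and the choice of which headers to report are assumptions made for this example.

```php
<?php
/**
 * Summarize raw response header lines, e.g. the array returned by
 * get_headers($url): ["HTTP/1.1 200 OK", "Server: nginx", ...].
 */
function summarize_headers(array $lines): array
{
    $summary = ['status' => null, 'server' => null, 'powered_by' => null, 'content_type' => null];
    $wanted = ['server' => 'server', 'x-powered-by' => 'powered_by', 'content-type' => 'content_type'];

    foreach ($lines as $line) {
        // Status lines look like "HTTP/1.1 200 OK"; after redirects the last one wins.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $summary['status'] = (int) $m[1];
            continue;
        }
        if (strpos($line, ':') === false) {
            continue;
        }
        [$name, $value] = array_map('trim', explode(':', $line, 2));
        $key = strtolower($name);
        if (isset($wanted[$key])) {
            $summary[$wanted[$key]] = $value;
        }
    }
    return $summary;
}

// Usage against a live site (needs network access and allow_url_fopen):
// $lines = get_headers('http://m.sina.com.cn');
// if ($lines !== false) { print_r(summarize_headers($lines)); }
```

Headers such as Server and X-Powered-By are hints only; many sites strip or alter them.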
Mobile‑optimized sites (e.g., http://m.sina.com.cn) often have simpler markup. Certain sites restrict access to specific browsers like WeChat; spoofing the User‑Agent header can bypass this restriction.
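Example PHP header configuration for mimicking browsers:
$url = 'http://m.sina.com.cn';  // example target: the mobile site mentioned above
$aurl = parse_url($url);        // provides $aurl['host'] for the Host header
$is_wechat = 1;                 // 1 = present ourselves as the WeChat in-app browser
$header = [];
$header[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$header[] = 'Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3';
// Set as WeChat browser
if ($is_wechat == 1) {
    $header[] = "User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_2 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Mobile/10B146 MicroMessenger/5.0";
} else {
    $header[] = 'User-Agent: Mozilla/5.0 (Windows NT 5.2; rv:25.0) Gecko/20100101 Firefox/25.2';
}
$header[] = 'Host: ' . $aurl['host'];
$header[] = 'Connection: Keep-Alive';
$header[] = 'Cookie: tracknick=夏文闹';  // sample cookie; non-ASCII values are usually sent URL-encoded
// Send the request with the custom headers
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
$html = curl_exec($ch);                          // false on failure

With proper headers, we can retrieve page content and proceed to extract list pages.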
List pages typically contain titles, summaries, and thumbnails. Using regular expressions, we can capture these elements. Example PHP extraction code:
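// $house_details is assumed to hold the HTML of a page fetched earlier.
// Extract the attribute list
preg_match_all('/<ul class="attr_info fang fang_new">(.*)<\/ul>/siU', $house_details, $house_info);
// Extract the contact name and phone number
preg_match_all('/<p class="llname">(.*)<\/p>/siU', $house_details, $contact_personal);
preg_match_all('/<p class="llnumber">(.*)<\/p>/siU', $house_details, $contact_number);
// Extract the image area, then every ref="..." value inside it
// (the pattern is kept from the original; it also matches href="...")
preg_match_all('/<div class="image_area image_area_new">(.*)<\/div>/siU', $house_details, $house_images);
if (!empty($house_images[1])) {  // guard against pages without an image area
    preg_match_all('/ref="(.*?)"/s', $house_images[1][0], $house_image);
    foreach ($house_image[1] as $img) {
        echo $img . "<br />";
    }
}

The same approach extends to paginating through list pages and following links to detail pages for bulk collection.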
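The pagination idea can be sketched as a loop that walks list pages, collects detail links, and stops when it runs out of pages or meets one it has already seen. This is a minimal sketch under stated assumptions: the class names "item" and "next" are hypothetical markup, $fetch stands for any callable that returns HTML for a URL, and relative links are not resolved.

```php
<?php
/**
 * Walk list pages starting at $startUrl and collect detail-page links.
 * $fetch is a callable taking a URL and returning HTML, or null on failure.
 * The class names "item" and "next" are hypothetical markup for this sketch.
 */
function collect_detail_links(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $visited = [];  // list pages already fetched, so loops terminate
    $details = [];  // detail URLs stored as keys for deduplication
    $url = $startUrl;

    while ($url !== null && !isset($visited[$url]) && count($visited) < $maxPages) {
        $visited[$url] = true;
        $html = $fetch($url);
        if (!is_string($html)) {
            break;  // fetch failed; stop rather than guess the next page
        }
        preg_match_all('/<a class="item" href="([^"]+)"/i', $html, $m);
        foreach ($m[1] as $link) {
            $details[html_entity_decode($link)] = true;
        }
        $url = preg_match('/<a class="next" href="([^"]+)"/i', $html, $n)
            ? html_entity_decode($n[1])
            : null;
    }
    return array_keys($details);
}
```

Passing the fetcher as a callable keeps the loop independent of how pages are downloaded, so it can be exercised with canned HTML before pointing it at a live site.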
Detail pages can be parsed with regular expressions or with SimpleXML / DOMDocument using XPath. When HTML fragments are incomplete or badly malformed, regex may be more reliable; otherwise, XPath queries are usually less brittle and easier to maintain than hand-written patterns.
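For comparison with the regex approach, here is a small DOMDocument and DOMXPath sketch. The markup it targets (an h1 with class "title" and images inside a div whose class contains "image_area") is assumed for illustration, and parse_detail() is a hypothetical helper. Character-encoding handling is left out: loadHTML() may misread UTF-8 text unless the page declares its charset.

```php
<?php
/**
 * Extract a title and image URLs from detail-page HTML with XPath.
 * The targeted markup is an assumption made for this example.
 */
function parse_detail(string $html): array
{
    $result = ['title' => null, 'images' => []];
    if ($html === '') {
        return $result;  // loadHTML() rejects empty input
    }

    $previous = libxml_use_internal_errors(true);  // tolerate sloppy real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $titleNode = $xpath->query('//h1[@class="title"]')->item(0);
    if ($titleNode !== null) {
        $result['title'] = trim($titleNode->textContent);
    }
    // Select the src attribute of every <img> under the image area.
    foreach ($xpath->query('//div[contains(@class, "image_area")]//img/@src') as $attr) {
        $result['images'][] = $attr->nodeValue;
    }
    return $result;
}
```

The parser repairs unclosed tags on its own, which is often enough for moderately broken pages.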
Concurrent downloading can be achieved with curl_multi, as described in "PHP and MySQL High‑Performance Application Development".
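A minimal curl_multi sketch follows. It is not taken from the book mentioned above; fetch_all() and its timeout parameter are hypothetical names, and a real crawler would also add per-host rate limits and retries.

```php
<?php
/**
 * Download several URLs concurrently with curl_multi.
 * Returns [key => body]; a failed transfer maps to null.
 */
function fetch_all(array $urls, int $timeoutSeconds = 10): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // capture the body as a string
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0);  // wait for activity instead of busy-looping
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Per-transfer results are reported through curl_multi_info_read().
    $failed = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        if ($info['result'] !== CURLE_OK) {
            $failed[] = array_search($info['handle'], $handles, true);
        }
    }

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = in_array($key, $failed, true) ? null : curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    return $results;  // on PHP 7, also call curl_close() and curl_multi_close()
}
```

Keying the result array by the caller's own keys makes it easy to map each body back to the list or detail page it came from.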
Conclusion
The article presented crawler development strategies and practical techniques. By applying these methods, you can build a functional web crawler. Future posts will cover cookie/session handling, RSS crawling, update monitoring, deduplication, data cleaning, and crawler management.
21CTO
