
Mastering Web Crawlers: Strategies, Tools, and Practical Code Samples

This article covers the fundamentals of building web crawlers along with some more advanced techniques: crawler types, essential architectural traits, RSS/Atom harvesting, custom scraping, HTTP header manipulation in PHP, regex extraction, and concurrent downloading, with code examples for backend developers.

21CTO

Nov 20, 2016


Building a web crawler is more complex than it appears, but it starts with simple tools, much as carpentry starts with hammers, axes, and utility knives.

Crawlers can be categorized by scope. Large-scale crawlers cover entire sites and usually feed a search index (e.g., StormCrawler for crawling, with Elasticsearch, Lucene, or Sphinx for indexing and search). Small-scale crawlers target specific content.

Companies like Baidu and Google operate massive distributed storage clusters and load‑balance requests across hundreds of thousands of servers to handle search queries.

Key Crawler Architecture Traits

Robustness: Avoid spider traps, handle loops, pause when a site is unavailable, and resume promptly.

Politeness: Respect robots.txt, privacy, and copyright constraints.

Quality: Capture fresh, timely content.

Visualization: Real‑time monitoring of crawling progress and post‑processing data visualization.

Scalability: Support scheduled tasks, distributed deployment, proxy usage, user‑login simulation, and custom macros.
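The politeness trait above can be illustrated with a deliberately simplified robots.txt check. This is a sketch under stated assumptions, not a complete parser: it only reads Disallow rules that apply to all user agents, treats them as plain path prefixes, and ignores Allow rules and wildcards. The function names disallowed_prefixes() and is_allowed() are hypothetical.

```php
<?php
/** Collect Disallow path prefixes from groups that apply to all user agents ("*"). */
function disallowed_prefixes(string $robotsTxt): array
{
    $prefixes = [];
    $applies = false;       // does the current group apply to "*"?
    $lastWasAgent = false;  // was the previous rule line a User-agent line?
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            // Consecutive User-agent lines share one group of rules.
            $applies = ($lastWasAgent && $applies) || $value === '*';
            $lastWasAgent = true;
        } else {
            $lastWasAgent = false;
            if ($field === 'disallow' && $applies && $value !== '') {
                $prefixes[] = $value;
            }
        }
    }
    return $prefixes;
}

/** Return false when the path starts with any disallowed prefix. */
function is_allowed(string $path, array $prefixes): bool
{
    foreach ($prefixes as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}
```

A crawler would fetch /robots.txt once per host, cache the prefixes, and call is_allowed() before queuing each URL.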

For news recommendation, we only need to fetch news articles, not the entire site, so a targeted crawler suffices.

RSS and ATOM Sources

RSS originated at Netscape in the late 1990s, and the Atom format followed a few years later. Both are widely used by news sites to publish updates. An RSS feed is an XML file listing recent articles with titles, excerpts, timestamps, and often image URLs. Example: https://techcrunch.com/feed/.
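As a minimal sketch of reading such a feed, the function below parses an RSS 2.0 document with SimpleXML and returns one array per item. It assumes the standard channel/item layout with title, link, pubDate, and description elements. The name parse_rss_items() is hypothetical, and Atom feeds (which use feed/entry instead) are not handled.

```php
<?php
/**
 * Parse an RSS 2.0 document into a list of items.
 * Returns an empty array when the XML is malformed or has no channel.
 */
function parse_rss_items(string $xml): array
{
    // Collect libxml errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $feed = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($feed === false || !isset($feed->channel)) {
        return [];
    }

    $items = [];
    foreach ($feed->channel->item as $item) {
        $items[] = [
            'title'     => trim((string) $item->title),
            'link'      => trim((string) $item->link),
            // Unix timestamp, or null when pubDate is missing or unparseable.
            'published' => strtotime((string) $item->pubDate) ?: null,
            // Descriptions often carry HTML; keep only the text.
            'summary'   => trim(strip_tags((string) $item->description)),
        ];
    }
    return $items;
}
```

The feed body itself could come from file_get_contents() or cURL; comparing the published timestamps against the last crawl time is one way to pick up only new articles.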

When a site has no feed, or its feed is updated infrequently, custom scraping is required: the crawler simulates a browser and uses extraction logic tailored to the site.

Data Sources and Customized Crawling

Before analyzing a target site, identify its technology stack using get_headers() or curl to retrieve HTTP headers (e.g., status codes, server information). Some sites employ Ajax or JavaScript to generate HTML, complicating scraping.
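The header check can be sketched as follows. The function takes the raw header lines that get_headers($url) returns and pulls out the final status code plus the Server and X-Powered-By values, which are often (not always) present and hint at the stack. The name summarize_headers() is hypothetical.

```php
<?php
/**
 * Summarize raw response header lines, such as the indexed array
 * returned by get_headers($url).
 */
function summarize_headers(array $lines): array
{
    $summary = ['status' => null, 'server' => null, 'powered_by' => null];
    foreach ($lines as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            // get_headers() follows redirects, so the last status line wins.
            $summary['status'] = (int) $m[1];
        } elseif (stripos($line, 'Server:') === 0) {
            $summary['server'] = trim(substr($line, strlen('Server:')));
        } elseif (stripos($line, 'X-Powered-By:') === 0) {
            $summary['powered_by'] = trim(substr($line, strlen('X-Powered-By:')));
        }
    }
    return $summary;
}
```

A typical call would be summarize_headers(get_headers('http://m.sina.com.cn')), guarded against get_headers() returning false when the request fails.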

Mobile-optimized sites (e.g., http://m.sina.com.cn) often have simpler markup and are easier to parse. Some sites only serve pages to specific clients, such as the WeChat in-app browser; sending a matching User-Agent header gets past this check.

Example PHP header configuration for mimicking browsers:

$url = 'http://m.sina.com.cn';   // example target; replace with the page to fetch
$is_wechat = 1;                  // 1 = pose as the WeChat in-app browser
$aurl = parse_url($url);         // split the URL so the Host header can be set
$header = array();
$header[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8';
$header[] = 'Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3';
// Set as WeChat browser
if ($is_wechat == 1) {
    $header[] = 'User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_2 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Mobile/10B146 MicroMessenger/5.0';
} else {
    $header[] = 'User-Agent: Mozilla/5.0 (Windows NT 5.2; rv:25.0) Gecko/20100101 Firefox/25.0';
}
$header[] = 'Host: ' . $aurl['host'];
$header[] = 'Connection: Keep-Alive';
$header[] = 'Cookie: tracknick=夏文闹';

With proper headers, we can retrieve page content and proceed to extract list pages.
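One way to send such headers is through cURL's CURLOPT_HTTPHEADER option. The sketch below is an assumption about how the pieces fit together rather than the article's own fetch code; build_headers() and fetch_page() are hypothetical names, and the User-Agent strings are the ones used above.

```php
<?php
/** Build request headers; the flag switches between a WeChat and a desktop User-Agent. */
function build_headers(bool $asWechat): array
{
    $userAgent = $asWechat
        ? 'Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_2 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Mobile/10B146 MicroMessenger/5.0'
        : 'Mozilla/5.0 (Windows NT 5.2; rv:25.0) Gecko/20100101 Firefox/25.0';

    return [
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3',
        'User-Agent: ' . $userAgent,
    ];
}

/** Fetch a page with the headers above; returns null when the request fails. */
function fetch_page(string $url, bool $asWechat = false): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_HTTPHEADER     => build_headers($asWechat),
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 10,
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}
```

The sketch leaves out the Host header on purpose: cURL derives it from the URL, which keeps it correct when a redirect leads to a different host.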

List pages typically contain titles, summaries, and thumbnails, and regular expressions can capture these elements. The PHP example below comes from a housing listing page and extracts the attribute block, contact name, phone number, and image URLs:

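// $house_details holds the HTML of a listing page fetched earlier.
// Flags: s = dot matches newlines, i = case-insensitive, U = ungreedy quantifiers.
preg_match_all('/<ul class="attr_info fang fang_new">(.*)<\/ul>/siU', $house_details, $house_info); // attribute list
preg_match_all('/<p class="llname">(.*)<\/p>/siU', $house_details, $contact_personal); // contact name
preg_match_all('/<p class="llnumber">(.*)<\/p>/siU', $house_details, $contact_number); // contact phone
preg_match_all('/<div class="image_area image_area_new">(.*)<\/div>/siU', $house_details, $house_images); // image block, up to the first </div>
// Only look for image URLs if the image block was found.
if (!empty($house_images[1][0])) {
    // The pattern matches any attribute ending in ref=, which includes href=.
    preg_match_all('/ref="(.*?)"/s', $house_images[1][0], $house_image);
    foreach ($house_image[1] as $img) {
        echo $img . "<br />";
    }
}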

This approach allows pagination through list pages, following links to detail pages for bulk collection.
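The list-to-detail flow can be sketched with two small helpers: one generates the URLs of the paginated list pages, the other collects the detail-page links found on each of them. Both function names, the %d URL pattern, and the link pattern in the usage note are hypothetical and would be adapted to the target site.

```php
<?php
/** Build list-page URLs from a sprintf pattern containing one %d page placeholder. */
function list_page_urls(string $pattern, int $firstPage, int $lastPage): array
{
    $urls = [];
    for ($page = $firstPage; $page <= $lastPage; $page++) {
        $urls[] = sprintf($pattern, $page);
    }
    return $urls;
}

/** Collect unique links from double-quoted href attributes that match a pattern. */
function extract_detail_links(string $html, string $hrefPattern): array
{
    preg_match_all('/<a\s[^>]*href="([^"]+)"/i', $html, $matches);
    $links = [];
    foreach ($matches[1] as $href) {
        $href = html_entity_decode($href); // turn &amp; back into &
        if (preg_match($hrefPattern, $href)) {
            $links[] = $href;
        }
    }
    // Drop duplicates while keeping first-seen order.
    return array_values(array_unique($links));
}
```

A crawl loop would call list_page_urls('http://example.com/list?page=%d', 1, 50), fetch each page, and pass the HTML to extract_detail_links() with a pattern such as '#^/detail/\d+\.html#'. Relative links still need to be resolved against the list page's URL before fetching.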

Detail pages can be parsed with regular expressions, or with SimpleXML or DOMDocument combined with XPath. When HTML fragments are incomplete or malformed, regular expressions may be more reliable; otherwise XPath is usually the more efficient way to write and maintain extraction rules.
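A minimal sketch of the XPath route, using DOMDocument and DOMXPath. The page structure is assumed for illustration: a single h1 for the title and body paragraphs inside a div with class "article-body". The function name extract_article() is hypothetical, and character-encoding handling is left out of the sketch.

```php
<?php
/** Extract a title and body paragraphs from a detail page using XPath. */
function extract_article(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser errors instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // string() of the first h1 in document order; empty string if there is none.
    $title = trim($xpath->evaluate('string((//h1)[1])'));

    $paragraphs = [];
    foreach ($xpath->query('//div[@class="article-body"]//p') as $p) {
        $text = trim($p->textContent);
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }

    return ['title' => $title, 'paragraphs' => $paragraphs];
}
```

Compared with the regex approach, the rules here are two short XPath expressions, so adapting the extractor to a different site mostly means changing those strings.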

Concurrent downloading can be achieved with curl_multi, as described in "PHP and MySQL High‑Performance Application Development".
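A minimal curl_multi sketch is shown below. It is not taken from the book mentioned above; fetch_all() is a hypothetical name, and the choices to return null for any response outside the 2xx range and to apply a 10-second timeout are assumptions.

```php
<?php
/**
 * Fetch several URLs concurrently with curl_multi.
 * Returns an array mapping each URL to its body, or to null on failure.
 */
function fetch_all(array $urls, int $timeoutSeconds = 10): array
{
    $urls = array_values(array_unique($urls));
    if ($urls === []) {
        return [];
    }

    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            // Wait for socket activity instead of spinning.
            curl_multi_select($multi, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 when the transfer failed
        $body = curl_multi_getcontent($ch);
        $results[$url] = ($code >= 200 && $code < 300) ? $body : null;
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);
    return $results;
}
```

In practice the URL list should be processed in small batches per host, so that concurrency does not conflict with the politeness requirement described earlier.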

Conclusion

The article presented crawler development strategies and practical techniques. By applying these methods, you can build a functional web crawler. Future posts will cover cookie/session handling, RSS crawling, update monitoring, deduplication, data cleaning, and crawler management.
