How to Build a Simple PHP Web Crawler: From Robots.txt to cURL
This guide explains the fundamentals of creating a PHP web crawler, covering server communication basics, interpreting robots.txt and sitemap files, and providing practical code examples using file_get_contents and cURL for efficient content retrieval.
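Just as people communicate using a common language, servers on the Internet converse through protocols such as HTTP, which runs over a transport protocol like TCP (UDP is another common transport), with sockets serving as the programming interface to that transport. A crawler therefore acts like a tiny browser that sends HTTP GET requests to retrieve web resources.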
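To make the "tiny browser" idea concrete, the sketch below builds the raw text of an HTTP/1.1 GET request for a URL. The function name buildGetRequest and the user agent string ExampleCrawler/0.1 are hypothetical, chosen only for illustration; a real crawler would normally let file_get_contents or cURL produce this text.

```php
<?php
// Hypothetical helper: builds the raw text of an HTTP/1.1 GET request.
// The default user agent string is a made-up example value.
function buildGetRequest(string $url, string $userAgent = 'ExampleCrawler/0.1'): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }

    // The request line carries the path and query string, not the full URL.
    $path = $parts['path'] ?? '/';
    if (isset($parts['query'])) {
        $path .= '?' . $parts['query'];
    }

    // Header lines end with CRLF; an empty line terminates the header block.
    return "GET $path HTTP/1.1\r\n"
        . "Host: {$parts['host']}\r\n"
        . "User-Agent: $userAgent\r\n"
        . "Connection: close\r\n"
        . "\r\n";
}
```

Calling buildGetRequest('http://example.com/a/b?x=1') yields a request whose first line is "GET /a/b?x=1 HTTP/1.1", which is essentially what a browser or crawler sends over the TCP connection.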
Before scraping a site, it is essential to understand its directory structure and to respect its robots.txt file, which tells crawlers which paths they may or may not fetch. The Sitemap, often referenced from robots.txt, lists the URLs that make up the site’s layout.
For example, QQ’s robots.txt is fully open: it allows all crawlers and provides a sitemap URL. Taobao’s robots.txt, by contrast, contains specific Allow and Disallow rules for different user agents and restricts access for unknown bots.
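A crawler can apply such rules before requesting a page. The following is a minimal sketch, not a complete robots.txt parser: the function name isPathAllowed is hypothetical, and it assumes plain prefix matching (no * or $ wildcards in paths), that the longest matching rule wins, and that Allow wins a tie.

```php
<?php
// Simplified robots.txt check (a sketch under the assumptions stated above).
function isPathAllowed(string $robotsTxt, string $userAgent, string $path): bool
{
    $groups = [];            // lowercase user-agent token => list of [type, prefix]
    $current = [];           // user-agent tokens of the group being read
    $readingAgents = false;  // true while consecutive User-agent lines are read

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if (!$readingAgents) {
                $current = []; // a new group starts here
            }
            $readingAgents = true;
            $agent = strtolower($value);
            $current[] = $agent;
            $groups[$agent] = $groups[$agent] ?? [];
        } elseif ($field === 'allow' || $field === 'disallow') {
            $readingAgents = false;
            foreach ($current as $agent) {
                $groups[$agent][] = [$field, $value];
            }
        }
    }

    // Use the groups that name this crawler; otherwise fall back to "*".
    $ua = strtolower($userAgent);
    $rules = [];
    $matched = false;
    foreach ($groups as $agent => $agentRules) {
        $agent = (string) $agent;
        if ($agent !== '' && $agent !== '*' && strpos($ua, $agent) !== false) {
            $rules = array_merge($rules, $agentRules);
            $matched = true;
        }
    }
    if (!$matched) {
        $rules = $groups['*'] ?? [];
    }

    // Longest matching prefix decides; with no matching rule the path is allowed.
    $best = ['allow', -1];
    foreach ($rules as [$type, $prefix]) {
        if ($prefix === '') {
            continue; // an empty Disallow means "no restriction"
        }
        if (strpos($path, $prefix) === 0) {
            $len = strlen($prefix);
            if ($len > $best[1] || ($len === $best[1] && $type === 'allow')) {
                $best = [$type, $len];
            }
        }
    }
    return $best[0] === 'allow';
}
```

The function takes the robots.txt text as a string, so it can be tried without any network access. Production crawlers would more likely rely on a dedicated, fully compliant parser.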
When building a crawler, you can use PHP’s built‑in functions such as fopen, file, and file_get_contents to fetch remote content, provided the allow_url_fopen directive is enabled in php.ini.
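The php.ini setting that controls remote access for these functions is allow_url_fopen. The sketch below checks that setting at runtime and builds a stream context so file_get_contents sends a user agent and gives up after a timeout. The helper names remoteFopenEnabled and buildHttpContext, the default timeout, and the user agent string are assumptions made for illustration.

```php
<?php
// Hypothetical helper: reports whether fopen-style functions may open URLs.
function remoteFopenEnabled(): bool
{
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

// Hypothetical helper: builds a stream context for file_get_contents().
// The default timeout and user agent are example values, not recommendations.
function buildHttpContext(int $timeoutSeconds = 10, string $userAgent = 'ExampleCrawler/0.1')
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds, // read timeout in seconds
            'user_agent' => $userAgent,      // sent as the User-Agent header
        ],
    ]);
}

// Usage (performs a real network request, so it is left commented out):
// if (remoteFopenEnabled()) {
//     $html = file_get_contents('http://example.com/', false, buildHttpContext());
// }
```

Passing the context as the third argument of file_get_contents keeps the rest of the crawler unchanged while avoiding requests that hang indefinitely.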
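Below is a lightweight crawler function that uses file_get_contents and DOMXPath to extract the text of the elements matched by an XPath expression:
<?php
// $url: page to fetch.
// $id:  XPath expression selecting the elements to extract, e.g. '//*[@id="price"]'.
function crawler($url, $id){
    $html = file_get_contents($url);
    if ($html === false || $html === '') {
        return false; // the request failed or returned nothing
    }
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings silently
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xPath = new DOMXPath($dom);
    $elements = $xPath->query($id);
    if ($elements === false) {
        return false; // malformed XPath expression
    }
    $data = "";
    foreach ($elements as $e) {
        $data .= $e->nodeValue;
    }
    // Strip commas from the extracted text.
    return str_replace(',', '', $data);
}
?>
For better performance and finer control over the request (timeouts, redirects, headers), PHP’s cURL extension can be used instead. The following code wraps cURL in a helper function and updates the crawler to use it: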
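<?php
// Fetch a URL with cURL; returns the response body, or false on failure.
function file_get_contents_curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, false);         // body only, no response headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow HTTP redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);    // example value: seconds to wait for the connection
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);           // example value: seconds allowed for the whole transfer
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
// $id is an XPath expression selecting the elements to extract.
function crawler($url, $id){
    $html = file_get_contents_curl($url);
    if ($html === false || $html === '') {
        return false; // the request failed or returned nothing
    }
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect HTML parse warnings instead of suppressing them with @
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xPath = new DOMXPath($dom);
    $elements = $xPath->query($id);
    if ($elements === false) {
        return false; // malformed XPath expression
    }
    $data = "";
    foreach ($elements as $e) {
        $data .= $e->nodeValue;
    }
    // Strip commas from the extracted text.
    return str_replace(',', '', $data);
}
?>
These functions provide a more robust, and often faster, way to fetch pages. As next steps, you can extend them to analyze a site’s technology stack, extract links, handle errors, manage delays between requests, and send cookies for a complete crawling solution.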
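As one illustration of those extensions, link extraction can reuse the same DOMDocument and DOMXPath approach. The sketch below is an assumption about how that step might look, not part of the original code: the function name extractLinks is hypothetical, and it returns href values exactly as written in the page, without resolving relative URLs.

```php
<?php
// Hypothetical helper: returns the unique href values of all <a> elements,
// in document order. Relative URLs are returned unresolved.
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }

    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xPath = new DOMXPath($dom);
    $links = [];
    foreach ($xPath->query('//a[@href]') as $a) {
        $href = trim($a->getAttribute('href'));
        // Skip empty links, in-page anchors, and javascript: pseudo-links.
        if ($href === '' || $href[0] === '#' || stripos($href, 'javascript:') === 0) {
            continue;
        }
        $links[] = $href;
    }
    return array_values(array_unique($links));
}
```

Feeding the returned links back into the fetch function, with a pause such as sleep(1) between requests, turns the single-page fetcher into a basic crawl loop.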
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
