
How to Build a Simple PHP Web Crawler: From Robots.txt to cURL

This guide explains the fundamentals of creating a PHP web crawler, covering server communication basics, interpreting robots.txt and sitemap files, and providing practical code examples using file_get_contents and cURL for efficient content retrieval.

21CTO

Nov 13, 2016

Just as people communicate using a common language, servers on the Internet converse through protocols such as TCP, UDP, and HTTP, which programs usually reach through sockets. A crawler acts like a tiny browser: it sends HTTP GET requests and reads back the web resources the server returns.
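To make the "tiny browser" idea concrete, here is a minimal sketch of what a GET request looks like on the wire. The helper names `build_get_request` and `raw_http_get`, the `SimplePhpCrawler/0.1` user agent, and the plain HTTP port 80 are assumptions chosen for illustration; they are not part of the original article.

```php
<?php
// Hypothetical helper: builds the text of a minimal HTTP/1.1 GET request.
function build_get_request(string $host, string $path = '/'): string
{
    return "GET {$path} HTTP/1.1\r\n"
         . "Host: {$host}\r\n"
         . "User-Agent: SimplePhpCrawler/0.1\r\n" // assumed crawler name
         . "Connection: close\r\n"
         . "\r\n";                                 // blank line ends the headers
}

// Hypothetical helper: sends the request over a TCP socket (port 80, no TLS)
// and returns the raw response, status line and headers included.
function raw_http_get(string $host, string $path = '/', int $timeout = 10): string
{
    $socket = @fsockopen($host, 80, $errno, $errstr, $timeout);
    if ($socket === false) {
        throw new RuntimeException("Connection failed: {$errstr} ({$errno})");
    }
    fwrite($socket, build_get_request($host, $path));
    $response = stream_get_contents($socket);
    fclose($socket);
    return $response === false ? '' : $response;
}
```

In practice you would rarely hand-write requests like this; the higher-level functions discussed later build the same kind of message for you.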

Before scraping a site, it is essential to understand its directory structure and to respect its robots.txt file, which tells search engine bots which paths are allowed or disallowed. The sitemap, which robots.txt often points to, lists the URLs the site wants crawled.
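As an illustration of reading a sitemap, the sketch below pulls the `<loc>` entries out of a sitemap XML string with SimpleXML. The function name `sitemap_urls` and the sample URLs are hypothetical. It assumes the sitemap declares the standard sitemaps.org namespace, so a sitemap without that namespace would return an empty list.

```php
<?php
// Hypothetical helper: returns the <loc> URLs listed in a sitemap XML string.
function sitemap_urls(string $xml): array
{
    $previous = libxml_use_internal_errors(true); // keep parse errors quiet
    $doc = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($doc === false) {
        return []; // not well-formed XML
    }

    // Sitemaps use a default namespace, so it must be registered for XPath.
    $doc->registerXPathNamespace('sm', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    $nodes = $doc->xpath('//sm:loc') ?: [];

    $urls = [];
    foreach ($nodes as $loc) {
        $urls[] = trim((string) $loc);
    }
    return $urls;
}

$sample = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        . '<url><loc>https://example.com/</loc></url>'
        . '<url><loc>https://example.com/about</loc></url>'
        . '</urlset>';
print_r(sitemap_urls($sample));
```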

For example, at the time of writing, QQ’s robots.txt is fully open: it allows all crawlers and provides a sitemap URL. Taobao’s robots.txt, by contrast, contains specific Allow and Disallow rules for different user agents and restricts access for unknown bots.
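The Allow and Disallow rules described above can be checked in code before a crawler requests a path. The following is a simplified sketch, not a complete robots.txt implementation: the names `parse_robots` and `is_allowed` are hypothetical, only plain path prefixes are matched (no `*` or `$` wildcards), and the longest matching prefix decides the outcome.

```php
<?php
// Hypothetical helper: parses robots.txt text into
// [lowercased user-agent => [[type, path], ...]].
function parse_robots(string $text): array
{
    $rules = [];
    $agents = [];
    $inRules = false; // true once the current group has listed a rule
    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            if ($inRules) { // a User-agent line after rules starts a new group
                $agents = [];
                $inRules = false;
            }
            $agent = strtolower($value);
            $agents[] = $agent;
            $rules[$agent] = $rules[$agent] ?? [];
        } elseif ($field === 'allow' || $field === 'disallow') {
            $inRules = true;
            foreach ($agents as $agent) {
                $rules[$agent][] = [$field, $value];
            }
        }
    }
    return $rules;
}

// Hypothetical helper: true if $agent may fetch $path under the parsed rules.
function is_allowed(array $rules, string $agent, string $path): bool
{
    $group = $rules[strtolower($agent)] ?? $rules['*'] ?? [];
    $verdict = true;  // no matching rule means the path is allowed
    $bestLength = -1;
    foreach ($group as [$type, $prefix]) {
        if ($prefix === '') {
            continue; // an empty Disallow restricts nothing
        }
        if (strpos($path, $prefix) === 0 && strlen($prefix) > $bestLength) {
            $bestLength = strlen($prefix);
            $verdict = ($type === 'allow');
        }
    }
    return $verdict;
}

$robots = "User-agent: *\nDisallow: /private/\nAllow: /private/public/\n\n"
        . "User-agent: BadBot\nDisallow: /\n";
$rules = parse_robots($robots);
var_dump(is_allowed($rules, 'MyCrawler', '/private/data'));     // bool(false)
var_dump(is_allowed($rules, 'MyCrawler', '/private/public/a')); // bool(true)
var_dump(is_allowed($rules, 'BadBot', '/index.html'));          // bool(false)
```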

When building a crawler, you can fetch remote content with PHP’s built‑in functions such as fopen, file, and file_get_contents. These accept URLs only when the allow_url_fopen setting is enabled in php.ini.
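As a sketch of how this might look in practice, the helper below checks `allow_url_fopen` and passes a stream context so the request has a timeout and a user agent. The function name `fetch_with_context`, the 10-second default, and the `SimplePhpCrawler/0.1` user agent are assumptions made for this example.

```php
<?php
// Hypothetical helper: fetches a URL with file_get_contents and explicit
// HTTP options. Returns null when the fetch is not possible or fails.
function fetch_with_context(string $url, int $timeout = 10): ?string
{
    // file_get_contents only accepts URLs when allow_url_fopen is on.
    if (!filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return null;
    }
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeout,               // seconds
            'user_agent' => 'SimplePhpCrawler/0.1', // assumed crawler name
        ],
    ]);
    // @ suppresses the warning PHP raises on failure; false is handled below.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}
```

Returning null instead of false keeps the caller's check simple: a string means a page was fetched, anything else means it was not.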

Below is a lightweight crawler function that uses file_get_contents and DOMXPath to extract the text of the elements matched by an XPath expression (passed as $id):

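<?php
/**
 * Fetches $url and returns the concatenated text of every node matched by
 * the XPath expression $id (for example '//*[@id="price"]'), commas removed.
 * Returns an empty string if the page cannot be fetched or nothing matches.
 */
function crawler($url, $id){
    $html = file_get_contents($url);
    if ($html === false || $html === "") {
        return "";
    }
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xPath = new DOMXPath($dom);
    $data = "";
    $elements = $xPath->query($id);
    if ($elements === false) {   // malformed XPath expression
        return "";
    }
    foreach($elements as $e){
        $data .= $e->nodeValue;
    }
    // Strip commas, e.g. thousands separators in numbers.
    return str_replace(",", "", $data);
}
?>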
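To see what the XPath argument looks like without touching the network, here is a small sketch that runs the same extraction steps on an HTML string. The function name `extract_text` and the sample markup are hypothetical and exist only for this example.

```php
<?php
// Hypothetical helper: the same extraction steps as crawler(), applied to
// an HTML string instead of a fetched URL.
function extract_text(string $html, string $xpath): string
{
    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($dom))->query($xpath);
    $text = '';
    if ($nodes !== false) {
        foreach ($nodes as $node) {
            $text .= $node->nodeValue;
        }
    }
    return str_replace(',', '', $text); // drop commas, as crawler() does
}

$html = '<html><body><span id="price">1,299</span>'
      . '<p class="note">in stock</p></body></html>';
echo extract_text($html, '//*[@id="price"]'), "\n";   // 1299
echo extract_text($html, '//p[@class="note"]'), "\n"; // in stock
```

Separating parsing from fetching in this way also makes the extraction logic easy to test against saved pages.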

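PHP’s cURL extension gives finer control over the request (timeouts, redirects, headers, cookies) and clearer error reporting than file_get_contents, and it is often faster as well. The following code wraps cURL in a helper function and updates the crawler to use it: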

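<?php
// Fetches $url with cURL; returns the response body, or false on failure.
function file_get_contents_curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);            // body only, no response headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);    // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);    // follow HTTP redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);   // illustrative value, in seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // illustrative value, in seconds
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

// Same as the file_get_contents version, but fetches the page through cURL.
function crawler($url, $id){
    $html = file_get_contents_curl($url);
    if ($html === false || $html === "") {
        return "";
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // @ silences warnings from invalid markup
    $xPath = new DOMXPath($dom);
    $data = "";
    $elements = $xPath->query($id);
    if ($elements === false) {   // malformed XPath expression
        return "";
    }
    foreach($elements as $e){
        $data .= $e->nodeValue;
    }
    // Strip commas, e.g. thousands separators in numbers.
    return str_replace(",", "", $data);
}
?>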

Together, these functions give you a simple and more controllable way to fetch pages. As next steps, you can extend them to analyze a site’s technology stack, extract links, handle errors, add delays between requests, and send cookies, building toward a complete crawling solution.
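As one example of such an extension, the sketch below collects the links on a page so a crawler can decide what to visit next. The function name `extract_links` is hypothetical, and the URL handling is deliberately simplified: absolute http(s) links are kept, root-relative paths are resolved against the page's origin, and every other form is skipped.

```php
<?php
// Hypothetical helper: collects unique http(s) links from an HTML string.
function extract_links(string $html, string $baseUrl): array
{
    $base = parse_url($baseUrl);
    if (!is_array($base) || empty($base['scheme']) || empty($base['host'])) {
        return []; // the base URL must be absolute
    }
    $origin = $base['scheme'] . '://' . $base['host'];

    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if (preg_match('#^https?://#i', $href)) {
            $links[$href] = true;            // already absolute
        } elseif (strpos($href, '/') === 0 && strpos($href, '//') !== 0) {
            $links[$origin . $href] = true;  // root-relative path
        }
        // Relative paths, mailto: links, and #fragments are skipped here.
    }
    return array_keys($links); // array keys remove duplicates
}

$page = '<a href="https://example.com/a">A</a><a href="/b">B</a>'
      . '<a href="/b">B again</a><a href="#top">Top</a>';
print_r(extract_links($page, 'https://example.com/index.html'));
```

When looping over the returned links, a call such as `sleep(1)` between requests is a simple way to avoid overloading the target server.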

