How to Write a Simple PHP Web Crawler
This guide shows how to build a basic PHP web crawler: fetch pages with cURL, parse the HTML with DOMDocument and XPath, and store the extracted data. It includes a complete example script plus reminders about legal and ethical considerations.
To write a web crawler in PHP, follow these steps:
1. Fetch the target page: send an HTTP/HTTPS request and retrieve the raw HTML. PHP's bundled cURL extension handles the requests, and the DOM extension handles the parsing.
2. Parse page content: once you have the HTML, use XPath queries or CSS selectors to locate the desired elements.
3. Save the data: after extracting the required information, store it in a database or file for later use.
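A note on step 2: PHP's DOM extension supports XPath natively but has no built-in CSS-selector engine, so CSS selectors are usually translated to XPath (or handled by a third-party library such as Symfony's CssSelector component). A minimal illustrative converter for the `.class` selector form only:

```php
<?php
// Convert a single CSS class selector into an equivalent XPath
// expression. Illustrative only -- it handles just the ".class"
// form, not full CSS selector syntax.
function classSelectorToXpath(string $class): string
{
    // Pad @class with spaces so "content" does not match "content-extra".
    return sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
}

echo classSelectorToXpath('content') . "\n";
// //*[contains(concat(" ", normalize-space(@class), " "), " content ")]
```

The resulting expression can be passed directly to `DOMXPath::query()`, as the example below does with a hand-written XPath query.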
Below is a simple PHP crawler example:
<code><?php
// Target website URL
$url = 'https://www.example.com';

// Initialize cURL and set the URL and request options
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// Send the request and capture the response body
$html = curl_exec($ch);
if ($html === false) {
    die('Request failed: ' . curl_error($ch));
}

// Release the cURL handle
curl_close($ch);

// Create a DOM object and load the HTML content
// (collect libxml errors instead of emitting warnings,
// since real-world HTML is often malformed)
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Create an XPath query object
$xpath = new DOMXPath($dom);

// Query nodes with XPath
$nodes = $xpath->query('//div[@class="content"]');

// Print the text content of each matching node
foreach ($nodes as $node) {
    echo $node->textContent . "\n";
}
?></code>
The above is a minimal example; real-world crawlers need adjustments and optimizations based on specific requirements.
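The example prints the extracted text but does not show step 3 (saving the data). A minimal sketch of writing results to a CSV file, assuming the extracted strings have been collected into an array (the `results.csv` filename is hypothetical):

```php
<?php
// Hypothetical extracted data; in the crawler this would be
// collected inside the foreach loop over $nodes.
$rows = [
    ['https://www.example.com', 'First content block'],
    ['https://www.example.com', 'Second content block'],
];

// Open the output file and write one CSV line per record.
$fh = fopen('results.csv', 'w');
fputcsv($fh, ['url', 'content']); // header row
foreach ($rows as $row) {
    fputcsv($fh, $row);
}
fclose($fh);

echo "Wrote " . count($rows) . " rows\n";
```

For larger crawls, a database (e.g. via PDO) is usually a better fit than a flat file, since it allows deduplication and incremental updates.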
Note that developing crawlers must comply with laws and respect a site's robots.txt policy, avoiding excessive load on the target server or infringing user privacy.
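A minimal sketch of one such courtesy check: fetching and consulting robots.txt before crawling a path. This is illustrative only; a real crawler should use a full robots.txt parser that handles user-agent groups, Allow rules, and wildcards:

```php
<?php
// Check whether a path is blocked by a Disallow rule. Simplified:
// it ignores user-agent grouping and Allow rules, and matches
// rules as plain prefixes.
function isPathDisallowed(string $robotsTxt, string $path): bool
{
    foreach (explode("\n", $robotsTxt) as $line) {
        $line = trim($line);
        if (stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, strlen('Disallow:')));
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return true;
            }
        }
    }
    return false;
}

// Example rules; a real crawler would fetch these from
// https://www.example.com/robots.txt first.
$robots = "User-agent: *\nDisallow: /private/\n";

var_dump(isPathDisallowed($robots, '/private/page.html')); // bool(true)
var_dump(isPathDisallowed($robots, '/public/page.html'));  // bool(false)

// Be polite: pause between requests to avoid overloading the server.
sleep(1);
```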