How to Configure a PHP Web Scraper: Essential Settings Explained
This article presents a complete PHP configuration file for a flexible web crawler, detailing database connection parameters, target site URLs, pagination controls, content extraction patterns, and optional text filters to help developers quickly set up and customize their scraping projects.
The full configuration file follows; adapt the values to your own database and to the site you want to scrape.
<code><?php
/**
 * A flexibly configurable web crawler
 * Author: Rain
 * Created: 2015-02-03 15:17:30
 * Version: V1.0
 */
// Database connection settings -- adjust these to match your own database
define('DB_HOST', 'localhost');
define('DB_USER', 'root');
define('DB_PWD', 'test123456');
define('DB_NAME', 'test_dbname');
define('DB_CHARSET', 'utf8');
define('TABLE_NAME', 'tb_book');
// Target-site settings -- adjust these to the site you want to scrape
define('WEB_CHARSET', 'gbk');
// The variable part of the list URL; use %d as a placeholder (numeric values only)
define('WEB_LIST_URL', 'http://www.pcbookcn.com/book/1_%d.htm');
// Total number of list pages to crawl
define('PAGE_COUNT', 14);
// Page number to start crawling from
define('PAGE_START', 1);
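// Illustrative note (not part of the original script): with the settings above,
// the crawler would visit sprintf(WEB_LIST_URL, 1), i.e.
// http://www.pcbookcn.com/book/1_1.htm, then pages 2 through PAGE_COUNT.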
// Content-page URL as a regex; must include the / delimiters, e.g.: /\/xuefu2008\/article\/details\/(\d)+/i
define('WEB_CONTENT_URL_REG', '/\/book\/(\d)+\.htm/i');
// Site host, without a trailing /, e.g.: http://blog.csdn.net
define('WEB_HOST', 'http://www.pcbookcn.com');
// Regex that roughly locates the content module within a list page
define('WEB_LIST_POSTION', '/book_name\.gif(.*?)/');
// Fine-tuning parameters; leaving these unchanged usually works fine
define('SLEEP_TIME', 1);   // seconds to pause between requests
define('IS_DEBUG', false); // enable verbose debug output
define('INSERT_DB', true); // write scraped records to the database
// Output speed for content, in seconds
define('OUTPUT_SPEED', 1);
// Text fragments to strip out of the scraped content (case-insensitive); adjust to the site being scraped
$text_filter = array(
'- 中华电脑书库' => '',
'_电脑电子书' => '',
'_电脑书籍' => '',
'下载' => '',
);
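// Illustrative note (not in the original script): these replacements can be
// applied case-insensitively in a single call, e.g.:
//   $clean = str_ireplace(array_keys($text_filter), array_values($text_filter), $raw);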
// Mapping from database columns to extraction patterns
$table_mapping = array(
'size' => '/软件大小.*?000000>(.*?)<\/font>/i',
'logo' => 'http://www.94cto.com/index/uploads/images/20150105/0b8461910de101cc51a07684cdab797e.jpg',
'field1' => '/'
);
</code>
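The constants above are consumed by the crawler script itself, which this article does not show. As a rough sketch of how the list-page loop might use them (the helper names below are hypothetical, and the fallback define() calls only keep the snippet self-contained):
<code><?php
// Minimal sketch, assuming the config file above has been included;
// these fallback defines make the snippet runnable on its own.
if (!defined('WEB_LIST_URL'))        define('WEB_LIST_URL', 'http://www.pcbookcn.com/book/1_%d.htm');
if (!defined('WEB_CONTENT_URL_REG')) define('WEB_CONTENT_URL_REG', '/\/book\/(\d)+\.htm/i');
if (!defined('WEB_HOST'))            define('WEB_HOST', 'http://www.pcbookcn.com');

// Build the URL of one list page from the %d template.
function list_page_url($page) {
    return sprintf(WEB_LIST_URL, $page);
}

// Extract absolute content-page URLs from a list page's HTML.
function extract_content_urls($html) {
    $urls = array();
    if (preg_match_all(WEB_CONTENT_URL_REG, $html, $matches)) {
        foreach (array_unique($matches[0]) as $path) {
            $urls[] = WEB_HOST . $path; // prepend the host to the relative path
        }
    }
    return $urls;
}
</code>
A full crawler would then loop $page from PAGE_START to PAGE_COUNT, download each list_page_url($page), pass the HTML to extract_content_urls(), and sleep(SLEEP_TIME) between requests before fetching and parsing each content page.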