How to Scrape 1.1 Million Zhihu Users with PHP cURL, Multi‑Threading, and Redis
This tutorial walks through collecting over a million Zhihu user profiles using PHP on Ubuntu, handling cookies, bypassing image hot‑link protection, scaling requests with curl_multi, de‑duplicating MySQL inserts, and coordinating work with Redis and multi‑process pcntl for efficient large‑scale web scraping.
Preparation
Install Ubuntu 14.04 in a VMWare VM, then install PHP 5.6+ together with the curl and pcntl extensions.
Fetching pages with PHP cURL
Use the cURL extension to request a logged‑in Zhihu user page by sending the appropriate cookie string (e.g., __utma=?;__utmb=?;). Example code:
$url = 'http://www.zhihu.com/people/mora-hu/about';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_COOKIE, $this->config_arr['user_cookie']);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$result = curl_exec($ch);
return $result;After retrieving the HTML, apply regular expressions to extract name, gender, etc.
Bypassing image hot‑link protection
Zhihu blocks direct image access; add a Referer: http://www.zhihu.com header when requesting avatars.
function getImg($url, $u_id)
{
if (file_exists('./images/' . $u_id . ".jpg")) {
return "images/$u_id.jpg";
}
if (empty($url)) { return ''; }
$context_options = array(
'http' => array(
'header' => "Referer:http://www.zhihu.com"
)
);
$context = stream_context_create($context_options);
$img = file_get_contents('http:' . $url, FALSE, $context);
file_put_contents('./images/' . $u_id . ".jpg", $img);
return "images/$u_id.jpg";
}Crawling followers and followees
Parse the user’s profile page to obtain the URLs for the "following" and "followers" lists, then request each list with the same cookie and repeat the process recursively.
Counting files on Linux
When the script runs for a long time, use the following command to count downloaded images:
ls -l | grep "^-" | wc -lHandling duplicate MySQL records
Before inserting, check if the record exists, or use unique indexes with statements such as:
INSERT INTO … ON DUPLICATE KEY UPDATE … INSERT IGNORE INTO … REPLACE INTO …Using curl_multi for I/O multiplexing
Instead of a single cURL request per process, create a multi‑handle to fetch many URLs concurrently. Example:
$mh = curl_multi_init();
for ($i = 0; $i < $max_size; $i++) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_URL, 'http://www.zhihu.com/people/' . $user_list[$i] . '/about');
curl_setopt($ch, CURLOPT_COOKIE, self::$user_cookie);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$requestMap[$i] = $ch;
curl_multi_add_handle($mh, $ch);
}
// ... handle responses, add new handles as needed ...
curl_multi_close($mh);Dealing with HTTP 429
When sending many concurrent requests, Zhihu returns 429 (Too Many Requests). Reducing the batch size to about five requests eliminates the error.
Storing visited users in Redis
Use Redis queues to keep track of already‑processed users and pending users, enabling the crawler to resume across runs.
<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->set('tmp', 'value');
if ($redis->exists('tmp')) {
echo $redis->get('tmp') . "
";
}
?>Multi‑process with PHP pcntl
Fork multiple child processes to parallelise crawling. Example:
// PHP multi‑process demo
for ($i = 0; $i < 10; $i++) {
$pid = pcntl_fork();
if ($pid == -1) { echo "Could not fork!
"; exit(1); }
if (!$pid) {
echo "child process $i running
";
exit($i);
}
}
while (pcntl_waitpid(0, $status) != -1) {
$status = pcntl_wexitstatus($status);
echo "Child $status completed
";
}Checking CPU information on Linux
Use cat /proc/cpuinfo to view model name and core count, which helps decide the optimal number of processes.
Redis/MySQL connection issues in forked processes
Child processes inherit the parent’s open connections, causing conflicts. Bind Redis instances to the process ID:
<?php
public static function getInstance() {
static $instances = array();
$key = getmypid();
if (empty($instances[$key])) {
$instances[$key] = new self();
}
return $instances[$key];
}
?>Measuring script execution time
Utility function to compute elapsed time:
function microtime_float() {
list($u_sec, $sec) = explode(' ', microtime());
return (floatval($u_sec) + floatval($sec));
}
$start_time = microtime_float();
// do something
usleep(100);
$end_time = microtime_float();
$total_time = $end_time - $start_time;
$time_cost = sprintf("%.10f", $total_time);
echo "program cost total " . $time_cost . "s
";The complete source code is available at https://github.com/hhqcontinue/zhihuSpider .
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
