How to Scrape 1.1 Million Zhihu Users with PHP cURL, Multi‑Threading, and Redis

This tutorial walks through collecting over a million Zhihu user profiles using PHP on Ubuntu, handling cookies, bypassing image hot‑link protection, scaling requests with curl_multi, de‑duplicating MySQL inserts, and coordinating work with Redis and multi‑process pcntl for efficient large‑scale web scraping.

21CTO
21CTO
21CTO
How to Scrape 1.1 Million Zhihu Users with PHP cURL, Multi‑Threading, and Redis

Preparation

Install Ubuntu 14.04 in a VMWare VM, then install PHP 5.6+ together with the curl and pcntl extensions.

Fetching pages with PHP cURL

Use the cURL extension to request a logged‑in Zhihu user page by sending the appropriate cookie string (e.g., __utma=?;__utmb=?;). Example code:

$url = 'http://www.zhihu.com/people/mora-hu/about';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_COOKIE, $this->config_arr['user_cookie']);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$result = curl_exec($ch);
return $result;

After retrieving the HTML, apply regular expressions to extract name, gender, etc.

Bypassing image hot‑link protection

Zhihu blocks direct image access; add a Referer: http://www.zhihu.com header when requesting avatars.

function getImg($url, $u_id)
{
    if (file_exists('./images/' . $u_id . ".jpg")) {
        return "images/$u_id.jpg";
    }
    if (empty($url)) { return ''; }
    $context_options = array(
        'http' => array(
            'header' => "Referer:http://www.zhihu.com"
        )
    );
    $context = stream_context_create($context_options);
    $img = file_get_contents('http:' . $url, FALSE, $context);
    file_put_contents('./images/' . $u_id . ".jpg", $img);
    return "images/$u_id.jpg";
}

Crawling followers and followees

Parse the user’s profile page to obtain the URLs for the "following" and "followers" lists, then request each list with the same cookie and repeat the process recursively.

Counting files on Linux

When the script runs for a long time, use the following command to count downloaded images:

ls -l | grep "^-" | wc -l

Handling duplicate MySQL records

Before inserting, check if the record exists, or use unique indexes with statements such as:

INSERT INTO … ON DUPLICATE KEY UPDATE …
INSERT IGNORE INTO …
REPLACE INTO …

Using curl_multi for I/O multiplexing

Instead of a single cURL request per process, create a multi‑handle to fetch many URLs concurrently. Example:

$mh = curl_multi_init();
for ($i = 0; $i < $max_size; $i++) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_URL, 'http://www.zhihu.com/people/' . $user_list[$i] . '/about');
    curl_setopt($ch, CURLOPT_COOKIE, self::$user_cookie);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $requestMap[$i] = $ch;
    curl_multi_add_handle($mh, $ch);
}
// ... handle responses, add new handles as needed ...
curl_multi_close($mh);

Dealing with HTTP 429

When sending many concurrent requests, Zhihu returns 429 (Too Many Requests). Reducing the batch size to about five requests eliminates the error.

Storing visited users in Redis

Use Redis queues to keep track of already‑processed users and pending users, enabling the crawler to resume across runs.

<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->set('tmp', 'value');
if ($redis->exists('tmp')) {
    echo $redis->get('tmp') . "
";
}
?>

Multi‑process with PHP pcntl

Fork multiple child processes to parallelise crawling. Example:

// PHP multi‑process demo
for ($i = 0; $i < 10; $i++) {
    $pid = pcntl_fork();
    if ($pid == -1) { echo "Could not fork!
"; exit(1); }
    if (!$pid) {
        echo "child process $i running
";
        exit($i);
    }
}
while (pcntl_waitpid(0, $status) != -1) {
    $status = pcntl_wexitstatus($status);
    echo "Child $status completed
";
}

Checking CPU information on Linux

Use cat /proc/cpuinfo to view model name and core count, which helps decide the optimal number of processes.

Redis/MySQL connection issues in forked processes

Child processes inherit the parent’s open connections, causing conflicts. Bind Redis instances to the process ID:

<?php
public static function getInstance() {
    static $instances = array();
    $key = getmypid();
    if (empty($instances[$key])) {
        $instances[$key] = new self();
    }
    return $instances[$key];
}
?>

Measuring script execution time

Utility function to compute elapsed time:

function microtime_float() {
    list($u_sec, $sec) = explode(' ', microtime());
    return (floatval($u_sec) + floatval($sec));
}
$start_time = microtime_float();
// do something
usleep(100);
$end_time = microtime_float();
$total_time = $end_time - $start_time;
$time_cost = sprintf("%.10f", $total_time);
echo "program cost total " . $time_cost . "s
";

The complete source code is available at https://github.com/hhqcontinue/zhihuSpider .

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

RedisLinuxmysqlPHPcURLWeb ScrapingMulti‑processing
21CTO
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.