
How to Build a PHP cURL Spider to Scrape Zhihu User Data and Visualize It

This article walks through using PHP's cURL extension to crawl tens of thousands of Zhihu user profiles, parse the HTML with regular expressions, store the extracted data efficiently, and present the results with responsive charts and dashboards.

21CTO

Apr 12, 2016


Background

As an experiment, the author used PHP's cURL extension to build a spider that crawled the basic profile information of 50,000 Zhihu users and ran a simple analysis on the results. The demo is available at the provided URL.

Workflow Overview

Crawl Zhihu pages with cURL.

Parse the HTML using regular expressions.

Store data in a database and deploy the program.

Analyze and display the results.
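
The four steps above can be tied together by a small driver loop. The following is a minimal sketch, not the author's actual code: the function name crawl and the $fetch and $parse callbacks are hypothetical stand-ins for the cURL and regex steps described later, and the queue is kept in memory for simplicity.

```php
<?php
// Hypothetical driver loop. $fetch(username) returns HTML or false;
// $parse(html) returns array(profile, linkedUsernames).
function crawl(array $seeds, $fetch, $parse, $limit)
{
    $queue = $seeds;                        // usernames waiting to be fetched
    $seen = array_fill_keys($seeds, true);  // usernames already queued
    $profiles = [];

    while ($queue && count($profiles) < $limit) {
        $username = array_shift($queue);
        $html = call_user_func($fetch, $username);
        if ($html === false) {
            continue; // skip pages that failed to download
        }
        list($profile, $links) = call_user_func($parse, $html);
        $profiles[$username] = $profile;

        // Queue each newly discovered user exactly once.
        foreach ($links as $next) {
            if (!isset($seen[$next])) {
                $seen[$next] = true;
                $queue[] = $next;
            }
        }
    }
    return $profiles;
}
```

Tracking seen usernames separately from the queue keeps the spider from revisiting users who follow each other, which would otherwise make the crawl loop forever.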

cURL Crawling

PHP's cURL library can send HTTP requests with custom headers and cookies. To fetch a user's profile page, the spider sends the session cookie of a logged-in Zhihu account with each request.

// Log in to Zhihu, open the browser console, and read the cookie
document.cookie

Example function to fetch a user page:

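// Method of the spider class: download one user's profile page to disk.
public function spiderUser($username)
{
    $cookie = "xxxx"; // paste the document.cookie value copied from the browser
    $url_info = 'http://www.zhihu.com/people/' . $username;
    // $_SERVER['HTTP_USER_AGENT'] is not set when PHP runs from the command line,
    // so fall back to a fixed browser-like string (placeholder value).
    $user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'Mozilla/5.0';
    $ch = curl_init($url_info);
    curl_setopt($ch, CURLOPT_HEADER, 0);          // body only, no response headers
    curl_setopt($ch, CURLOPT_COOKIE, $cookie);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the page instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);  // follow redirects
    $result = curl_exec($ch);
    curl_close($ch);
    if ($result === false) {
        return false; // request failed, nothing to save
    }
    file_put_contents('/home/cuixiaohuan/php/zhihu_spider/file/'.$username.'.html',$result);
    return true;
}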

Regular‑Expression Parsing

After downloading the HTML, the spider extracts new user links, avatars, names, bios, gender, location, employment, position, education, topics, followers, and view counts using regex patterns such as:

preg_match_all('/\/people\/([\w-]+)"/i', $str, $match_arr);
self::$newUserArr = array_unique(array_merge($match_arr[1], self::$newUserArr));

Similar patterns are used for each field (avatar, name, bio, gender, etc.).
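
As an illustration of the per-field patterns, here is a minimal sketch that runs against a made-up HTML fragment. The class names in the sample markup, the extractField helper, and the field patterns are all assumptions for demonstration; Zhihu's real markup differs and changes over time, so the actual patterns would need to be written against the downloaded pages.

```php
<?php
// Hypothetical markup: the class names below are placeholders, not Zhihu's real HTML.
$html = '<a href="/people/alice-w">A</a>'
      . '<a href="/people/bob_99">B</a>'
      . '<a href="/people/alice-w">A again</a>'
      . '<span class="name">Example User</span>'
      . '<span class="location" title="Beijing"></span>';

// Collect every distinct username linked from the page (pattern from the article).
preg_match_all('/\/people\/([\w-]+)"/i', $html, $match_arr);
$newUsers = array_values(array_unique($match_arr[1]));

// Return the first capture group of $pattern, or null when the field is absent.
function extractField($pattern, $html)
{
    return preg_match($pattern, $html, $m) ? trim($m[1]) : null;
}

$profile = [
    'name'     => extractField('/<span class="name">([^<]*)<\/span>/', $html),
    'location' => extractField('/<span class="location" title="([^"]*)"/', $html),
    'employer' => extractField('/<span class="employment" title="([^"]*)"/', $html),
];
```

Returning null for a missing field, rather than an empty string, lets the storage step distinguish "not filled in" from "filled in but blank".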

Data Storage and Optimization

For large‑scale crawling, write the data to Redis first, or at least batch the MySQL inserts so that each statement carries many rows. Add indexes only after crawling finishes, since maintaining them slows down every insert. Example of a bulk insert:

INSERT INTO yourtable VALUES (1,2), (5,5), ...;

A simple Bash script kills any running spider process and starts a fresh one, so running it again brings a crashed spider back:

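#!/bin/bash
# Kill any running spider process; grep -v grep keeps the grep command itself out of the list.
ps aux | grep spider | grep -v grep | awk '{print $2}' | xargs kill -9
sleep 5s
# Start a fresh spider in the background that keeps running after the shell exits.
nohup /home/cuixiaohuan/lamp/php5/bin/php /home/cuixiaohuan/php/zhihu_spider/spider_new.php &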
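
Returning to the batching advice in this section, a multi-row INSERT can be assembled from buffered rows and run through a prepared statement. This is a minimal sketch under assumptions: buildBulkInsert, the table name zhihu_user, and the column names are hypothetical, and the article does not show the author's actual schema or insert code.

```php
<?php
// Build one multi-row INSERT with placeholders for use with a prepared statement.
// $table and $columns must come from trusted code, never from scraped input.
function buildBulkInsert($table, array $columns, array $rows)
{
    if (!$columns || !$rows) {
        throw new InvalidArgumentException('columns and rows must not be empty');
    }
    $rowPlaceholder = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    $sql = 'INSERT INTO ' . $table . ' (' . implode(', ', $columns) . ') VALUES '
         . implode(', ', array_fill(0, count($rows), $rowPlaceholder));

    // Flatten the rows into one parameter list, in column order.
    $params = [];
    foreach ($rows as $row) {
        foreach ($columns as $col) {
            $params[] = isset($row[$col]) ? $row[$col] : null;
        }
    }
    return [$sql, $params];
}

// Hypothetical table and column names.
list($sql, $params) = buildBulkInsert('zhihu_user', ['username', 'followers'], [
    ['username' => 'alice', 'followers' => 10],
    ['username' => 'bob',   'followers' => 25],
]);
// With a PDO connection in $pdo: $pdo->prepare($sql)->execute($params);
```

Using placeholders instead of concatenating values keeps scraped text, which can contain quotes, from breaking or injecting into the statement.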

Data Presentation

The results are visualized with ECharts 3.0 and a responsive layout that works on mobile devices. Example CSS for responsive divs is included.
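
On the server side, the crawled rows have to be aggregated into the shape the chart expects. The sketch below assumes a pie chart of one categorical field; the function name pieData and the sample rows are hypothetical. It produces the list of name/value objects that an ECharts pie series accepts as its data.

```php
<?php
// Count how often each value of $field occurs and return
// [{name, value}, ...] ordered from largest to smallest slice.
function pieData(array $rows, $field)
{
    $counts = [];
    foreach ($rows as $row) {
        $key = (isset($row[$field]) && $row[$field] !== '') ? $row[$field] : 'unknown';
        $counts[$key] = isset($counts[$key]) ? $counts[$key] + 1 : 1;
    }
    arsort($counts); // largest slice first
    $data = [];
    foreach ($counts as $name => $value) {
        $data[] = ['name' => (string) $name, 'value' => $value];
    }
    return $data;
}

// Hypothetical sample rows standing in for database results.
$rows = [
    ['gender' => 'male'], ['gender' => 'female'], ['gender' => 'male'],
    ['gender' => ''],     ['gender' => 'female'], ['gender' => 'male'],
];
echo json_encode(pieData($rows, 'gender'));
// prints [{"name":"male","value":3},{"name":"female","value":2},{"name":"unknown","value":1}]
```

The page script can then fetch this JSON and pass it as the data of a series with type 'pie'.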

Limitations and Future Work

Use multi‑cURL for parallel requests.

Further optimize regex patterns.

Integrate Redis for faster storage.

Improve mobile‑friendly layout.

Modularize JavaScript and adopt SASS for CSS.
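
For the first item in the list above, PHP's curl_multi functions can run several transfers at once. The following is a minimal sketch of that idea, not code from the original project: the function name fetchAll, the 30-second timeout, and the return shape are assumptions.

```php
<?php
// Fetch several URLs concurrently; returns url => body, or false for a failed transfer.
function fetchAll(array $urls, $cookie = '')
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30); // assumed per-request limit
        if ($cookie !== '') {
            curl_setopt($ch, CURLOPT_COOKIE, $cookie);
        }
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running && curl_multi_select($mh, 1.0) === -1) {
            usleep(1000); // select failed; avoid a busy loop
        }
    } while ($running && $status === CURLM_OK);

    // Collect the handles whose transfer ended with an error.
    $failed = [];
    while ($info = curl_multi_info_read($mh)) {
        if ($info['result'] !== CURLE_OK) {
            $failed[] = $info['handle'];
        }
    }

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = in_array($ch, $failed, true) ? false : curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

In practice the URL list would be fed in small batches, since opening thousands of connections at once is likely to trigger rate limiting on the target site.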
