How to Build a PHP cURL Spider to Scrape Zhihu User Data and Visualize It
This article walks through using PHP's cURL extension to crawl tens of thousands of Zhihu user profiles, parse the HTML with regular expressions, store the extracted data efficiently, and present the results with responsive charts and dashboards.
Background
Using PHP's cURL extension, the author built an experimental spider that crawled the basic profile information of 50,000 Zhihu users and presents a simple analysis of the results. The demo is available at the provided URL.
Workflow Overview
Crawl Zhihu pages with cURL.
Parse the HTML using regular expressions.
Store data in a database and deploy the program.
Analyze and display the results.
cURL Crawling
PHP's cURL library can send HTTP requests with custom headers and cookies. To fetch a user's profile page, the spider must send the cookie of a logged‑in Zhihu session, which can be copied from the browser:
// Log in to Zhihu, open the browser console, and read the cookie
document.cookie
Example function to fetch a user page:
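public function spiderUser($username)
{
    // Cookie string copied from a logged-in browser session
    $cookie = "xxxx";
    $url_info = 'http://www.zhihu.com/people/' . $username;
    $ch = curl_init($url_info);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_COOKIE, $cookie);
    // HTTP_USER_AGENT is not set when the script runs from the command line,
    // so fall back to a placeholder string in that case
    $userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'Mozilla/5.0';
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    if ($result === false) {
        // The request failed, so there is nothing to save
        return false;
    }
    // Save the raw HTML for the parsing step
    file_put_contents('/home/cuixiaohuan/php/zhihu_spider/file/' . $username . '.html', $result);
    return true;
}
Regular‑Expression Parsing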
After downloading the HTML, the spider extracts new user links, avatars, names, bios, gender, location, employment, position, education, topics, followers, and view counts using regex patterns such as:
preg_match_all('/\/people\/([\w-]+)"/i', $str, $match_arr);
self::$newUserArr = array_unique(array_merge($match_arr[1], self::$newUserArr));
Similar patterns are used for each field (avatar, name, bio, gender, etc.).
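To see what the link pattern above captures, here is a minimal standalone sketch. The function name `extractUsernames` and the sample markup are made up for illustration and are not taken from the original spider; only the regular expression and the merge-then-deduplicate step come from the article.

```php
<?php
// Hypothetical helper (not from the original spider): collect unique
// usernames from profile links found in a page.
function extractUsernames($html, array $known = [])
{
    // Same pattern as above: capture the slug between /people/ and the closing quote.
    preg_match_all('/\/people\/([\w-]+)"/i', $html, $matches);

    // Merge with the names already queued, drop duplicates, and reindex from 0.
    return array_values(array_unique(array_merge($matches[1], $known)));
}

// Made-up sample markup, not real Zhihu HTML.
$sample = '<a href="/people/alice-wang">A</a> '
        . '<a href="/people/bob_42">B</a> '
        . '<a href="/people/alice-wang">A again</a>';

$users = extractUsernames($sample, ['carol']);
print_r($users);
```

The repeated link to `alice-wang` is collapsed into one entry, so each profile is queued for crawling only once.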
Data Storage and Optimization
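For large‑scale crawling, it is recommended to write data to Redis first, or at least to batch the MySQL inserts so that many rows go into a single statement. Indexes should be added only after crawling finishes, because maintaining them slows down every insert. Example of a bulk insert:
INSERT INTO yourtable VALUES (1,2), (5,5), ...;
A simple Bash watchdog kills any running spider process and starts the spider again, which also brings it back after a crash: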
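#!/bin/bash
# Kill any running spider processes (excluding the grep process itself)
ps aux | grep spider | grep -v grep | awk '{print $2}' | xargs kill -9
# Give the old processes time to exit
sleep 5s
# Start the spider in the background, detached from the terminal
nohup /home/cuixiaohuan/lamp/php5/bin/php /home/cuixiaohuan/php/zhihu_spider/spider_new.php &
Data Presentation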
The results are visualized with ECharts 3.0 and a responsive layout that works on mobile devices. Example CSS for responsive divs is included.
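The article does not show how the crawled rows reach the charts. As a rough sketch of one possible approach, the PHP side could tally a field and emit JSON in the `{name, value}` item shape that an ECharts pie series accepts as its `data`. The helper name `buildPieData` and the sample rows are assumptions made for this example.

```php
<?php
// Hypothetical helper: count how often each value of $field occurs and
// return a list of ['name' => ..., 'value' => ...] items for a pie chart.
function buildPieData(array $users, $field)
{
    // array_column() requires PHP 5.5 or later.
    $counts = array_count_values(array_column($users, $field));
    arsort($counts); // largest slice first

    $data = [];
    foreach ($counts as $name => $value) {
        $data[] = ['name' => (string) $name, 'value' => $value];
    }
    return $data;
}

// Made-up rows standing in for crawled user records.
$rows = [
    ['gender' => 'male'],
    ['gender' => 'female'],
    ['gender' => 'male'],
];

$json = json_encode(buildPieData($rows, 'gender'));
echo $json, "\n";
```

On the page, the decoded array would be assigned to the `data` property of a series whose `type` is `'pie'`.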
Limitations and Future Work
Use multi‑cURL for parallel requests.
Further optimize regex patterns.
Integrate Redis for faster storage.
Improve mobile‑friendly layout.
Modularize JavaScript and adopt SASS for CSS.
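As an illustration of the first item, parallel requests with PHP's `curl_multi_*` functions might look like the sketch below. This is not the author's implementation: the function name `fetchAll` and its parameters are assumptions, and error handling is kept to a minimum.

```php
<?php
// Hypothetical helper: fetch several URLs in parallel and return their
// bodies, keyed the same way as the input array.
function fetchAll(array $urls, $cookie = '')
{
    $mh = curl_multi_init();
    $handles = [];

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        if ($cookie !== '') {
            curl_setopt($ch, CURLOPT_COOKIE, $cookie);
        }
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0) {
            // Block for up to one second until a transfer has activity.
            curl_multi_select($mh, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect each response body and detach the handle from the multi handle.
    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
    }
    return $results;
}
```

A call such as `fetchAll(['alice' => 'http://www.zhihu.com/people/alice'], $cookie)` would return the page bodies under the same keys, where `alice` is a placeholder username. Batches should stay small enough that the target site is not flooded with simultaneous requests.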
21CTO
