How to Build a Zhihu Crawler with Python, ELK, and Visual Analytics

This article walks through creating a Python-based Zhihu web crawler, detailing the tech stack, data collection, visualization of user demographics and top contributors, the crawler architecture, authorization handling, and suggestions for performance and storage improvements.

MaGe Linux Operations

Introduction

Zhihu is a popular Chinese Q&A community built with Python, making it an attractive target for web‑scraping experiments. This guide showcases a Python‑based crawler that extracts user data from Zhihu and visualizes it using the ELK stack.

Technology Stack

Crawler: Python 2.7, requests, json, BeautifulSoup4, time

Analysis tools: ELK suite (Elasticsearch, Logstash, Kibana)

IDE: PyCharm

Data Collected

The crawler retrieves partial user information from Zhihu.

Simple Visual Analysis

1. Gender Distribution

Green indicates male, red indicates female, and gray indicates unknown. The chart shows a predominance of male users.
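The same breakdown can be checked directly on the crawled records before they reach Kibana. The following is a minimal sketch, assuming Python 3 and assuming each record is a dict with a numeric gender field where 1 means male and 0 means female; the field name, the encoding, and the sample records are illustrative assumptions, not details given in the article.

```python
from collections import Counter

# Assumed encoding of the gender field; the article does not specify it.
GENDER_LABELS = {1: "male", 0: "female"}


def gender_distribution(users):
    """Count users per gender label; missing or other values become 'unknown'."""
    return Counter(GENDER_LABELS.get(user.get("gender"), "unknown") for user in users)


# Made-up sample records, only to show the shape of the input.
sample_users = [
    {"name": "user_a", "gender": 1},
    {"name": "user_b", "gender": 0},
    {"name": "user_c", "gender": 1},
    {"name": "user_d"},  # gender missing
]
distribution = gender_distribution(sample_users)
```

Mapping every unrecognized value to "unknown" mirrors the gray slice in the chart.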

2. Top 30 Users by Followers

The top three users are Zhang Jiawei, Li Kaifu, and Huang Jixin. All three are well-known, widely followed accounts, which suggests the crawled data is credible.
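A ranking like this can also be computed offline from the saved records. The sketch below assumes Python 3 and a hypothetical follower_count field on each record; the sample data is invented for illustration.

```python
import heapq


def top_users(users, field, n=30):
    """Return the n records with the largest value of `field` (missing counts as 0)."""
    return heapq.nlargest(n, users, key=lambda user: user.get(field, 0))


# Made-up sample records; `follower_count` is an assumed field name.
sample_users = [
    {"name": "user_a", "follower_count": 120},
    {"name": "user_b", "follower_count": 4500},
    {"name": "user_c", "follower_count": 30},
    {"name": "user_d"},  # no follower count recorded
]
top_two = [user["name"] for user in top_users(sample_users, "follower_count", n=2)]
```

The same helper works for the ranking by articles written if the records carry a corresponding count field.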

3. Top 30 Users by Articles Written

Crawler Architecture

The architecture diagram is shown below.

Implementation Details

Start from the profile URL of an active user (e.g., Li Kaifu) and keep every visited URL in a set. For each user, fetch the list of followees, add their URLs to a second set of pending URLs, and skip any URL that has already been visited. Parse each user's personal information and save it to a local file.
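The traversal described above can be sketched as follows, assuming Python 3. The function names, the max_users limit, and the fake follow graph are hypothetical; fetch_followees stands in for the real network call so that the example runs offline.

```python
def crawl(seed_url, fetch_followees, max_users=1000):
    """Visit users reachable from seed_url through followee links.

    `fetch_followees(url)` must return an iterable of followee URLs.
    Two sets are kept, as in the description: visited and pending URLs.
    """
    visited = set()
    to_visit = {seed_url}
    while to_visit and len(visited) < max_users:
        url = to_visit.pop()  # sets are unordered, so visiting order is arbitrary
        visited.add(url)
        # Keep only followees that have not been crawled yet.
        to_visit.update(set(fetch_followees(url)) - visited)
    return visited


# A fake follow graph used in place of real HTTP requests.
FAKE_GRAPH = {
    "seed": ["a", "b"],
    "a": ["b", "seed"],
    "b": ["c"],
    "c": [],
}
crawled = crawl("seed", lambda url: FAKE_GRAPH.get(url, []))
```

Passing the fetch function in as a parameter keeps the de-duplication logic separate from the HTTP and parsing code, which makes it easy to test.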

Logstash reads the local files and forwards data to Elasticsearch; Kibana visualizes the data.
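One convenient local format for this hand-off is one JSON object per line, which a file-reading pipeline can consume line by line. The article does not say which format the crawler writes, so the sketch below is an assumption; it uses Python 3, and the file name and record fields are made up.

```python
import json
import os
import tempfile


def save_user(user, path):
    """Append one user record to `path` as a single JSON line."""
    with open(path, "a", encoding="utf-8") as output_file:
        # ensure_ascii=False keeps Chinese names readable in the file.
        output_file.write(json.dumps(user, ensure_ascii=False) + "\n")


# Write two made-up records to a temporary file.
out_path = os.path.join(tempfile.mkdtemp(), "zhihu_users.jsonl")
save_user({"name": "user_a", "follower_count": 10}, out_path)
save_user({"name": "user_b", "follower_count": 3}, out_path)
```

Appending one record per line means a partially written crawl is still readable up to the last complete line.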

Authorization Retrieval

Open Chrome, log in to Zhihu, and open the developer tools on a user's page. Click “Follow”, refresh the page, and copy the value of the authorization request header, as shown in the screenshot.
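Once copied, the value has to be sent with every request the crawler makes. The sketch below shows one way to build the headers, assuming Python 3. The environment variable name ZHIHU_AUTHORIZATION and the User-Agent string are hypothetical choices, not details from the article.

```python
import os


def build_headers(authorization):
    """Build request headers carrying the copied authorization value."""
    if not authorization:
        raise ValueError("missing authorization value")
    return {
        "authorization": authorization,
        # Assumed browser-like User-Agent; the article does not specify one.
        "User-Agent": "Mozilla/5.0",
    }


# ZHIHU_AUTHORIZATION is a hypothetical variable name; keeping the value
# out of the source code avoids committing it by accident.
token = os.environ.get("ZHIHU_AUTHORIZATION", "")
headers = build_headers(token) if token else None

# With the requests library from the tech stack, the headers would be sent as:
#   requests.get(user_url, headers=headers)
```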

Possible Improvements

Introduce a thread pool to increase crawling speed.

Replace the in‑memory set() with Redis for URL storage.

Store crawled data in MongoDB instead of local files.

Filter users by follower count or topic participation before saving.
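As an illustration of the first suggestion, a thread pool can fetch several user pages concurrently. This is a minimal sketch rather than the author's implementation. It assumes Python 3, where concurrent.futures is part of the standard library; fetch_all, fake_fetch, and the worker count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call `fetch(url)` for every URL on a pool of worker threads.

    Returns a dict mapping each URL to its result. Any exception raised
    inside `fetch` is re-raised here when the results are collected.
    """
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url):
    """Stand-in for a real HTTP request so the sketch runs offline."""
    return "page for " + url


pages = fetch_all(["u1", "u2", "u3"], fake_fetch, max_workers=2)
```

Threads suit this workload because the crawler spends most of its time waiting on network responses. A real crawler would also need to throttle its request rate.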

ELK Suite Notes

Installation details are available on the official Elastic website. A sample Logstash configuration file is shown below.

Conclusion

The harvested user data can be analyzed further by geography, education, age, and other attributes. Web crawling remains a practical way to extract insight from the large volume of data published on the internet.

