How to Build a Python Web Scraper for Zhihu: Login, User Data, and Answers

This article walks through using Python's requests and BeautifulSoup libraries to simulate Zhihu login, extract user profiles, retrieve answer likers, download avatars, collect all answers for a question, and store the data in SQLite, while highlighting common challenges and future enhancements.

MaGe Linux Operations

Oct 16, 2018

Introduction

I recently studied web crawling and wrote a Python scraper for Zhihu; this article summarizes the process. Web crawlers fetch information from the web automatically, which matters for machine learning and data mining work that depends on large datasets.

Tools

Python offers many open-source packages for this job. I used requests and BeautifulSoup4 (bs4), together with the standard-library json module. requests sends the HTTP requests, bs4 parses the returned HTML, and json decodes JSON responses.

Simulating Login

Crawling Zhihu requires being logged in. The provided login function signs in with a Zhihu account (replace the data fields with your own credentials). On the first run it saves the captcha image as zhihucaptcha.gif, prompts you to type the captcha, and writes the session cookies to cookiefile. Subsequent runs reuse the saved cookies instead of logging in again.

A global s = requests.session() maintains the logged‑in session throughout the crawl.
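A minimal sketch of the cookie reuse described above, assuming the session keeps its cookies in a file-backed http.cookiejar.LWPCookieJar. The helper name attach_cookie_jar is hypothetical and not taken from the original code, and the login request itself is left out because its endpoint and form fields are not given here.

```python
import os
from http.cookiejar import LWPCookieJar

COOKIE_FILE = "cookiefile"  # file name mentioned in the article


def attach_cookie_jar(session, path=COOKIE_FILE):
    """Give `session` a file-backed cookie jar (hypothetical helper).

    `session` is assumed to be a requests.Session. Returns True when
    cookies from an earlier run were loaded, False when no cookie file
    exists yet and a login is still needed.
    """
    jar = LWPCookieJar(path)
    loaded = os.path.exists(path)
    if loaded:
        # Keep session-only cookies too; they usually carry the login state.
        jar.load(ignore_discard=True, ignore_expires=True)
    session.cookies = jar
    return loaded


# Usage (not executed here):
#   import requests
#   s = requests.session()
#   if not attach_cookie_jar(s):
#       ...  # log in and solve the captcha, then:
#       s.cookies.save(ignore_discard=True)
```

Loaded cookies can still have expired on the server side, so a real crawler would also check that a request for a logged-in page succeeds before skipping the login.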

Fetching User Basic Information

Each Zhihu user has a unique ID (e.g., marcovaldong). Accessing https://www.zhihu.com/people/<userID> reveals profile details such as location, industry, gender, education, vote counts, followers, and followees. The function get_userInfo(userID) returns a list of 19 fields including nickname, ID, location, company, position, school, major, vote counts, question/answer counts, article count, collection count, edit count, follower/followee numbers, and profile views.

Getting All Likers of an Answer

Each Zhihu question and answer has a unique numeric ID. To obtain the likers of a specific answer, capture the voters_profile request (e.g., https://www.zhihu.com/answer/5430533/voters_profile), which returns JSON containing liker information. The response is paginated, with 20 likers per request. For each liker, extract the nickname, profile URL, vote count, thanks count, question count, and answer count, and save the data to a text file named after the answer ID.

Anonymous or deleted users are noted as “information missing” in the output.
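The pagination and output steps above could be sketched as follows. The JSON layout is an assumption, not something the article confirms: the sketch expects a "payload" list of entries and a "paging" object whose "next" field holds the relative URL of the next page. The function names are hypothetical, and turning each raw entry into the six fields is left out because the entry format is not shown.

```python
import os

BASE_URL = "https://www.zhihu.com"


def iter_liker_entries(session, answer_id):
    """Yield raw liker entries for one answer, page by page.

    Assumptions: `session` is a logged-in requests.Session; each JSON
    response has a "payload" list and a "paging" object whose "next"
    field is a relative URL, empty on the last page.
    """
    url = f"{BASE_URL}/answer/{answer_id}/voters_profile"
    while url:
        data = session.get(url).json()
        for entry in data.get("payload", []):
            yield entry
        next_path = data.get("paging", {}).get("next")
        url = BASE_URL + next_path if next_path else None


def save_likers(answer_id, rows, out_dir="."):
    """Write one line per liker to <answer_id>.txt and return the path.

    `rows` holds already-extracted field tuples; None marks an
    anonymous or deleted user.
    """
    path = os.path.join(out_dir, f"{answer_id}.txt")
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            if row is None:
                f.write("information missing\n")
            else:
                f.write("\t".join(str(v) for v in row) + "\n")
    return path
```

Using a generator keeps memory flat for answers with many likers, since each page is processed before the next one is requested.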

Fetching Followers and Followees

The same approach retrieves a user's followers and followees. When crawling a high-profile user, Zhihu may start rejecting requests partway through (for example, after 10,020 followers have been fetched), and the resulting errors call for additional anti-blocking strategies.

Extracting User Avatars

Given a user ID, a function parses the user's profile page to locate the avatar URL, downloads the image, and saves it locally with the user ID as the filename.
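A sketch of the download-and-save half of that function, assuming the avatar URL has already been found. The name save_avatar and its parameters are hypothetical, and locating the URL in the profile page is omitted because the page markup is not shown here.

```python
import os
from urllib.parse import urlparse


def save_avatar(session, user_id, avatar_url, out_dir="."):
    """Download `avatar_url` and save it as <user_id><ext>.

    `session` is assumed to be a requests.Session. Returns the path
    of the saved file.
    """
    # Reuse the image's own extension; fall back to .jpg if it has none.
    ext = os.path.splitext(urlparse(avatar_url).path)[1] or ".jpg"
    path = os.path.join(out_dir, user_id + ext)
    resp = session.get(avatar_url, stream=True)
    resp.raise_for_status()
    with open(path, "wb") as f:
        # Stream the body in chunks instead of holding it all in memory.
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
    return path
```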

Crawling All Answers of a Question

Given a question ID, a function iterates through all of the question's answers, extracts the text of each (images are omitted), and stores each answer in a text file named after the answer author's ID.
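The text-extraction and file-writing steps might look like the sketch below, which uses BeautifulSoup with Python's built-in html.parser. The function names are hypothetical, and the code assumes the HTML of a single answer and its author's ID have already been pulled out of the question page, since the page markup is not shown here.

```python
import os
from bs4 import BeautifulSoup


def answer_text(answer_html):
    """Return the plain text of one answer's HTML, dropping images."""
    soup = BeautifulSoup(answer_html, "html.parser")
    for img in soup.find_all("img"):
        img.decompose()
    for br in soup.find_all("br"):
        br.replace_with("\n")  # keep line breaks readable
    return soup.get_text().strip()


def save_answer(author_id, answer_html, out_dir="."):
    """Write the answer text to <author_id>.txt and return the path."""
    path = os.path.join(out_dir, f"{author_id}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(answer_text(answer_html) + "\n")
    return path
```

Anonymous authors have no usable ID, so a real crawler would need a fallback file name for them to avoid overwriting earlier answers.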

Storing Data in a Database

After gathering the data, I used sqlite3 to create a table and store user information for easy retrieval. Future work includes bulk crawling of users and their follow relationships, visualizing the network, learning Scrapy, and extending crawling to other platforms such as Weibo.
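The storage step described in this section could be sketched with sqlite3 as follows. The table name, the column names, and the choice of five columns are assumptions made for illustration; the original table holds more of the profile fields.

```python
import sqlite3

# Hypothetical subset of the profile fields; column names are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    user_id   TEXT PRIMARY KEY,
    nickname  TEXT,
    location  TEXT,
    followers INTEGER,
    followees INTEGER
)
"""


def open_db(path="zhihu.db"):
    """Open the database file and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_user(conn, user):
    """Insert or update one user; `user` is a dict keyed by column name."""
    conn.execute(
        "INSERT OR REPLACE INTO users "
        "VALUES (:user_id, :nickname, :location, :followers, :followees)",
        user,
    )
    conn.commit()


def load_user(conn, user_id):
    """Return the stored row for `user_id` as a tuple, or None."""
    cur = conn.execute("SELECT * FROM users WHERE user_id = ?", (user_id,))
    return cur.fetchone()
```

Making the user ID the primary key means re-crawling a user updates the existing row rather than adding a duplicate.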

Anti‑Scraping Considerations

Zhihu may throttle or block clients that send requests too frequently, and it can start demanding captchas. Robust crawling will therefore need rate limiting and other anti-anti-scraping techniques.
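One simple form of rate limiting is to enforce a minimum gap between requests, with a little random jitter. The class below is an illustrative sketch, not part of the original scraper; the name PoliteFetcher and the default delay values are assumptions that would need tuning against the site's actual behavior.

```python
import random
import time


class PoliteFetcher:
    """Space out requests made through a requests.Session-like object."""

    def __init__(self, session, min_delay=2.0, jitter=1.0, sleep=time.sleep):
        self.session = session
        self.min_delay = min_delay  # assumed value, in seconds
        self.jitter = jitter        # extra random wait of up to this much
        self._sleep = sleep         # injectable so it can be replaced in tests
        self._last = None           # monotonic time of the previous request

    def get(self, url, **kwargs):
        """Wait if the previous request was too recent, then fetch `url`."""
        if self._last is not None:
            wait = self.min_delay + random.uniform(0, self.jitter)
            elapsed = time.monotonic() - self._last
            if elapsed < wait:
                self._sleep(wait - elapsed)
        self._last = time.monotonic()
        return self.session.get(url, **kwargs)
```

Wrapping the shared session once, for example fetcher = PoliteFetcher(s), lets every crawl function go through the same throttle.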


