How to Build a Python Zhihu Crawler: Login, User Data, and Answer Likes

This guide walks through using Python's requests and BeautifulSoup libraries to simulate Zhihu login, extract user profiles, retrieve answer likers, download avatars, fetch all answers for a question, and store the collected data in a SQLite database.

MaGe Linux Operations

Oct 8, 2019

Simulated Login

To crawl Zhihu, first simulate a login with a function that sends a POST request containing your credentials. The first run saves the captcha image and requires you to enter the captcha manually; the resulting cookies are then saved to a file so that later runs can log in automatically.
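The cookie handling described above can be sketched as follows. This is a minimal illustration, not the article's own login function: `load_cookie_jar` and `login` are hypothetical names, the login and captcha URLs are passed in as parameters because the article does not state them, and the form field name "captcha" is an assumption. The `session` argument is expected to be a requests.Session.

```python
import http.cookiejar as cookiejar


def load_cookie_jar(path):
    """Return an LWPCookieJar bound to `path`, preloaded if the file exists."""
    jar = cookiejar.LWPCookieJar(filename=path)
    try:
        jar.load(ignore_discard=True)
    except (FileNotFoundError, cookiejar.LoadError):
        pass  # first run: no usable cookie file yet
    return jar


def login(session, login_url, captcha_url, credentials,
          captcha_path="captcha.gif", ask=input):
    """Log in once with a hand-typed captcha, then persist the cookies.

    `session` is expected to be a requests.Session whose `cookies`
    attribute is the jar returned by load_cookie_jar(). The URLs and the
    form field names (including "captcha") are placeholders.
    """
    # Save the captcha image so a human can open it and read it.
    image = session.get(captcha_url)
    with open(captcha_path, "wb") as fh:
        fh.write(image.content)

    form = dict(credentials)
    form["captcha"] = ask("Captcha shown in %s: " % captcha_path)

    response = session.post(login_url, data=form)
    response.raise_for_status()
    session.cookies.save(ignore_discard=True)  # reused by later runs
    return response
```

To use it, create a requests.Session, assign `load_cookie_jar("cookies.txt")` to its `cookies` attribute, and call `login` only when the jar came back empty.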

Get Basic User Info

Each Zhihu user has a unique ID. Requesting https://www.zhihu.com/people/{userID} returns the profile page, which carries details such as location, industry, gender, education, vote counts, followers, and followees. The provided get_userInfo(userID) function returns a list of 19 fields, including nickname, ID, location, company, position, school, major, vote counts, and follower statistics.
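As a rough sketch of how a profile page might be turned into such a list, the example below pulls a few fields out of HTML by CSS class. It is not the article's get_userInfo: `parse_user_info` is a hypothetical name, it covers only four illustrative fields rather than all 19, the class names in `FIELD_CLASSES` are placeholders that would need checking against the real page, and it uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser

PROFILE_URL = "https://www.zhihu.com/people/{user_id}"

# Hypothetical field -> CSS class mapping; inspect the live page for real names.
FIELD_CLASSES = {
    "nickname": "name",
    "location": "location",
    "company": "employment",
    "school": "education",
}

VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class _FieldParser(HTMLParser):
    """Collect the text inside the first element carrying each known class."""

    def __init__(self, class_to_field):
        super().__init__()
        self.class_to_field = class_to_field
        self.current = None  # field currently being captured
        self.depth = 0       # nesting depth inside the captured element
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        if self.current is not None:
            self.depth += 1
            return
        for cls in (dict(attrs).get("class") or "").split():
            field = self.class_to_field.get(cls)
            if field is not None and field not in self.fields:
                self.current, self.depth = field, 1
                self.fields[field] = ""
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.current is None:
            return
        self.depth -= 1
        if self.depth == 0:
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] += data


def parse_user_info(html, user_id, field_classes=FIELD_CLASSES):
    """Return [user_id, field1, field2, ...] with None for missing fields."""
    parser = _FieldParser({cls: name for name, cls in field_classes.items()})
    parser.feed(html)
    parser.close()
    row = [user_id]
    for name in field_classes:
        row.append(parser.fields.get(name, "").strip() or None)
    return row
```

The page itself would be fetched with the logged-in session from `PROFILE_URL.format(user_id=...)`. Returning None for a missing field keeps the list at a fixed length, which matters once the rows go into a database table.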

Get the List of Answer Likers

To obtain all users who liked a specific answer, first read the answer's unique ID from its URL, then request the corresponding voters_profile endpoint (e.g., https://www.zhihu.com/answer/5430533/voters_profile). The JSON response contains liker information and pagination URLs. Each request returns up to 20 likers, with each liker's nickname, profile URL, vote count, thanks count, and question and answer counts. The script saves the data to a text file named after the answer ID; for anonymous or deleted users it notes that the information is missing.
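The pagination loop might look like the sketch below. The JSON key names ("payload" for the list of likers, "paging" and "next" for the next page's relative URL) are assumptions, since the article only says that the response carries liker information and pagination URLs. `iter_voter_pages` and `save_voters` are hypothetical names, each liker entry is treated as an opaque string, and `session` is expected to be a logged-in requests.Session.

```python
import os
from urllib.parse import urljoin

BASE_URL = "https://www.zhihu.com"


def iter_voter_pages(session, answer_id):
    """Yield one list of liker entries per request until pages run out.

    Assumes each JSON page looks like
    {"payload": [...], "paging": {"next": "<relative URL or empty>"}};
    these key names are an assumption, not taken from the article.
    """
    url = "%s/answer/%s/voters_profile" % (BASE_URL, answer_id)
    while url:
        response = session.get(url)
        response.raise_for_status()
        data = response.json()
        yield data.get("payload", [])
        next_url = (data.get("paging") or {}).get("next")
        url = urljoin(BASE_URL, next_url) if next_url else None


def save_voters(session, answer_id, directory="."):
    """Write every liker entry to <answer_id>.txt, one per line."""
    path = os.path.join(directory, "%s.txt" % answer_id)
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for page in iter_voter_pages(session, answer_id):
            for entry in page:
                # Anonymous or deleted users may come back empty.
                fh.write("%s\n" % (entry or "[no user information]"))
                count += 1
    return path, count
```

Writing the loop as a generator keeps the paging logic separate from the file output, so the same pages can later be parsed into structured records instead of plain lines.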

Extract User Avatar

Given a user ID, a function parses the user's profile page to locate the avatar image URL, downloads the image, and saves it locally using the user ID as the filename.
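A minimal sketch of the avatar step, under stated assumptions: the CSS class "Avatar" used to locate the image is a placeholder, the function names are hypothetical, and the standard library's html.parser stands in for BeautifulSoup (where the equivalent lookup would be a `find` on the `img` tag). `session` is expected to be a logged-in requests.Session.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse


class _AvatarFinder(HTMLParser):
    """Remember the src of the first <img> that carries the given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.src is None and tag == "img" and self.css_class in classes:
            self.src = attrs.get("src")


def find_avatar_url(html, css_class="Avatar"):
    """Return the avatar image URL, or None if no matching <img> exists."""
    finder = _AvatarFinder(css_class)
    finder.feed(html)
    finder.close()
    return finder.src


def avatar_filename(user_id, image_url):
    """Name the file after the user ID, keeping the image's extension."""
    extension = os.path.splitext(urlparse(image_url).path)[1] or ".jpg"
    return user_id + extension


def download_avatar(session, user_id, directory="."):
    """Fetch the profile page, then the avatar, and save it locally."""
    page = session.get("https://www.zhihu.com/people/%s" % user_id)
    page.raise_for_status()
    image_url = find_avatar_url(page.text)
    if image_url is None:
        return None
    image = session.get(image_url)
    image.raise_for_status()
    path = os.path.join(directory, avatar_filename(user_id, image_url))
    with open(path, "wb") as fh:
        fh.write(image.content)
    return path
```

Taking the extension from the URL path rather than the full URL avoids query strings ending up in the filename.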

Fetch All Answers of a Question

By providing a question ID, another function iterates through all answer pages, extracts the textual content of each answer (excluding images), and saves each answer to a separate text file named after the answer's author ID.

Database Storage

After gathering the data, the author stores user information in a SQLite3 database table, enabling easy retrieval and further analysis such as visualizing follow relationships among prominent users. Future work includes scaling the crawler, handling anti‑scraping measures, and exploring Scrapy or Weibo scraping.
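A small sketch of the storage step with Python's built-in sqlite3 module follows. The table name, the six columns, and the helper names are illustrative assumptions; the article's actual schema is not shown.

```python
import sqlite3

# Illustrative schema: a subset of profile fields, keyed by user ID.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    user_id   TEXT PRIMARY KEY,
    nickname  TEXT,
    location  TEXT,
    company   TEXT,
    school    TEXT,
    followers INTEGER
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_user(conn, row):
    """Insert one user row, replacing any earlier row for the same ID."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO users VALUES (?, ?, ?, ?, ?, ?)", row
        )


def top_followed(conn, limit=10):
    """Return (user_id, followers) pairs, most followed first."""
    query = (
        "SELECT user_id, followers FROM users "
        "ORDER BY followers DESC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()
```

Using the user ID as the primary key with INSERT OR REPLACE means that crawling the same user twice updates the row instead of duplicating it.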
