How to Build a Python Zhihu Crawler: Login, User Data, and Answer Likes Extraction

This article walks through creating a Python web crawler for Zhihu, covering simulated login, fetching user profiles, extracting answer likers, downloading avatars, gathering all answers for a question, and storing the collected data in a SQLite database.

MaGe Linux Operations

Oct 8, 2018

Simulated Login

Many Zhihu pages are inaccessible without authentication, so the crawler must first simulate a login. The author provides a login function that submits the user's credentials and handles the captcha on the first run. It stores the cookies and the captcha image locally so that subsequent logins can run automatically.
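
The cookie-persistence part of this step can be sketched with the standard library alone. This is a minimal sketch, not the author's login function: the file name zhihu_cookies.txt and the helper names build_opener_with_cookies and needs_login are hypothetical, and the actual login request and captcha handling are left out because the source does not specify them.

```python
import http.cookiejar
import os
import urllib.request

COOKIE_FILE = "zhihu_cookies.txt"  # hypothetical file name, not from the article


def build_opener_with_cookies(cookie_file=COOKIE_FILE):
    """Return (opener, jar), reloading cookies saved by an earlier run if any."""
    jar = http.cookiejar.LWPCookieJar(cookie_file)
    if os.path.exists(cookie_file):
        # Session cookies are marked "discard", so keep them explicitly.
        jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return opener, jar


def needs_login(jar):
    """Assumption: an empty jar means no earlier login was saved."""
    return len(jar) == 0
```

After a successful first login, calling jar.save(ignore_discard=True, ignore_expires=True) writes the cookies to disk, and the next run picks them up through build_opener_with_cookies.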

Fetching User Basic Information

Each Zhihu user has a unique ID (e.g., marcovaldong). The profile page at https://www.zhihu.com/people/{userID} exposes details such as location, industry, gender, education, vote counts, and follow statistics. The function get_userInfo(userID) returns a list of 19 fields: nickname, ID, residence, industry, gender, company, position, school, major, upvotes, thanks, questions asked, answers given, articles, collections, edits, followers, followees, and profile views.
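
One way to keep those 19 values in a fixed order is a named tuple. The sketch below is illustrative only: the field names, UserInfo, profile_url, and to_user_info are hypothetical and chosen to mirror the list above. The page parsing that get_userInfo performs is not shown, because the source does not describe the page markup.

```python
from collections import namedtuple

# Hypothetical names for the 19 fields, in the order the article lists them.
USER_FIELDS = (
    "nickname", "user_id", "residence", "industry", "gender",
    "company", "position", "school", "major",
    "upvotes", "thanks", "questions", "answers", "articles",
    "collections", "edits", "followers", "followees", "profile_views",
)

UserInfo = namedtuple("UserInfo", USER_FIELDS)


def profile_url(user_id):
    """Build the profile URL pattern given in the article."""
    return "https://www.zhihu.com/people/{}".format(user_id)


def to_user_info(values):
    """Wrap a 19-item list (as returned by a get_userInfo-style function)."""
    if len(values) != len(USER_FIELDS):
        raise ValueError("expected {} fields, got {}".format(len(USER_FIELDS), len(values)))
    return UserInfo(*values)
```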

Extracting All Likers of an Answer

The article explains how to obtain the list of users who liked a specific answer. Each question and answer has a unique numeric ID extracted from its URL. By inspecting the network request for the "5321 people liked" button, you can find a URL like https://www.zhihu.com/answer/5430533/voters_profile, where 5430533 is the ID used to fetch the likers.
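
As a rough sketch of the ID handling, the helpers below pull numeric IDs out of a page URL and build the voters_profile URL quoted above. The /question/<id>/answer/<id> URL shape and both function names are assumptions. The ID passed to voters_url should be the one seen in the network request, as described above.

```python
import re

# Assumed URL shape: https://www.zhihu.com/question/<qid>/answer/<aid>
_ID_PATTERN = re.compile(r"/question/(\d+)(?:/answer/(\d+))?")


def parse_ids(url):
    """Return (question_id, answer_id); answer_id is None if absent."""
    match = _ID_PATTERN.search(url)
    if match is None:
        raise ValueError("no question ID found in {!r}".format(url))
    question_id = int(match.group(1))
    answer_id = int(match.group(2)) if match.group(2) else None
    return question_id, answer_id


def voters_url(voters_id):
    """Build the likers endpoint from the ID observed in the network request."""
    return "https://www.zhihu.com/answer/{}/voters_profile".format(int(voters_id))
```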

Using requests to GET this endpoint returns a JSON payload containing liker information. The payload is paginated, with each request returning up to 20 users and a link to the next page. The script extracts each liker’s nickname, profile URL (user ID), upvote count, thank count, question count, and answer count. Anonymous or deleted users are noted as missing.
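
The pagination loop might look like the sketch below. It uses urllib from the standard library rather than requests, and it assumes a response shape of {"payload": [...], "paging": {"next": "..."}}. Those key names are a guess for illustration, not something the source confirms, and iter_likers and fetch_json are hypothetical helper names.

```python
import json
import urllib.request
from urllib.parse import urljoin


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def iter_likers(first_url, fetch=fetch_json, max_pages=1000):
    """Yield liker entries page by page until no next link remains."""
    url = first_url
    seen = set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guards against a next link that loops back
        page = fetch(url)
        # Assumed keys: "payload" holds up to 20 entries per page.
        for item in page.get("payload", []):
            yield item
        next_link = (page.get("paging") or {}).get("next")
        # The next link may be relative, so resolve it against the current URL.
        url = urljoin(url, next_link) if next_link else None
```

Passing the fetch function as a parameter keeps the loop testable offline and makes it easy to swap in a session that carries the login cookies.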

Downloading User Avatars

Given a user’s unique ID, a function parses the user’s profile page to locate the avatar image URL, downloads the image, and saves it locally using the user ID as the filename.
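
The download-and-save half of that function could be sketched as follows. The avatar URL is taken as an input here because the source does not describe the profile page markup. The avatars directory, the .jpg fallback extension, and the names fetch_bytes and save_avatar are assumptions.

```python
import os
import urllib.request
from urllib.parse import urlparse


def fetch_bytes(url, timeout=10):
    """Download a URL and return the raw bytes."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def save_avatar(user_id, avatar_url, out_dir="avatars", fetch=fetch_bytes):
    """Save the image as <out_dir>/<user_id><ext> and return the path."""
    # Reuse the extension from the URL path; fall back to .jpg (an assumption).
    ext = os.path.splitext(urlparse(avatar_url).path)[1] or ".jpg"
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, user_id + ext)
    with open(path, "wb") as handle:
        handle.write(fetch(avatar_url))
    return path
```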

Fetching All Answers of a Question

By providing a question’s unique ID, another function retrieves all answers under that question. Only the textual content is saved (images are omitted), and each answer is written to a separate .txt file named after the answerer’s ID.
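
A minimal sketch of the text-only saving step, assuming each answer arrives as an HTML fragment together with the answerer's ID. The helper names html_to_text and save_answer and the answers directory are hypothetical. Fetching the answers themselves is not shown.

```python
import os
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and ignore tags, so images are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Keep paragraph and line breaks readable in the plain-text output.
        if tag in ("p", "br"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the textual content of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


def save_answer(author_id, answer_html, out_dir="answers"):
    """Write one answer to <out_dir>/<author_id>.txt and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, author_id + ".txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html_to_text(answer_html))
    return path
```

Naming files after the answerer's ID means two answers without a distinct ID (for example, anonymous ones) would overwrite each other, so a real crawler may need a suffix for those cases.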

Storing Data in a Database

After gathering the data, the author demonstrates storing user information in a SQLite database. The basic schema creates a single table that holds the 19 fields returned by get_userInfo. With the SQLite basics in place, the planned next steps are bulk collection of user and follower data and visualization of the follow network, possibly using Scrapy for larger-scale crawling.
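
A table for the 19 fields could be sketched with the sqlite3 module as below. The table name users, the column names and types, and the choice of the user ID as primary key are assumptions for illustration, not the author's schema.

```python
import sqlite3

# Hypothetical column names and types for the 19 fields from get_userInfo.
COLUMNS = [
    ("nickname", "TEXT"), ("user_id", "TEXT"), ("residence", "TEXT"),
    ("industry", "TEXT"), ("gender", "TEXT"), ("company", "TEXT"),
    ("position", "TEXT"), ("school", "TEXT"), ("major", "TEXT"),
    ("upvotes", "INTEGER"), ("thanks", "INTEGER"), ("questions", "INTEGER"),
    ("answers", "INTEGER"), ("articles", "INTEGER"), ("collections", "INTEGER"),
    ("edits", "INTEGER"), ("followers", "INTEGER"), ("followees", "INTEGER"),
    ("profile_views", "INTEGER"),
]


def create_table(conn):
    """Create the users table if it does not exist yet."""
    cols = ", ".join("{} {}".format(name, kind) for name, kind in COLUMNS)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ({}, PRIMARY KEY (user_id))".format(cols)
    )


def insert_user(conn, row):
    """Insert or update one 19-item row using parameter placeholders."""
    if len(row) != len(COLUMNS):
        raise ValueError("expected {} values".format(len(COLUMNS)))
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.execute("INSERT OR REPLACE INTO users VALUES ({})".format(placeholders), row)
    conn.commit()
```

Using INSERT OR REPLACE keyed on the user ID means that crawling the same user twice updates the row instead of duplicating it.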

Anti‑Scraping Considerations

The author notes that Zhihu may block frequent requests, prompting captcha challenges. Future improvements will involve rate‑limiting, rotating proxies, and other anti‑anti‑scraping techniques.
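
Rate limiting, the first of those improvements, can be approximated with a randomized delay before each request and an exponential backoff after a failure. This is a sketch under stated assumptions: polite_fetch and its default delays are hypothetical, and the source does not say which limits Zhihu enforces.

```python
import random
import time


def polite_fetch(fetch, url, min_delay=1.0, max_delay=3.0,
                 retries=3, backoff=2.0, sleep=time.sleep):
    """Call fetch(url) with a random pause first, retrying on OSError.

    urllib's URLError and HTTPError are subclasses of OSError, so network
    failures raised by a urllib-based fetch are retried.
    """
    for attempt in range(retries + 1):
        # Random pause so requests are not sent at a fixed rhythm.
        sleep(random.uniform(min_delay, max_delay))
        try:
            return fetch(url)
        except OSError:
            if attempt == retries:
                raise  # give up after the last allowed retry
            # Wait longer after each failure: backoff, 2*backoff, 4*backoff...
            sleep(backoff * 2 ** attempt)
```

The sleep function is injectable so the delays can be replaced in tests. Rotating proxies are not covered by this sketch.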
