How to Build a Python Zhihu Web Scraper: Login, User Data, and More

This article walks through building a Python web scraper for Zhihu, covering login simulation, extracting user profiles, answer likers, followers, avatars, and all answers of a question, and storing the collected data in SQLite, while highlighting challenges like captcha and anti‑scraping limits.

MaGe Linux Operations

May 23, 2017

Introduction

Web crawling involves automatically fetching information from the web using scripts. It is essential for tasks like machine learning and data mining, where large amounts of data are needed. This guide summarizes how to use Python to scrape data from Zhihu.

Tools Used

The tutorial relies on two third-party packages, requests for HTTP requests and BeautifulSoup4 for parsing HTML, plus the standard-library json module for handling JSON data.

Simulating Login

A login function is required because many Zhihu pages are inaccessible without authentication. Users supply their account credentials in the function’s data parameter. The first run prompts for a captcha and leaves two files behind: cookiefile, which holds the session cookies, and zhihucaptcha.gif, the captcha image. Subsequent runs reuse the saved cookies instead of logging in again. A global session object, s = requests.session(), maintains the logged‑in state across requests.
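The cookie reuse described above can be sketched with the standard-library http.cookiejar module. This is a minimal sketch, not the article's actual login code: the helper names attach_cookie_jar and save_cookies are hypothetical, and the captcha prompt and the login request itself are left out because their details are not given here.

```python
import os
from http.cookiejar import LWPCookieJar, LoadError

COOKIE_FILE = "cookiefile"  # file name mentioned in the article


def attach_cookie_jar(session, path=COOKIE_FILE):
    """Give `session` a file-backed cookie jar (hypothetical helper).

    Returns True when cookies from an earlier run were loaded, in which
    case the login step may be skippable (assuming they have not expired).
    """
    jar = LWPCookieJar(filename=path)
    loaded = False
    if os.path.exists(path):
        try:
            # ignore_discard also restores cookies that carry no expiry date
            jar.load(ignore_discard=True)
            loaded = True
        except (LoadError, OSError):
            loaded = False  # unreadable file: fall back to a fresh login
    session.cookies = jar
    return loaded


def save_cookies(session):
    """Write the session's cookies to disk, e.g. after a successful login."""
    session.cookies.save(ignore_discard=True)
```

With requests installed, the intended use would be s = requests.session(), then attach_cookie_jar(s); if it returns False, perform the login and call save_cookies(s) afterwards.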

Fetching User Basic Information

Each Zhihu user has a unique ID (e.g., marcovaldong). Accessing https://www.zhihu.com/people/USER_ID reveals profile details such as location, industry, gender, education, vote counts, and follow statistics. The function get_userInfo(userID) returns a list of 19 fields, including nickname, ID, location, industry, gender, company, position, school, major, upvotes, thanks, questions asked, answers given, articles written, collections, public edits, followers, followees, and profile views.
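A 19-item list is easy to misindex, so one option is to wrap it in a named tuple. The sketch below is an illustration only: the field names and the helpers profile_url and as_user_info are assumptions, and only the field order and the profile URL pattern come from the description above.

```python
from collections import namedtuple

# Field names are illustrative; the order follows the article's list.
UserInfo = namedtuple("UserInfo", [
    "nickname", "user_id", "location", "industry", "gender",
    "company", "position", "school", "major",
    "upvotes", "thanks", "questions", "answers", "articles",
    "collections", "public_edits", "followers", "followees",
    "profile_views",
])


def profile_url(user_id):
    """Build the profile page URL for a user ID."""
    return "https://www.zhihu.com/people/" + user_id


def as_user_info(fields):
    """Wrap a 19-item list so that fields can be read by name."""
    if len(fields) != len(UserInfo._fields):
        raise ValueError(
            "expected %d fields, got %d" % (len(UserInfo._fields), len(fields))
        )
    return UserInfo(*fields)
```

A result from get_userInfo could then be read as as_user_info(result).followers rather than result[16].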

Getting All Likers of an Answer

Each Zhihu question and answer has a unique numeric ID. By inspecting network requests when clicking the “5321 people upvoted” link, one can obtain a URL like https://www.zhihu.com/answer/5430533/voters_profile. The numeric part (5430533) is used to request a JSON response containing liker information. The JSON provides pagination URLs, allowing extraction of each liker’s nickname, profile URL, upvote count, thank count, question count, and answer count. The data are saved to a text file named after the answer’s ID. Anonymous or deleted users are noted as missing.
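Following the pagination URLs can be sketched as a small generator. The location of the next-page link inside the JSON (page["paging"]["next"]) is an assumption made for illustration, as are the names iter_voter_pages and fetch_json; check the actual response before relying on it.

```python
from urllib.parse import urljoin

BASE = "https://www.zhihu.com"


def voters_url(answer_id):
    """First page of the liker list for an answer ID."""
    return "%s/answer/%s/voters_profile" % (BASE, answer_id)


def iter_voter_pages(fetch_json, answer_id):
    """Yield each JSON page, following the next-page link until it is empty.

    `fetch_json(url)` must return the decoded JSON for `url`.
    ASSUMPTION: the next-page link sits at page["paging"]["next"].
    """
    url = voters_url(answer_id)
    seen = set()  # guards against a link that points back to a visited page
    while url and url not in seen:
        seen.add(url)
        page = fetch_json(url)
        yield page
        next_link = (page.get("paging") or {}).get("next")
        # relative links are resolved against the site root
        url = urljoin(BASE, next_link) if next_link else None
```

With the logged-in session s, fetch_json could be lambda url: s.get(url).json(). Passing the fetcher in as an argument keeps the paging logic testable without network access.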

Fetching Followers and Followees

Similar techniques retrieve a user’s followers and followees. The follower‑fetching function may fail after reaching around 10,020 followers due to Zhihu’s access limits, while the followee function works without apparent issues.

Extracting User Avatar

Given a user’s unique ID, a function parses the user’s profile page to locate the avatar image URL, downloads the image, and saves it locally using the user ID as the filename.
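The download-and-save half of that function might look like the sketch below. It assumes the avatar URL has already been extracted from the profile page (the HTML selector is not given here), and the names avatar_path and save_avatar, as well as the ".jpg" fallback extension, are illustrative choices.

```python
import os
from urllib.parse import urlparse


def avatar_path(user_id, avatar_url, out_dir="."):
    """Build the local file name: the user ID plus the URL's extension."""
    # ASSUMPTION: fall back to ".jpg" when the URL path has no extension
    ext = os.path.splitext(urlparse(avatar_url).path)[1] or ".jpg"
    return os.path.join(out_dir, user_id + ext)


def save_avatar(session, user_id, avatar_url, out_dir="."):
    """Download `avatar_url` with `session` and save it under the user ID."""
    resp = session.get(avatar_url)
    resp.raise_for_status()  # stop on 4xx/5xx instead of saving an error page
    path = avatar_path(user_id, avatar_url, out_dir)
    with open(path, "wb") as fh:
        fh.write(resp.content)  # raw image bytes
    return path
```

Here session is expected to be the logged-in requests session, so a call would read save_avatar(s, "marcovaldong", url).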

Scraping All Answers of a Question

Providing a question’s unique ID to a dedicated function retrieves all answers under that question. Only the textual content is saved (images are omitted), and each answer is stored in a separate .txt file named after the answer’s author ID.
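The text-only saving step can be sketched as follows. To stay dependency-free, this sketch uses the standard-library html.parser as a stand-in for BeautifulSoup's text extraction; the names html_to_text and save_answers are hypothetical, and the input is assumed to be (author ID, answer HTML) pairs that were already fetched.

```python
import os
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text nodes and skip tags, so <img> and other markup drop out."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return the text content of an HTML fragment."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts).strip()


def save_answers(answers, out_dir="."):
    """Write each (author_id, answer_html) pair to <author_id>.txt."""
    paths = []
    for author_id, answer_html in answers:
        path = os.path.join(out_dir, author_id + ".txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(html_to_text(answer_html))
        paths.append(path)
    return paths
```

Note that with this naming scheme two answers that share an author ID would overwrite each other, so a suffix such as the answer ID may be worth adding.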

Storing Data in SQLite

After gathering user information, the tutorial demonstrates a simple use of sqlite3 to create a table and insert the collected fields for later querying.
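A minimal sketch of that step with sqlite3 is shown below. The table name users, the column names, and the choice to store every field as TEXT are assumptions for illustration; only the set and order of the 19 fields come from the user-information section.

```python
import sqlite3

# Column names are illustrative; the order follows the article's 19 fields.
COLUMNS = [
    "nickname", "user_id", "location", "industry", "gender",
    "company", "position", "school", "major",
    "upvotes", "thanks", "questions", "answers", "articles",
    "collections", "public_edits", "followers", "followees",
    "profile_views",
]


def open_db(path=":memory:"):
    """Open the database and create the users table if it is missing."""
    conn = sqlite3.connect(path)
    cols = ", ".join("%s TEXT" % c for c in COLUMNS)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (%s, PRIMARY KEY (user_id))" % cols
    )
    return conn


def save_user(conn, fields):
    """Insert one 19-item record, replacing any earlier row for that user."""
    placeholders = ", ".join("?" * len(COLUMNS))
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO users VALUES (%s)" % placeholders, fields
        )
```

Keying the table on the user ID means that scraping the same profile twice updates the row rather than duplicating it. The values are passed as query parameters instead of being formatted into the SQL string, which avoids quoting problems in nicknames.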

Future work includes large‑scale crawling of user and follow relationships, visualizing the network, learning the Scrapy framework, and extending the scraper to other platforms such as Weibo. Anti‑anti‑scraping measures like request throttling will also be explored.
