How to Build a Python Web Scraper for Zhihu Answers and Generate Word Clouds

This article walks through the complete process of designing a Python web scraper to collect Zhihu answer data, parse author, ID and excerpt fields, store the results in CSV files, and finally visualize the text with a word‑cloud, including all necessary code snippets and explanations.

Python Crawling & Data Mining

Dec 5, 2020

Scraper Design Process

1. Explore URL patterns
2. Attempt to access a specific page
3. Parse the data of interest
4. Store results to CSV
5. Combine and tidy the code

1. Explore URL patterns

Press F12 to open the developer tools.

Select the Network panel, then expand all answers on the page so that the answer requests show up.

Observe the request URLs shown in the panel.

For each URL, click Preview and compare the content with the current page.

Identify that the request uses the GET method and contains offset and limit parameters.

Example URL (note the offset and limit parameters near the end):

https://www.zhihu.com/api/v4/questions/432119474/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics%3Bsettings.table_of_content.enabled%3B&offset=3&limit=5&sort_by=default&platform=desktop
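To see which query parameters matter, the URL can be split with the standard library. The sketch below uses a shortened copy of the example URL: the long include value is left out purely for readability, so this is an illustration rather than the exact request.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the example URL: the long `include` value is omitted
# here for readability only (an assumption of this sketch).
example_url = (
    'https://www.zhihu.com/api/v4/questions/432119474/answers'
    '?offset=3&limit=5&sort_by=default&platform=desktop'
)

parts = urlsplit(example_url)
# parse_qs returns a list per key; keep the first value of each.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(parts.path)
print(params)
```

The path identifies the question, while offset and limit are the two values that control which slice of answers comes back.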

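The offset parameter controls paging: it appears to be the index of the first answer returned (the example above pairs offset=3 with limit=5), while limit is the number of answers per request. With 6200+ answers and 5 items per request, covering everything takes about 1240 requests, with offset advancing by 5 each time.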
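As a rough check of that arithmetic, the offsets needed to cover a question can be listed as follows. This assumes offset counts answers to skip, which is an inference from the example URL rather than documented behaviour; the helper name request_offsets is hypothetical.

```python
import math

def request_offsets(total_answers, limit=5):
    """Return the offset for each request needed to cover total_answers.

    Assumes offset is an answer index, not a page number.
    """
    n_requests = math.ceil(total_answers / limit)
    return [i * limit for i in range(n_requests)]

offsets = request_offsets(6200, limit=5)
print(len(offsets))   # number of requests needed
print(offsets[:4])    # first few offset values
```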

2. Attempt to access a specific page

import requests

template = 'https://www.zhihu.com/api/v4/questions/432119474/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics%3Bsettings.table_of_content.enabled%3B&offset={offset}&limit=5&sort_by=default&platform=desktop'

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36'}

url = template.format(offset=1)
resp = requests.get(url, headers=headers)
print(resp)

<Response [200]>

The response contains JSON data that can be parsed with resp.json().
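A 200 status does not guarantee a JSON body; a block page or login page would make the JSON parsing step raise an error. Below is a minimal sketch of a more defensive parse, working on the body text so that it runs without a network call. The function name parse_answers and both sample bodies are made up for illustration.

```python
import json

def parse_answers(body_text):
    """Return the list stored under 'data', or [] if the body is not the expected JSON."""
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    data = payload.get('data', [])
    return data if isinstance(data, list) else []

# Made-up bodies for illustration only.
good_body = '{"data": [{"id": 1, "excerpt": "hello"}]}'
bad_body = '<html>blocked</html>'

print(parse_answers(good_body))
print(parse_answers(bad_body))
```

With a real response, the same function could be called as parse_answers(resp.text).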

3. Parse the data of interest

We extract only the author, id and excerpt fields. Note that author is itself a nested object, as the sample output shows.

for info in resp.json()['data']:
    author = info['author']
    Id = info['id']
    text = info['excerpt']
    data = {'author': author, 'id': Id, 'text': text}
    print(data)

Sample output (truncated):

{'author': {...}, 'id': 1597225705, 'text': '...'}
{'author': {...}, 'id': 1597380398, 'text': '...'}
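Because author is a nested object, writing it to CSV as-is stores a stringified dict in one cell. One option is to flatten each record first. The sketch below assumes the author object carries a 'name' key; that key is an assumption, so check it against the Preview panel before relying on it. The helper flatten_answer is hypothetical.

```python
def flatten_answer(info):
    """Reduce one answer record to flat, CSV-friendly fields."""
    author = info.get('author') or {}
    return {
        # 'name' is an assumed key inside the author object.
        'author': author.get('name', ''),
        'id': info.get('id'),
        'text': info.get('excerpt', ''),
    }

# Made-up record shaped like the sample output above.
sample = {'author': {'name': 'example_user'}, 'id': 123, 'excerpt': 'some text'}
print(flatten_answer(sample))
```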

4. Store results to CSV

import csv

# 'w' starts a fresh file, so re-running this step does not append a second header row
csvf = open('zhihu.csv', 'w', encoding='utf-8', newline='')
fieldnames = ['author', 'id', 'text']
writer = csv.DictWriter(csvf, fieldnames=fieldnames)
writer.writeheader()

for info in resp.json()['data']:
    author = info['author']
    Id = info['id']
    text = info['excerpt']
    writer.writerow({'author': author, 'id': Id, 'text': text})

csvf.close()
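To confirm that text containing commas or line breaks survives the trip, a small round trip with made-up rows can help. This sketch writes to a temporary file (not zhihu.csv) and uses a with-block so the file is closed even if a write fails.

```python
import csv
import os
import tempfile

# Made-up rows; the text values deliberately contain a comma and a newline.
rows = [
    {'author': 'user_a', 'id': 1, 'text': 'first, with a comma'},
    {'author': 'user_b', 'id': 2, 'text': 'second\nwith a newline'},
]
fieldnames = ['author', 'id', 'text']

path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# newline='' lets the csv module handle line endings and quoting itself.
with open(path, 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

with open(path, encoding='utf-8', newline='') as f:
    loaded = list(csv.DictReader(f))

print(loaded)
```

Note that every value comes back as a string, including id.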

5. Combine all code

import requests
import csv
import time

# 'w' starts a fresh file, so re-running the script does not append a second header row
csvf = open('zhihu.csv', 'w', encoding='utf-8', newline='')
fieldnames = ['author', 'id', 'text']
writer = csv.DictWriter(csvf, fieldnames=fieldnames)
writer.writeheader()

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36'}

template = 'https://www.zhihu.com/api/v4/questions/432119474/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics%3Bsettings.table_of_content.enabled%3B&offset={offset}&limit=5&sort_by=default&platform=desktop'

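for page in range(100):
    # offset appears to count answers, not pages, so advance it by the page size (limit=5)
    url = template.format(offset=page * 5)
    resp = requests.get(url, headers=headers)
    for info in resp.json()['data']:
        author = info['author']
        Id = info['id']
        text = info['excerpt']
        writer.writerow({'author': author, 'id': Id, 'text': text})
    time.sleep(1)  # pause between requests to avoid hammering the server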

csvf.close()
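The loop above runs a fixed number of requests. A possible variation stops as soon as a request returns no items and skips answers whose id has already been seen. The sketch below takes the fetching step as a function argument so it can be tried without network access; collect_answers and fake_fetch are hypothetical names, and the fake data stands in for real responses.

```python
import time

def collect_answers(fetch_page, limit=5, max_requests=100, delay=0.0):
    """Call fetch_page(offset) repeatedly until a request returns no items."""
    seen_ids = set()
    collected = []
    for i in range(max_requests):
        items = fetch_page(i * limit)
        if not items:
            break  # an empty result is taken to mean there are no more answers
        for item in items:
            if item['id'] not in seen_ids:
                seen_ids.add(item['id'])
                collected.append(item)
        time.sleep(delay)
    return collected

# Stand-in for the real request: 12 fake answers served 5 at a time.
fake_answers = [{'id': n, 'excerpt': f'answer {n}'} for n in range(12)]

def fake_fetch(offset, limit=5):
    return fake_answers[offset:offset + limit]

result = collect_answers(fake_fetch)
print(len(result))
```

For real use, fetch_page would wrap the requests.get call and return the list under 'data', with delay set to a second or more.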

Test the CSV

import pandas as pd

df = pd.read_csv('zhihu.csv')
print(df.head())

Generate a Word Cloud

Concatenate all text fields, segment them with jieba, and plot the word frequencies as a word cloud to see which terms dominate the answers.

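import jieba
from collections import Counter
from pyecharts import options as opts
from pyecharts.charts import WordCloud

# Join every excerpt into one string; skip missing values so join() does not fail
text_contents = ''.join(df['text'].dropna().astype(str))
words = jieba.lcut(text_contents)
# Drop single-character tokens, which are mostly particles and punctuation
words = [w for w in words if len(w) > 1]
# Counter tallies in one pass instead of calling words.count() for every unique word
wordfreqs = list(Counter(words).items())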

c = (
    WordCloud()
    .add('', wordfreqs, word_size_range=[10, 70])
    .set_global_opts(title_opts=opts.TitleOpts(title='Zhihu Word Cloud'))
    .render_notebook()
)
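Common filler words tend to crowd out the interesting ones in a word cloud, so it can help to filter stopwords and keep only the most frequent terms before plotting. Below is a minimal sketch using only the standard library; the helper top_word_freqs, the stopword set and the token list are all made up for illustration, with the tokens standing in for the output of jieba.lcut.

```python
from collections import Counter

def top_word_freqs(tokens, stopwords=(), top_n=100):
    """Count tokens longer than one character, skipping stopwords."""
    kept = [t for t in tokens if len(t) > 1 and t not in stopwords]
    return Counter(kept).most_common(top_n)

# Hand-written tokens standing in for the output of jieba.lcut.
tokens = ['data', 'analysis', 'a', 'data', 'we', 'crawler', 'data', 'analysis']
freqs = top_word_freqs(tokens, stopwords={'we'}, top_n=2)
print(freqs)
```

The returned list of (word, count) pairs has the same shape as wordfreqs, so it could be passed to WordCloud().add in its place.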
