How to Build a Python Web Scraper for Zhihu Answers and Generate Word Clouds
This article walks through the complete process of designing a Python web scraper to collect Zhihu answer data, parse author, ID and excerpt fields, store the results in CSV files, and finally visualize the text with a word‑cloud, including all necessary code snippets and explanations.
Scraper Design Process
Explore URL patterns
Attempt to access a specific page
Parse the data of interest
Store results to CSV
Combine and tidy the code
1. Explore URL patterns
Press F12 to open the developer tools.
Select the Network panel and view all answers.
Observe the request URLs shown in the panel.
For each URL, click Preview and compare the content with the current page.
Identify that the request uses the GET method and contains offset and limit parameters.
Example URL (note the offset and limit parameters near the end):
https://www.zhihu.com/api/v4/questions/432119474/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics%3Bsettings.table_of_content.enabled%3B&offset=3&limit=5&sort_by=default&platform=desktop
The offset parameter controls paging: it is the number of answers to skip, so with limit=5 successive pages use offset=0, 5, 10 and so on. With 6200+ answers and 5 items per page there are about 1240 pages.
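The two paging parameters and the page count can be checked with the standard library. This is a small sketch rather than part of the original scraper: the URL is shortened by leaving out the long include value, and total_answers is the approximate figure quoted above.

```python
from math import ceil
from urllib.parse import urlsplit, parse_qs

# Shortened form of the answers URL; the long `include` value is omitted here.
url = ('https://www.zhihu.com/api/v4/questions/432119474/answers'
       '?offset=3&limit=5&sort_by=default&platform=desktop')

# parse_qs maps each query parameter to a list of its values
params = parse_qs(urlsplit(url).query)
offset = int(params['offset'][0])
limit = int(params['limit'][0])

total_answers = 6200                  # approximate count quoted in the article
pages = ceil(total_answers / limit)   # number of requests needed at this limit

print(offset, limit, pages)           # prints: 3 5 1240
```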
2. Attempt to access a specific page
import requests
template = 'https://www.zhihu.com/api/v4/questions/432119474/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics%3Bsettings.table_of_content.enabled%3B&offset={offset}&limit=5&sort_by=default&platform=desktop'
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36'}
url = template.format(offset=1)
resp = requests.get(url, headers=headers)
print(resp)
Output:
<Response [200]>
The response contains JSON data that can be parsed with resp.json().
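A request will not always come back with status 200, so it can be worth checking before parsing. The helper below is a minimal sketch under that assumption; fetch_json and its client parameter are hypothetical names, and the 10-second timeout is an arbitrary choice.

```python
def fetch_json(client, url, headers=None, timeout=10):
    """Request `url` and return the decoded JSON body.

    `client` is anything with a requests-style get(), for example the
    `requests` module itself or a `requests.Session` (hypothetical parameter).
    """
    resp = client.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()   # raise on 4xx/5xx instead of parsing an error page
    return resp.json()
```

With the variables defined above, a call would look like fetch_json(requests, url, headers=headers). Passing the client in keeps the helper easy to exercise without touching the network.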
3. Parse the data of interest
We extract only the author, id and excerpt fields.
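for info in resp.json()['data']:
    author = info['author']     # nested dict describing the answer's author
    answer_id = info['id']      # numeric answer ID
    text = info['excerpt']      # short preview of the answer text
    data = {'author': author, 'id': answer_id, 'text': text}
    print(data)
Sample output (truncated):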
{'author': {...}, 'id': 1597225705, 'text': '...'}
{'author': {...}, 'id': 1597380398, 'text': '...'}
4. Store results to CSV
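The author value shown in the sample output is a nested dictionary, so writing it straight into a CSV cell stores its whole string representation. Here is a minimal sketch of flattening each answer into a plain row first. The name key inside author and the sample values are assumptions for illustration and should be checked against the real response; to_row is a hypothetical helper.

```python
def to_row(info):
    """Turn one answer object into a flat dict suitable for a CSV row."""
    author = info.get('author') or {}
    return {
        'author': author.get('name', ''),   # assumed key; verify in the real JSON
        'id': info['id'],
        'text': (info.get('excerpt') or '').strip(),
    }

# Made-up answer object shaped like the sample output above
sample = {
    'author': {'name': 'example_user'},
    'id': 1597225705,
    'excerpt': ' sample excerpt ',
}
print(to_row(sample))
# prints: {'author': 'example_user', 'id': 1597225705, 'text': 'sample excerpt'}
```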
import csv
# 'w' starts a fresh file, so the header row is written exactly once;
# the with-block closes the file even if a row fails to write
with open('zhihu.csv', 'w', encoding='utf-8', newline='') as csvf:
    fieldnames = ['author', 'id', 'text']
    writer = csv.DictWriter(csvf, fieldnames=fieldnames)
    writer.writeheader()
    for info in resp.json()['data']:
        author = info['author']
        answer_id = info['id']
        text = info['excerpt']
        writer.writerow({'author': author, 'id': answer_id, 'text': text})
5. Combine and tidy the code
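When the single-page steps are combined, the loop over pages is the main new piece. The sketch below isolates it as a generator, assuming that offset counts answers rather than pages (the usual meaning in offset/limit APIs) and that an empty data list marks the end. The names iter_answers, fetch_page and max_pages are hypothetical and not part of the original script.

```python
def iter_answers(fetch_page, limit=5, max_pages=100):
    """Yield answer objects page by page until a page comes back empty.

    `fetch_page(offset)` is a hypothetical callable that returns the decoded
    JSON for one request, e.g. a thin wrapper around requests.get(...).json().
    """
    for page in range(max_pages):
        payload = fetch_page(page * limit)   # skip the answers already read
        answers = payload.get('data', [])
        if not answers:                      # nothing left to read
            break
        yield from answers
```

Stopping on an empty page means max_pages only acts as a safety cap, so the same loop works for questions with fewer answers than expected.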
import requests
import csv
import time
csvf = open('zhihu.csv', 'w', encoding='utf-8', newline='')  # 'w' avoids a duplicate header row on reruns
fieldnames = ['author', 'id', 'text']
writer = csv.DictWriter(csvf, fieldnames=fieldnames)
writer.writeheader()
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36'}
template = 'https://www.zhihu.com/api/v4/questions/432119474/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics%3Bsettings.table_of_content.enabled%3B&offset={offset}&limit=5&sort_by=default&platform=desktop'
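limit = 5  # must match limit=5 in the template
for page in range(100):
    # offset is the number of answers to skip, so step it by `limit`
    url = template.format(offset=page * limit)
    resp = requests.get(url, headers=headers)
    for info in resp.json()['data']:
        author = info['author']
        answer_id = info['id']
        text = info['excerpt']
        writer.writerow({'author': author, 'id': answer_id, 'text': text})
    time.sleep(1)  # pause between requests to avoid hammering the server
csvf.close()
Test the CSV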
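Before loading the file into pandas, a quick standard-library check can confirm that it reads back cleanly and contains no repeated answers. Repeats are possible if, for instance, the answer list shifts while the scraper is running. This is a sketch only: load_unique_rows is a hypothetical helper, and the in-memory demo data stands in for zhihu.csv.

```python
import csv
import io

def load_unique_rows(csv_file):
    """Read rows from an open CSV file, keeping the first row seen per id."""
    seen = set()
    rows = []
    for row in csv.DictReader(csv_file):
        if row['id'] in seen:
            continue            # same answer scraped more than once
        seen.add(row['id'])
        rows.append(row)
    return rows

# In-memory stand-in for zhihu.csv with one repeated id
demo = io.StringIO('author,id,text\nA,1,first\nB,2,second\nA,1,first\n')
rows = load_unique_rows(demo)
print(len(rows))                # prints: 2
```

For the real file, the same function can be called on open('zhihu.csv', encoding='utf-8', newline='').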
import pandas as pd
df = pd.read_csv('zhihu.csv')
print(df.head())
Generate a Word Cloud
Combine all text fields and build a word cloud to visualise which words appear most often across the answers.
import jieba
from collections import Counter
from pyecharts import options as opts
from pyecharts.charts import WordCloud
# Join the excerpts, skipping empty cells that pandas reads as NaN
text_contents = ''.join(df['text'].dropna().astype(str))
words = jieba.lcut(text_contents)
words = [w for w in words if len(w) > 1]   # drop single-character tokens
# Counter tallies every word in one pass; most_common() returns (word, count) pairs
wordfreqs = Counter(words).most_common()
c = (
    WordCloud()
    .add('', wordfreqs, word_size_range=[10, 70])
    .set_global_opts(title_opts=opts.TitleOpts(title='Zhihu Word Cloud'))
    .render_notebook()   # displays the chart inline in a Jupyter notebook
)
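Common filler words tend to dominate a word cloud, so the counting step is a natural place to filter them out. The sketch below shows one way to do that on an already tokenised list, using only the standard library. The helper word_frequencies, its parameters and the sample tokens are made up for illustration; in the script above the tokens would come from jieba.lcut.

```python
from collections import Counter

def word_frequencies(tokens, stopwords=(), min_len=2, top_n=None):
    """Count tokens, skipping short tokens and stopwords.

    Returns a list of (word, count) pairs sorted by count, descending.
    """
    stop = set(stopwords)
    kept = [t for t in tokens if len(t) >= min_len and t not in stop]
    return Counter(kept).most_common(top_n)

# Made-up tokens standing in for the output of jieba.lcut
tokens = ['数据', '的', '分析', '数据', '我们', 'Python', '数据', '分析']
print(word_frequencies(tokens, stopwords={'我们'}))
# prints: [('数据', 3), ('分析', 2), ('Python', 1)]
```

The returned list has the same (word, count) shape that the WordCloud chart receives above, so it could be passed in place of wordfreqs.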
Python Crawling & Data Mining
