Python Text Mining & Sentiment Analysis of Douban Reviews for “Letter to Grandma”
This article demonstrates how to use Python to crawl 7,275 Douban short reviews of the film “Letter to Grandma”, clean the data, generate a word‑cloud and frequency bar chart, and perform sentiment analysis that reveals over 91% of comments are positive.
As a Python enthusiast, the author crawled 7,275 short comments for the movie “Letter to Grandma” from Douban and performed text mining and sentiment analysis.
First, the required libraries are installed:
pip install requests beautifulsoup4 pandas jieba wordcloud matplotlib snownlp lxml
The spider (not fully shown) is run with the user’s own cookie to fetch the comments and save them as 豆瓣_给阿嬷的情书_短评.csv.
Data cleaning is then performed:
import pandas as pd
import re
df = pd.read_csv("豆瓣_给阿嬷的情书_短评.csv", encoding="utf-8-sig")
df = df.dropna(subset=["comment"]).copy()
df = df[df["comment"].str.len() > 3]
def clean_comment(text):
    # Drop characters outside the Basic Multilingual Plane (mostly emoji)
    text = re.sub(r'[\U00010000-\U0010ffff]', '', text)
    # Keep only Chinese characters, ASCII letters, and digits
    text = re.sub(r'[^\u4e00-\u9fa5a-zA-Z0-9]', '', text)
    return text.strip()
df["clean_comment"] = df["comment"].apply(clean_comment)
df = df.drop_duplicates(subset=["clean_comment"])
print(f"✅ 清洗完成,剩余有效评论:{len(df)} 条")  # "Cleaning done; N valid comments remain"
For word‑frequency statistics and visualization, all cleaned comments are concatenated and segmented with jieba, stop words are removed, and the 20 most frequent words are counted. A word cloud and a bar chart are then generated:
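import jieba
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Let matplotlib render Chinese labels (assumes the SimHei font is installed)
plt.rcParams["font.sans-serif"] = ["SimHei"]
# Join with spaces so jieba cannot form words across comment boundaries
all_text = " ".join(df["clean_comment"].tolist())
stop_words = {"的","了","我","是","很","都","就","也","还","在","和","不","有","着","看","感觉","觉得","真的","这部","电影"}
words = jieba.lcut(all_text)
# Drop stop words, single characters, and the space separators
valid_words = [w for w in words if w not in stop_words and len(w) > 1]
word_count = Counter(valid_words)
top20 = word_count.most_common(20)
print("\n🔥 高频TOP20词汇:\n", top20)  # "Top-20 high-frequency words"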
# word cloud
wc = WordCloud(background_color="white", font_path="simhei.ttf", width=1200, height=700, max_words=300, colormap="Oranges")
wc.generate(" ".join(valid_words))
wc.to_file("豆瓣_阿嬷情书_词云图.png")
# bar chart
names = [x[0] for x in top20]
nums = [x[1] for x in top20]
plt.figure(figsize=(14,6))
plt.bar(names, nums, color="#ff9966")
plt.title("《给阿嬷的情书》豆瓣短评高频词汇TOP20", fontsize=16)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("豆瓣_高频词汇柱状图.png")
plt.show()
The top‑20 words include “南枝”, “女性” (women), “潮汕” (Chaoshan), “这个” (this), and “故事” (story), suggesting that the film’s core characters, female‑growth theme, and regional background attract the most attention.
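The counting step can be tried without jieba or the full dataset. The sketch below is illustrative only: the token list is made up and stands in for the output of jieba.lcut, and top_words is a hypothetical helper name, not something from the original code.

```python
from collections import Counter

# Hypothetical, already-segmented tokens standing in for jieba.lcut output.
tokens = ["故事", "的", "女性", "故事", "很", "潮汕", "女性", "故事", "好"]
stop_words = {"的", "很"}

def top_words(tokens, stop_words, n=20):
    """Drop stop words and single-character tokens, then return the n most common."""
    valid = [w for w in tokens if w not in stop_words and len(w) > 1]
    return Counter(valid).most_common(n)

print(top_words(tokens, stop_words, n=3))
```

Since the filler word “这个” made it into the reported top 20, adding it to stop_words would probably give a cleaner ranking.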
Sentiment analysis is carried out with SnowNLP:
from snownlp import SnowNLP
def get_sentiment(text):
    # SnowNLP returns a score in [0, 1]; higher means more positive
    score = SnowNLP(text).sentiments
    if score > 0.5:
        return "温暖正向", score  # warm-positive
    elif score == 0.5:
        return "中性平淡", score  # neutral
    else:
        return "遗憾伤感", score  # regretful-sad
df[["sent_type", "sent_score"]] = pd.DataFrame(df["clean_comment"].apply(get_sentiment).tolist(), index=df.index)
sent_stat = df["sent_type"].value_counts()
print("\n🔥 情感分布统计:\n", sent_stat)  # "Sentiment distribution"
plt.figure(figsize=(8,8))
plt.pie(sent_stat.values, labels=sent_stat.index, autopct="%1.2f%%", colors=["#ffb380","#ffe6cc","#ff8080"])
plt.title("《给阿嬷的情书》豆瓣短评情感分布", fontsize=16)
plt.savefig("豆瓣_情感分布饼图.png")
plt.show()
The sentiment distribution shows 91.49% of comments classified as “温暖正向” (warm‑positive) and 8.51% as other sentiments, indicating an overwhelmingly positive audience reception.
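One caveat about the labeling rule: get_sentiment assigns the neutral label only when score == 0.5 exactly, and a floating-point score is unlikely to hit that value, so the neutral class is probably close to empty. A possible alternative is a tolerance band around 0.5. The sketch below is an assumption-laden illustration, not the author's method: label_sentiment, distribution, the band width of 0.1, and the sample scores are all made up, and applying such a band would change the percentages reported above.

```python
from collections import Counter

def label_sentiment(score, band=0.1):
    """Map a score in [0, 1] to a label, treating 0.5 +/- band as neutral.

    band=0.1 is an illustrative assumption, not a value from the article.
    """
    if score > 0.5 + band:
        return "温暖正向"  # warm-positive
    if score < 0.5 - band:
        return "遗憾伤感"  # regretful-sad
    return "中性平淡"      # neutral

def distribution(scores, band=0.1):
    """Return the percentage share of each label, rounded to two decimals."""
    counts = Counter(label_sentiment(s, band) for s in scores)
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}

# Made-up scores for illustration only.
sample_scores = [0.95, 0.88, 0.55, 0.45, 0.10]
print(distribution(sample_scores))
```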
Finally, the cleaned data with sentiment labels are saved to 豆瓣_阿嬷情书_最终分析数据.csv, and three visualizations—word cloud, frequency bar chart, and sentiment pie chart—are generated.
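The export code is not shown in the source. With pandas, it would presumably be a call such as df.to_csv("豆瓣_阿嬷情书_最终分析数据.csv", index=False, encoding="utf-8-sig"). As a dependency-free illustration of why the utf-8-sig encoding matters, here is a minimal sketch using only the standard library. The helper save_rows, the sample row, and the demo file name are hypothetical.

```python
import csv
import os
import tempfile

def save_rows(rows, path, fieldnames):
    """Write dict rows to CSV with a UTF-8 BOM so Excel detects the encoding."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# Made-up row mirroring the columns built earlier in the article.
rows = [
    {"clean_comment": "故事很动人", "sent_type": "温暖正向", "sent_score": 0.93},
]
fields = ["clean_comment", "sent_type", "sent_score"]

# Write to the system temp directory so the demo leaves the working folder alone.
out_path = os.path.join(tempfile.gettempdir(), "douban_demo.csv")
save_rows(rows, out_path, fields)
```

Reading the file back with encoding="utf-8-sig", as the cleaning step does, strips the byte-order mark again, so the first column name stays intact.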
