How to Scrape Douban Reviews and Uncover Hidden Sentiment Trends with Python
This article demonstrates how to crawl Douban short reviews for the TV show "Actors Please Take Your Place" Season 2, clean and deduplicate the data, apply Baidu's SKEP sentiment model, and visualize word clouds, rating distributions, posting times, and sentiment scores, providing full Python code for replication.
Introduction
The popular Chinese variety show "Actors Please Take Your Place" Season 2 has generated extensive discussion on Douban. This article crawls short reviews (positive, neutral, and negative) from Douban, performs visualization and sentiment analysis, and offers the complete Python code for replication.
Visualization Analysis
Directors Mentioned More Than Actors
Word‑cloud analysis shows the term "director" appears more frequently than "actor", indicating that discussion focuses heavily on the directors. Positive keywords such as "acting" and "like" coexist with negative words like "disgusting" and "trash".
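The comparison between "director" and "actor" comes down to token counts. A minimal sketch of that counting step, assuming the comments have already been segmented into tokens; the token list is made-up sample data, not the scraped corpus:

```python
from collections import Counter

# Hypothetical, already-segmented tokens; real input would come from jieba.
tokens = ["导演", "演员", "导演", "演技", "喜欢", "导演", "演员", "恶心"]

# Count every token and list the most frequent ones first.
freq = Counter(tokens)
top_words = freq.most_common(3)

print(top_words)
print(freq["导演"] > freq["演员"])
```

A word cloud draws more frequent tokens larger, so the same counts decide which terms dominate the image.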
More Than Half of Reviews Are Negative
Review classification reveals 55% negative, 21% neutral, and 24% positive comments, reflecting disappointment compared with the first season and criticism of certain on‑screen actions.
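One way such shares could be computed is to map each star rating to a class and count. The sketch below assumes a 1 to 5 star scale with 4 and 5 treated as positive, 3 as neutral, and 1 and 2 as negative; both the thresholds and the sample ratings are assumptions, not values taken from the article's data:

```python
from collections import Counter

def classify(star):
    """Map a 1-5 star rating to a review class (assumed thresholds)."""
    if star >= 4:
        return "positive"
    if star == 3:
        return "neutral"
    return "negative"

# Made-up star ratings, for illustration only.
stars = [1, 1, 2, 3, 5, 1, 4, 2, 3, 1]

counts = Counter(classify(s) for s in stars)
shares = {label: round(100 * n / len(stars), 1) for label, n in counts.items()}
print(shares)
```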
Comments Peak Late at Night
Time‑distribution analysis shows that 27.89% of comments are posted between 22:00 and 24:00.
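A share like this can be obtained by assigning each timestamp to a two-hour slot and normalizing the counts. A minimal pandas sketch on made-up timestamps; the two-hour slot width is an assumption inferred from the 22:00 to 24:00 window:

```python
import pandas as pd

# Made-up timestamps; the real column is df["comment_time"].
times = pd.to_datetime(pd.Series([
    "2020-10-10 22:15:00", "2020-10-10 23:40:00", "2020-10-11 09:05:00",
    "2020-10-11 13:30:00", "2020-10-11 22:59:00",
]))

# Assign each comment to a two-hour slot: 0 -> 00:00-02:00, ..., 11 -> 22:00-24:00.
slot = times.dt.hour // 2

# Share of comments per slot, in percent.
share = (slot.value_counts(normalize=True) * 100).round(2)
print(share.sort_index())
```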
Positive Reviews Receive Few Likes
Five‑star positive reviews obtained only 828 likes, while one‑star negative reviews received 3,776 likes.
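Like totals per star level can be computed with a single group-by. The sketch below uses the article's column names on made-up rows and assumes `movie_star` holds numeric ratings, which the source does not confirm:

```python
import pandas as pd

# Made-up rows using the article's column names.
df = pd.DataFrame({
    "movie_star": [5, 1, 1, 3, 5, 2],
    "comment_voted": [10, 120, 80, 5, 2, 30],
})

# Total likes received by reviews at each star level.
likes_by_star = df.groupby("movie_star")["comment_voted"].sum()
print(likes_by_star)
```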
Guo Jingming Mentioned Most
Among the personalities, Guo Jingming is referenced 319 times, surpassing other directors and participants.
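Mention counts of this kind can be tallied with plain substring counting. A minimal sketch on made-up comments; only 郭敬明 (Guo Jingming) is named in the article, and the second name is included purely for illustration:

```python
# Made-up comments; the real input would be df["comment"].
comments = [
    "郭敬明的点评太有争议了",
    "喜欢尔冬升导演,郭敬明就算了",
    "演员演得不错",
]

# Names to look for; only 郭敬明 is named in the article, the rest are illustrative.
names = ["郭敬明", "尔冬升"]

# str.count tallies every occurrence, so a name repeated in one comment counts twice.
mentions = {name: sum(c.count(name) for c in comments) for name in names}
print(mentions)
```

Counting occurrences rather than comments is a design choice; counting each comment at most once would need `name in c` instead.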
Sentiment Score Around 0.4, Peaks in the Early Morning
Using Baidu's SKEP sentiment model, the average sentiment score fluctuates around 0.4, with a noticeable positive peak around 05:00.
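An hourly curve like this can be produced by averaging the model's positive probability per hour of day. The sketch below assumes the "sentiment score" is that positive probability (stored as `pos_p` in the article's code) and uses made-up values:

```python
import pandas as pd

# Made-up rows; "pos_p" mirrors the positive-probability column used in the article's code.
df = pd.DataFrame({
    "comment_time": pd.to_datetime([
        "2020-10-10 05:10:00", "2020-10-10 05:45:00",
        "2020-10-10 22:05:00", "2020-10-10 22:30:00",
    ]),
    "pos_p": [0.9, 0.7, 0.2, 0.4],
})

# Average positive probability for each hour of the day.
hourly = df.groupby(df["comment_time"].dt.hour)["pos_p"].mean()
print(hourly)
```

Hours with very few comments can produce spikes, so it may be worth checking the comment count per hour before reading much into an early-morning peak.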
Technical Implementation
Data Acquisition
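The scraper in this section builds its request URL by string concatenation. As an alternative sketch, the same query string can be assembled with `urllib.parse.urlencode`. The helper name, the placeholder ID, and the `"h"` value for `percent_type` are assumptions, and the `ck` parameter is left out here because it looks session-specific:

```python
from urllib.parse import urlencode

def build_comments_url(movie_id, percent_type, start, limit=20):
    """Assemble the short-review URL from its parts (hypothetical helper)."""
    query = urlencode({
        "percent_type": percent_type,
        "start": start,
        "limit": limit,
        "status": "P",
        "sort": "new_score",
        "comments_only": 1,
    })
    return f"https://movie.douban.com/subject/{movie_id}/comments?{query}"

# "12345" is a placeholder ID; "h" is an assumed value for percent_type.
url = build_comments_url("12345", "h", start=40)
print(url)
```

Stepping `start` by `limit` (0, 20, 40, ...) would walk through successive pages.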
import requests
from bs4 import BeautifulSoup

def get_page_info(start_num, type):
    # movie_id is expected to be defined at module level (the Douban subject ID as a string).
    # type is the value sent as percent_type; start_num is the paging offset.
    url = "https://movie.douban.com/subject/" + movie_id + "/comments?percent_type=" + type + "&start=" + str(start_num) + "&limit=20&status=P&sort=new_score&comments_only=1&ck=myI8"
    print(url)
    header = {
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Connection": "keep-alive",
        "Host": "movie.douban.com",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
        "Cookie": "..."
    }
    response = requests.get(url, headers=header)
    # unicode_escape turns \uXXXX escape sequences in the response body into characters before parsing.
    req_parser = BeautifulSoup(response.content.decode('unicode_escape'), features="html.parser")
    # Each short review sits in a div with class "comment-item".
    comments = req_parser.find_all('div', class_="comment-item")
    # ...
Data Cleaning
Import the CSV, select the relevant columns, convert data types, and apply a custom mechanical compression function (yasuo) that collapses consecutively repeated substrings, so that a comment like "好看好看好看" becomes "好看".
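To illustrate what this cleaning step does, here is a minimal sketch on made-up rows. It uses a regular expression with a backreference as a stand-in for the compression idea; this is a different technique from the article's own function, and the helper name `squeeze_repeats` is hypothetical:

```python
import re
import pandas as pd

def squeeze_repeats(text):
    """Collapse consecutively repeated substrings, e.g. '好看好看' -> '好看'."""
    return re.sub(r"(.+?)\1+", r"\1", text)

# Made-up rows using the article's column names.
df = pd.DataFrame({
    "user_name": ["a", "a", "b"],
    "comment_time": ["2020-10-10 22:15:00", "2020-10-10 22:15:00", "2020-10-11 09:05:00"],
    "comment": ["哈哈哈哈好看好看", "哈哈哈哈好看好看", "导演很专业"],
})

df = df.drop_duplicates().reset_index(drop=True)         # remove exact duplicate rows
df["comment_time"] = pd.to_datetime(df["comment_time"])  # strings -> timestamps
df["comment"] = df["comment"].astype(str).apply(squeeze_repeats)

print(df["comment"].tolist())
```

Compression of this kind also collapses legitimately doubled characters (for example "hello" becomes "helo"), which is usually acceptable for word-cloud and sentiment input but would distort text meant for display.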
import pandas as pd

df = pd.read_csv("/菜J学Python/豆瓣/35163988.csv")
# Keep only the columns used in the analysis.
df = df[["user_name", "comment_voted", "movie_star", "comment_time", "comment"]]
df["comment_time"] = pd.to_datetime(df["comment_time"])
df["comment"] = df["comment"].astype('str')

def yasuo(st):
    # Mechanical compression: for each substring length i, collapse consecutive repeats.
    for i in range(1, len(st) // 2 + 1):
        for j in range(len(st)):
            if st[j:j+i] == st[j+i:j+2*i]:
                k = j + i
                # Skip over every further repetition of the same substring.
                while st[k:k+i] == st[k+i:k+2*i] and k < len(st):
                    k = k + i
                st = st[:j] + st[k:]
    return st

df["comment"] = df["comment"].apply(yasuo)
Sentiment Analysis
import paddlehub as hub

# Load the pretrained sentiment module from PaddleHub.
senta = hub.Module(name="senta_bilstm")
texts = df['comment'].tolist()
input_data = {'text': texts}
res = senta.sentiment_classify(data=input_data)
# Store the positive-class probability of each comment.
df['pos_p'] = [x['positive_probs'] for x in res]
Data Visualization
# Define the word-segmentation function: load stop words, add custom words, then generate the word cloud.
import jieba
import stylecloud

def get_cut_words(content_series):
    # Load the stop-word list, one word per line.
    stop_words = []
    with open('./stop_words.txt', 'r', encoding='utf-8') as f:
        for line in f.readlines():
            stop_words.append(line.strip())
    # Register custom words so jieba keeps them as single tokens.
    my_words = ['', '']
    for w in my_words:
        jieba.add_word(w)
    # Extra stop words specific to this corpus.
    my_stop_words = ['节目', '中国', '一部']
    stop_words.extend(my_stop_words)
    # Join all comments into one string and segment it in precise mode.
    word_num = jieba.lcut(content_series.str.cat(sep='。'), cut_all=False)
    # Drop stop words and single-character tokens.
    return [i for i in word_num if i not in stop_words and len(i) >= 2]

text1 = get_cut_words(df['comment'])
stylecloud.gen_stylecloud(text=' '.join(text1), max_words=200, collocations=False,
                          font_path='字酷堂清楷体.ttf', icon_name='fas fa-video',
                          size=653, output_name='./演员2词云图.png')
Disclaimer
This analysis is for learning and research purposes only; conclusions are for reference.
The author’s knowledge of the film industry is limited, so descriptions may be imperfect.
Python Crawling & Data Mining
