How to Scrape Douban Reviews and Uncover Hidden Sentiment Trends with Python

This article demonstrates how to crawl Douban short reviews for the TV show "Actors Please Take Your Place" Season 2, clean and deduplicate the data, apply Baidu's SKEP sentiment model, and visualize word clouds, rating distributions, posting times, and sentiment scores, providing full Python code for replication.

Python Crawling & Data Mining

Nov 23, 2020

Introduction

The popular Chinese variety show "Actors Please Take Your Place" Season 2 has generated extensive discussion on Douban. This article crawls short reviews (positive, neutral, and negative) from Douban, performs visualization and sentiment analysis, and offers the complete Python code for replication.

Visualization Analysis

Directors Mentioned More Than Actors

Word‑cloud analysis shows the term "director" appears more frequently than "actor", indicating that discussion focuses heavily on the directors. Positive keywords such as "acting" and "like" coexist with negative words like "disgusting" and "trash".

More Than Half of the Reviews Are Negative

Review classification reveals 55% negative, 21% neutral, and 24% positive comments, reflecting disappointment compared with the first season and criticism of certain on‑screen actions.

Comments Peak Late at Night

Time‑distribution analysis shows that 27.89% of comments are posted between 22:00 and 24:00.
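A minimal sketch of how such a time distribution could be computed, assuming the comment timestamps are available as datetime objects. The function name two_hour_shares and the sample timestamps are hypothetical and do not come from the original code.

```python
from collections import Counter
from datetime import datetime


def two_hour_shares(timestamps):
    """Return {window_label: percentage of comments} for twelve 2-hour windows."""
    # hour // 2 maps 0-1 -> 0, 2-3 -> 1, ..., 22-23 -> 11
    counts = Counter(ts.hour // 2 for ts in timestamps)
    total = sum(counts.values())
    shares = {}
    for window in range(12):
        label = f"{window * 2:02d}:00-{window * 2 + 2:02d}:00"
        shares[label] = round(100 * counts.get(window, 0) / total, 2) if total else 0.0
    return shares


# Made-up timestamps, for illustration only
sample = [
    datetime(2020, 11, 1, 22, 15),
    datetime(2020, 11, 1, 23, 40),
    datetime(2020, 11, 2, 9, 5),
    datetime(2020, 11, 2, 22, 59),
]
shares = two_hour_shares(sample)
print(shares["22:00-24:00"])
```

With the real data, the same function could be fed df["comment_time"] after the conversion with pd.to_datetime shown in the Data Cleaning section.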

Positive Reviews Receive Few Likes

Five‑star positive reviews obtained only 828 likes, while one‑star negative reviews received 3,776 likes.
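One way such like totals could be obtained is a group-by over the star rating. The sketch below uses the column names movie_star and comment_voted from the cleaning code in this article, but the rows are made up and the assumption that movie_star holds integers from 1 to 5 is not confirmed by the source.

```python
import pandas as pd

# Made-up rows for illustration; movie_star as an integer 1-5 is an assumption
sample = pd.DataFrame({
    "movie_star": [5, 5, 1, 1, 3],
    "comment_voted": [10, 2, 40, 25, 7],
})

# Total likes received by the reviews at each star rating
likes_by_star = sample.groupby("movie_star")["comment_voted"].sum()
print(likes_by_star)
```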

Guo Jingming Mentioned Most

Among the personalities, Guo Jingming is referenced 319 times, surpassing other directors and participants.
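Mention counts like this can be approximated by counting substring occurrences across all comments. This is a sketch only: the helper count_mentions and the sample comments are hypothetical, and the source does not say whether it counted occurrences or the number of comments containing a name.

```python
def count_mentions(comments, names):
    """Count how many times each name occurs across all comments."""
    return {name: sum(c.count(name) for c in comments) for name in names}


# Made-up comments, for illustration only
sample_comments = [
    "郭敬明的点评很有争议",
    "尔冬升导演说得对",
    "郭敬明，又是郭敬明",
]
mentions = count_mentions(sample_comments, ["郭敬明", "尔冬升"])
print(mentions)
```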

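Sentiment Score Averages Around 0.4 and Peaks in the Early Morning

Using Baidu's SKEP sentiment model, each comment receives a score between 0 and 1 that represents the predicted probability that it is positive. The average score fluctuates around 0.4, with a noticeable positive peak around 05:00.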
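The hourly pattern could be derived by averaging the positive probability per hour of posting. The sketch below reuses the column names comment_time and pos_p from the code in this article. The sample values are made up, and the grouping approach is an assumption about how the chart was built.

```python
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "comment_time": pd.to_datetime([
        "2020-11-01 05:10:00", "2020-11-01 05:45:00",
        "2020-11-01 22:05:00", "2020-11-02 22:30:00",
    ]),
    "pos_p": [0.9, 0.7, 0.2, 0.4],
})

# Mean positive probability for each hour of the day (0-23)
hourly_mean = sample.groupby(sample["comment_time"].dt.hour)["pos_p"].mean()
print(hourly_mean)
```

Hours with very few comments, which is likely for the early morning, can produce unstable averages, so it may be worth checking the per-hour comment count alongside the mean.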

Technical Implementation

Data Acquisition

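import requests
from bs4 import BeautifulSoup

# movie_id must be defined beforehand as a string holding the Douban subject ID.

def get_page_info(start_num, percent_type):
    """Fetch one page (20 short reviews) of one review class, starting at offset start_num."""
    url = (
        "https://movie.douban.com/subject/" + movie_id
        + "/comments?percent_type=" + percent_type
        + "&start=" + str(start_num)
        + "&limit=20&status=P&sort=new_score&comments_only=1&ck=myI8"
    )
    print(url)
    header = {
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Connection": "keep-alive",
        "Host": "movie.douban.com",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
        "Cookie": "..."
    }
    response = requests.get(url, headers=header)
    # unicode_escape turns the \uXXXX sequences in the response back into characters
    req_parser = BeautifulSoup(response.content.decode('unicode_escape'), features="html.parser")
    # Each short review sits in a <div class="comment-item">
    comments = req_parser.find_all('div', class_="comment-item")
    # ...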
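To crawl more than one page, the start offset has to advance in steps of 20. The sketch below only builds the page URLs and makes no requests. The helpers build_comment_url and page_offsets are hypothetical, the subject ID is assumed from the CSV file name used later in this article, the value "h" for positive reviews is an assumption, and the session-specific ck parameter is left out.

```python
from urllib.parse import urlencode


def build_comment_url(movie_id, percent_type, start, limit=20):
    """Build the URL for one page of short reviews (hypothetical helper)."""
    query = urlencode({
        "percent_type": percent_type,
        "start": start,
        "limit": limit,
        "status": "P",
        "sort": "new_score",
        "comments_only": 1,
    })
    return f"https://movie.douban.com/subject/{movie_id}/comments?{query}"


def page_offsets(max_comments, limit=20):
    """Offsets 0, 20, 40, ... that cover max_comments reviews."""
    return range(0, max_comments, limit)


# "35163988" is assumed from the CSV file name; "h" (positive) is an assumption
urls = [build_comment_url("35163988", "h", s) for s in page_offsets(60)]
for u in urls:
    print(u)
```

When the URLs are actually requested, pausing between calls (for example with time.sleep) helps keep the load on the site low.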

Data Cleaning

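Import the CSV, select the relevant columns, and convert the data types. Then apply a custom "mechanical compression" function, which collapses consecutively repeated substrings so that padded comments such as "好看好看好看" are reduced to a single "好看".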

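import pandas as pd

# Load the crawled reviews and keep only the columns used in the analysis
df = pd.read_csv("/菜J学Python/豆瓣/35163988.csv")
df = df[["user_name", "comment_voted", "movie_star", "comment_time", "comment"]]
df["comment_time"] = pd.to_datetime(df["comment_time"])
df["comment"] = df["comment"].astype(str)

def yasuo(st):
    """Mechanical compression: collapse consecutively repeated substrings."""
    # i is the length of the candidate repeated unit, j its start position
    for i in range(1, len(st) // 2 + 1):
        for j in range(len(st)):
            if st[j:j+i] == st[j+i:j+2*i]:
                k = j + i
                # Skip over every further copy of the unit
                while k < len(st) and st[k:k+i] == st[k+i:k+2*i]:
                    k = k + i
                # Keep the text before j, then everything from the last copy onward
                st = st[:j] + st[k:]
    return st

df["comment"] = df["comment"].apply(yasuo)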
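To make the effect of mechanical compression concrete, here is a small self-contained sketch. compress_repeats is a simplified reimplementation written for illustration, not the yasuo function from this article, and the sample comments are made up. The final step, dropping comments that become identical after compression, is one possible way to deduplicate and is an assumption.

```python
def compress_repeats(text):
    """Collapse consecutive repeats of any substring into a single copy."""
    unit = 1  # length of the candidate repeated unit
    while unit <= len(text) // 2:
        start = 0
        while start + 2 * unit <= len(text):
            if text[start:start + unit] == text[start + unit:start + 2 * unit]:
                # Drop one copy, then re-check the same position for more repeats
                text = text[:start + unit] + text[start + 2 * unit:]
            else:
                start += 1
        unit += 1
    return text


# Made-up comments, for illustration only
raw = ["好看好看好看", "好看", "哈哈哈哈太差了"]
compressed = [compress_repeats(c) for c in raw]

# Remove comments that are identical after compression, keeping first-seen order
unique = list(dict.fromkeys(compressed))
print(compressed)
print(unique)
```

Note that this kind of compression also shortens legitimate repetition (for example "哈哈" becomes "哈"), which is usually acceptable for word-frequency and sentiment work but does change the text.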

Sentiment Analysis

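import paddlehub as hub

# Load a pretrained Chinese sentiment module from PaddleHub.
# Note: the module name used here is senta_bilstm, not a SKEP module.
senta = hub.Module(name="senta_bilstm")
texts = df['comment'].tolist()
input_data = {'text': texts}
res = senta.sentiment_classify(data=input_data)
# positive_probs is the predicted probability (0 to 1) that a comment is positive
df['pos_p'] = [x['positive_probs'] for x in res]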

Data Visualization

# Define a word-segmentation function (load stop words, add custom words), then generate the word cloud
import jieba
import stylecloud

def get_cut_words(content_series):
    # Load the stop-word list, one word per line
    with open('./stop_words.txt', 'r', encoding='utf-8') as f:
        stop_words = {line.strip() for line in f}
    # Custom words that jieba should keep as single tokens (left blank in the source)
    my_words = ['', '']
    for w in my_words:
        jieba.add_word(w)
    # Frequent but uninformative words ("show", "China", "a") to drop as well
    my_stop_words = ['节目', '中国', '一部']
    stop_words.update(my_stop_words)
    # Join all comments into one string and segment it in precise mode
    word_num = jieba.lcut(content_series.str.cat(sep='。'), cut_all=False)
    # Keep tokens of at least two characters that are not stop words
    return [i for i in word_num if i not in stop_words and len(i) >= 2]

text1 = get_cut_words(df['comment'])
stylecloud.gen_stylecloud(text=' '.join(text1), max_words=200, collocations=False,
                          font_path='字酷堂清楷体.ttf', icon_name='fas fa-video',
                          size=653, output_name='./演员2词云图.png')
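A word cloud shows relative frequency only visually. To compare terms such as "director" and "actor" numerically, the token list can be counted directly. The sketch below assumes a token list like the one returned by get_cut_words. The helper top_words and the sample tokens are hypothetical.

```python
from collections import Counter


def top_words(tokens, n=10):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)


# Made-up tokens, for illustration only ("导演" = director, "演员" = actor)
sample_tokens = ["导演", "演员", "导演", "演技", "导演", "演员"]
top = top_words(sample_tokens, n=2)
print(top)
```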

Disclaimer

This analysis is for learning and research purposes only; its conclusions are for reference only.

The author’s knowledge of the film industry is limited, so descriptions may be imperfect.
