
Python‑Based Scraping, Cleaning, Sentiment Analysis and Visualization of Douban Movie Reviews

The article walks through a full Python workflow that scrapes up to 500 Douban movie reviews for "Dying to Survive" and "Hidden Blade," cleans and stores them in pandas, performs SnowNLP sentiment analysis, and visualizes city distribution, rating trends, and word clouds with pyecharts.

Tencent Cloud Developer

This article demonstrates a complete data‑analysis workflow on Chinese movie reviews from Douban, using the films "Dying to Survive" (《我不是药神》) and "Hidden Blade" (《邪不压正》) as case studies.

0. Requirement Analysis

Obtain review data via web scraping.

Clean and store the data.

Analyze city distribution, sentiment, and rating trends.

Practice pandas, web‑scraping and visualization skills.

1. Preparation

1.1 Web‑page analysis

Douban limits crawling: only 500 comments per film are publicly accessible, and requests are capped at roughly 40 per minute during the day and 60 at night. The start parameter in the URL controls pagination: each click on the "next page" button adds 20 to start, but manually incrementing it by 10 also works.
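The pagination scheme above can be sketched as a small URL builder (the comment_url helper and the example movie id are illustrative, not from the article's code; the query string mirrors the URL format used later in the article):

```python
# Build paginated short-comment URLs: start advances by 10 per request.
def comment_url(movie_id, page):
    return ("https://movie.douban.com/subject/{}/comments"
            "?start={}&limit=20&sort=new_score&status=P").format(movie_id, page * 10)

urls = [comment_url(26752088, p) for p in range(3)]
# urls[1] contains "start=10"
```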

1.2 Layout analysis

Key fields to extract:

User ID

Comment content

Score

Comment date

User city (requires visiting the user’s profile page)

2. Data acquisition – crawling

2.1 Get cookies

Douban requires authentication cookies. The cookies can be copied from Chrome’s developer tools.

import requests

# Authentication headers and cookies copied from the browser's developer tools.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
cookies = {
    'cookie': 'bid=GOOb4vXwNcc; douban-fav-remind=1; viewed="27611266_26886337"; ps=y; ue="citpys原创分享@163.com"; push_noty_num=0; push_doumail_num=0; ap=1; loc-last-index-location-id="108288"; ll="108288"; dbcl2="187285881:N/y1wyPpmA8"; ck=4wlL'
}
# id is the Douban movie id; page is the zero-based page index (start advances by 10).
url = "https://movie.douban.com/subject/" + str(id) + "/comments?start=" + str(page * 10) + "&limit=20&sort=new_score&status=P"
res = requests.get(url, headers=headers, cookies=cookies)
res.encoding = "utf-8"
if res.status_code == 200:
    print("\nPage {} of short comments scraped successfully!".format(page + 1))
    print(url)
else:
    print("\nFailed to scrape page {}!".format(page + 1))

2.3 Anti‑scraping delay

import time
import random

# Sleep a randomized 1-2 seconds between requests to stay under the rate limit.
time.sleep(round(random.uniform(1, 2), 2))

2.4 Parsing logic

Because some comments have no score, the XPath for score may actually return the date. The script checks the format and swaps values when necessary.

# x is the lxml HTML tree of the page; i indexes the i-th comment block.
name = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a/text()'.format(i))
score = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/span[2]/@title'.format(i))
date = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/span[3]/@title'.format(i))
# A comment without a score shifts the fields: the "score" XPath returns the date.
if not re.compile(r'\d{4}-\d{2}-\d{2}').match(score[0]):
    date = score
    score = ["null"]
content = x.xpath('//*[@id="comments"]/div[{}]/div[2]/p/span/text()'.format(i))

2.5 Movie name extraction

# The HTML tags inside the original pattern were lost during extraction; a
# plausible reconstruction matches the page <title>, which ends with " 短评"
# ("short comments") after the movie name.
pattern = re.compile(r'<title>\s*(.*?) 短评\s*</title>', re.S)
global movie_name
movie_name = re.findall(pattern, res.text)[0]

3. Data storage

The collected fields are stored in a pandas DataFrame and saved as a CSV file.

import pandas as pd

# Assemble the per-field lists collected during crawling and persist to CSV.
infos = {'name': name_list, 'city': city_list, 'content': content_list, 'score': score_list, 'date': date_list}
data = pd.DataFrame(infos, columns=['name', 'city', 'content', 'score', 'date'])
data.to_csv(str(id) + "_comments.csv")

4. Data cleaning

City information is noisy (empty, overseas, malformed). The script filters Chinese characters, removes punctuation, and matches the remaining strings against the city list provided by pyecharts.

# line holds one raw city string.
line = line.strip()
# Keep only Chinese characters (CJK Unified Ideographs range).
p2 = re.compile(r'[^\u4e00-\u9fa5]')
zh = " ".join(p2.split(line)).strip()
zh = ",".join(zh.split())
# Strip any remaining letters, digits, and punctuation.
line = re.sub(r'[A-Za-z0-9!!,%\[\],。]', "", zh)

After cleaning, the script builds a dictionary result that counts occurrences of each city.
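The per-city count can be sketched with collections.Counter (the sample city list below is hypothetical; in the pipeline the values come from the cleaned city column of the scraped CSV):

```python
from collections import Counter

# Hypothetical cleaned city values standing in for the CSV's "city" column.
cities = ["北京", "上海", "北京", "杭州", "北京", "上海"]
result = dict(Counter(cities))
print(result)  # {'北京': 3, '上海': 2, '杭州': 1}
```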

5. Sentiment analysis with SnowNLP

SnowNLP provides Chinese word segmentation, POS tagging, sentiment scoring, text classification, keyword extraction, summarization, etc. The sentiment score ranges from 0 (negative) to 1 (positive); scores below 0.5 are treated as negative.
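The 0.5 threshold described above can be sketched as a small classifier (the scores here are hard-coded stand-ins; in the real pipeline each score would come from SnowNLP(comment_text).sentiments):

```python
# Bin sentiment scores into positive/negative using the 0.5 cutoff.
def classify(scores, threshold=0.5):
    positive = sum(1 for s in scores if s >= threshold)
    return {"positive": positive, "negative": len(scores) - positive}

print(classify([0.92, 0.88, 0.47, 0.61, 0.13]))  # {'positive': 3, 'negative': 2}
```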

# pyecharts 0.5.x API (the 1.x API differs).
from pyecharts import Line

attr, val = [], []
info = count_sentiment(csv_file)
info = sorted(info.items(), key=lambda x: x[0], reverse=False)
for each in info[:-1]:
    attr.append(each[0])
    val.append(each[1])
line = Line(csv_file + ":影评情感分析")  # chart title: "movie-review sentiment analysis"
line.add("", attr, val, is_smooth=True, is_more_utils=True)
line.render(csv_file + "_情感分析曲线图.html")  # output: "sentiment curve" HTML

6. Visualization and interpretation

Using pyecharts, the following charts are generated:

Geo map (dot map) of comment‑origin cities.

Geo heatmap of comment density.

Bar chart ranking cities by comment count.

Pie chart of city distribution.

Line chart of daily rating trends.

Word‑clouds for each film.

Key observations:

Top 10 cities for "Dying to Survive": Beijing, Shanghai, Nanjing, Hangzhou, Shenzhen, Guangzhou, Chengdu, Changsha, Chongqing, Xi’an.

Top 10 cities for "Hidden Blade": Beijing, Shanghai, Guangzhou, Chengdu, Hangzhou, Nanjing, Xi’an, Shenzhen, Changsha, Harbin.

Sentiment distribution shows a strong positive bias (most scores >0.5).

Rating spikes occur within the first week of release, with a small pre‑release “preview” segment.

Word‑clouds highlight themes such as "China", "reality", "social", "hope" for "Dying to Survive" and frequent mentions of director Jiang Wen for "Hidden Blade".

7. Conclusion

The project reinforces pandas manipulation and web‑scraping techniques.

Building a domain‑specific sentiment corpus would improve analysis accuracy.

Pyecharts provides an attractive way to present geographic and statistical results.
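The corpus idea above can be sketched as splitting scraped reviews into positive and negative training sets by their star-rating label (the label sets and the split_corpus helper are assumptions for illustration; Douban's text ratings run 力荐/推荐/还行/较差/很差 from highest to lowest, and SnowNLP's sentiment.train('neg.txt', 'pos.txt') could then consume the two resulting files):

```python
# Split (rating, text) review pairs into positive/negative training corpora.
POSITIVE = {"力荐", "推荐"}
NEGATIVE = {"较差", "很差"}

def split_corpus(rows):
    pos = [text for rating, text in rows if rating in POSITIVE]
    neg = [text for rating, text in rows if rating in NEGATIVE]
    return pos, neg

pos, neg = split_corpus([("力荐", "很感人"), ("很差", "剧情拖沓"), ("推荐", "值得一看")])
# pos has two entries, neg has one
```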

All source code is available on GitHub: https://github.com/Ctipsy/DA_projects/tree/master/我不是药神

Python · Sentiment Analysis · Data Visualization · Web Scraping · pandas · Douban · pyecharts
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
