How to Scrape Douban Movie Reviews and Visualize Them with a Word Cloud in Python

This tutorial walks through using Python 3.5 to fetch the latest movies from Douban, extract their IDs and titles, crawl user comments, clean the text with regular expressions, segment Chinese words using Jieba, remove stopwords, compute word frequencies, and finally generate a word‑cloud visualization of the reviews.

MaGe Linux Operations

Aug 18, 2017

Introduction

After recently learning Python, I built a small practice project that analyzes Douban movie reviews. Wolf Warrior 2 was at the top of the latest box‑office rankings, so I decided to scrape the now‑playing movie list and the short comments for that film.

Goal Overview

The project consists of three main steps:

Fetch web page data

Clean the data

Display the results with a word cloud

The code was written for Python 3.5.

1. Fetching Web Page Data

First, access the Douban now‑playing page using urllib.request:

from urllib import request
resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
html_data = resp.read().decode('utf-8')
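
The call above relies on urllib's defaults. As a minimal sketch, assuming you want an explicit User-Agent header and a timeout (neither is in the original code, and whether Douban requires either is not something this article confirms), the same fetch could be wrapped like this. The names build_request and fetch_html and the header value are illustrative.

```python
from urllib import request

NOWPLAYING_URL = 'https://movie.douban.com/nowplaying/hangzhou/'


def build_request(url, user_agent='Mozilla/5.0 (compatible; tutorial-script)'):
    """Return a Request carrying an explicit User-Agent header (assumed value)."""
    return request.Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, timeout=10):
    """Download a page and decode it as UTF-8.

    Network failures surface as urllib.error.URLError.
    """
    with request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode('utf-8')
```

Calling fetch_html(NOWPLAYING_URL) would then produce the same kind of html_data string as the three lines above.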

Parse the HTML with BeautifulSoup:

from bs4 import BeautifulSoup as bs
soup = bs(html_data, 'html.parser')

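Locate the div with id="nowplaying" and extract each li element (class="list-item") to obtain the movie ID (the data-subject attribute) and name (the alt attribute of the img tag). The steps are wrapped in a function, getNowPlayingMovie_list(), so that the main routine can call it later:

def getNowPlayingMovie_list():
    # Fetch and parse the now-playing page (same steps as shown above).
    resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    nowplaying_list = []
    for item in nowplaying_movie_list:
        # data-subject holds the movie ID; the poster's alt text holds the title.
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        tag_img_item = item.find_all('img')[0]
        nowplaying_dict['name'] = tag_img_item['alt']
        nowplaying_list.append(nowplaying_dict)
    return nowplaying_list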
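
To see what this extraction picks out without touching the network, here is a rough sketch that runs on a made-up HTML fragment. It deliberately swaps BeautifulSoup for the standard library's html.parser so it has no dependencies. The fragment, the IDs, the titles, and the class name NowPlayingParser are all hypothetical, and unlike the code above it does not check that each li sits inside div#nowplaying.

```python
from html.parser import HTMLParser

# Hypothetical fragment mirroring the structure described above.
SAMPLE_HTML = '''
<div id="nowplaying">
  <ul>
    <li class="list-item" data-subject="1000001"><img alt="Movie A" src="a.jpg"></li>
    <li class="list-item" data-subject="1000002"><img alt="Movie B" src="b.jpg"></li>
  </ul>
</div>
'''


class NowPlayingParser(HTMLParser):
    """Collect {'id', 'name'} dicts from li elements with class list-item."""

    def __init__(self):
        super().__init__()
        self.movies = []
        self._current = None  # dict for the <li> currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'li' and 'list-item' in classes:
            self._current = {'id': attrs.get('data-subject'), 'name': None}
        elif tag == 'img' and self._current is not None and self._current['name'] is None:
            # Only the first <img> inside the <li> supplies the title.
            self._current['name'] = attrs.get('alt')

    def handle_endtag(self, tag):
        if tag == 'li' and self._current is not None:
            self.movies.append(self._current)
            self._current = None


parser = NowPlayingParser()
parser.feed(SAMPLE_HTML)
print(parser.movies)
```

The result has the same shape as nowplaying_list: one dict per movie with 'id' and 'name' keys.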

Next, fetch the short comments for each movie. The comment URL follows the pattern

https://movie.douban.com/subject/{movieId}/comments?start={start}&limit=20

The function below builds the URL, retrieves the page, and extracts the comment text from the p tag inside each div with class="comment":

def getCommentsById(movieId, pageNum):
    # Pages are 1-based; each page holds 20 comments.
    if pageNum > 0:
        start = (pageNum - 1) * 20
    else:
        return []  # invalid page number: return no comments
    requrl = 'https://movie.douban.com/subject/' + movieId + '/comments?start=' + str(start) + '&limit=20'
    resp = request.urlopen(requrl)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    comment_div_list = soup.find_all('div', class_='comment')
    eachCommentList = []
    for item in comment_div_list:
        # .string is None when the <p> has no single text child; skip those.
        comment_text = item.find_all('p')[0].string
        if comment_text is not None:
            eachCommentList.append(comment_text)
    return eachCommentList
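
The paging arithmetic is easy to get off by one, so here is a small sketch of just the URL-building step. It assumes the URL pattern quoted above; the helper name build_comment_url, the constant PAGE_SIZE, and the movie ID '1000001' are illustrative and not part of the original code.

```python
from urllib.parse import urlencode

PAGE_SIZE = 20  # comments per page, matching limit=20 in the URL pattern


def build_comment_url(movie_id, page_num):
    """Return the short-comment URL for a 1-based page number."""
    if page_num < 1:
        raise ValueError('page_num must be >= 1')
    query = urlencode({'start': (page_num - 1) * PAGE_SIZE, 'limit': PAGE_SIZE})
    return 'https://movie.douban.com/subject/{}/comments?{}'.format(movie_id, query)


for page in (1, 2, 3):
    print(build_comment_url('1000001', page))
```

Page 1 maps to start=0, page 2 to start=20, and page 3 to start=40.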

In the main routine, take the first movie in the list, retrieve its first ten pages of comments, and concatenate them into a single string:

commentList = []
NowPlayingMovie_list = getNowPlayingMovie_list()
for i in range(10):
    num = i + 1
    commentList_temp = getCommentsById(NowPlayingMovie_list[0]['id'], num)
    commentList.append(commentList_temp)
comments = ''
for k in range(len(commentList)):
    # str() of each page's list is crude but sufficient here:
    # the cleaning step below keeps only the Chinese characters.
    comments = comments + str(commentList[k]).strip()
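
The loop above sends ten requests back to back and stops at the first network error. One possible variation, sketched under the assumption that a short pause between requests and skipping failed pages are acceptable, is shown here. The names collect_comments and fake_fetch and the delay value are hypothetical; fake_fetch stands in for getCommentsById so the sketch runs offline.

```python
import time


def collect_comments(fetch_page, movie_id, pages=10, delay=1.0):
    """Gather comments from pages 1..pages into one flat list.

    fetch_page(movie_id, page_num) is any callable returning a list of
    strings, for example getCommentsById. delay is the pause in seconds
    between requests (an assumed value, not from the original article).
    """
    collected = []
    for page_num in range(1, pages + 1):
        try:
            page_comments = fetch_page(movie_id, page_num)
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            print('page {} failed: {}'.format(page_num, exc))
            continue
        collected.extend(page_comments or [])
        if page_num < pages:
            time.sleep(delay)
    return collected


def fake_fetch(movie_id, page_num):
    """Stand-in for a real fetcher so the sketch runs without network access."""
    return ['comment {}-{}'.format(page_num, i) for i in range(2)]


comments_demo = collect_comments(fake_fetch, '1000001', pages=3, delay=0)
print(len(comments_demo))
```

With the real fetcher, ''.join(collect_comments(getCommentsById, movie_id)) would give a single string suitable for the cleaning step.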

2. Data Cleaning

Remove punctuation and other non‑Chinese characters by keeping only the runs of text that match a Chinese‑character range (一 to 龥, that is U+4E00 to U+9FA5) and joining them back together:

import re
pattern = re.compile(r'[一-龥]+')
filterdata = re.findall(pattern, comments)
cleaned_comments = ''.join(filterdata)
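
To make the effect of this filter concrete, here is a small sketch on a made-up comment string. It uses the same character range written with escapes; the helper name keep_chinese and the sample text are illustrative.

```python
import re

# Same range as 一-龥 above, written with Unicode escapes.
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')


def keep_chinese(text):
    """Return only the Chinese characters of text, concatenated."""
    return ''.join(CHINESE_RUN.findall(text))


sample = '战狼2真好看!!! very good 666'
print(keep_chinese(sample))  # prints 战狼真好看
```

Note that digits and Latin letters are dropped too, so a title such as 战狼2 loses its number.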

Segment the cleaned Chinese text with jieba and store the result in a pandas DataFrame:

import jieba
import pandas as pd
segment = jieba.lcut(cleaned_comments)
words_df = pd.DataFrame({'segment': segment})

Load a stop‑words list from stopwords.txt and filter them out of the DataFrame:

stopwords = pd.read_csv('stopwords.txt', index_col=False, quoting=3, sep='\t', names=['stopword'], encoding='utf-8')
words_df = words_df[~words_df['segment'].isin(stopwords['stopword'])]
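
The same filtering idea can be sketched without pandas, which may help when checking what the isin mask does. This is an alternative illustration, not the article's method: it assumes the stop-word file has one word per line, and the token list, the in-memory file, and the function names are hypothetical stand-ins for jieba's output and stopwords.txt.

```python
import io


def load_stopwords(lines):
    """Build a set of stop words from an iterable of lines (one word per line)."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not stop words, preserving order."""
    return [tok for tok in tokens if tok not in stopwords]


# Hypothetical stand-ins for stopwords.txt and for the segmented comments.
stopword_file = io.StringIO('的\n了\n是\n')
tokens = ['电影', '的', '特效', '是', '很', '震撼', '了']

filtered = remove_stopwords(tokens, load_stopwords(stopword_file))
print(filtered)
```

For a real file, passing open('stopwords.txt', encoding='utf-8') to load_stopwords would play the role of the StringIO object.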

3. Word‑Frequency Statistics

Group by the segmented words and count occurrences, then sort descending:

# Count how often each word occurs; the column name '计数' means "count".
words_stat = words_df.groupby('segment').size().reset_index(name='计数')
words_stat = words_stat.sort_values(by=['计数'], ascending=False)
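
For comparison, the same count-and-sort step can be sketched with the standard library's collections.Counter. This is an alternative to the pandas approach, shown on a hypothetical token list rather than real review data.

```python
from collections import Counter

# Hypothetical tokens standing in for the filtered jieba output.
tokens = ['特效', '震撼', '特效', '剧情', '特效', '震撼']

word_counts = Counter(tokens)
top_words = word_counts.most_common(2)  # (word, count) pairs, most frequent first
print(top_words)
```

Calling dict(word_counts) gives a plain word-to-count mapping, which is the kind of structure a word-cloud step consumes.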

4. Visualizing with a Word Cloud

Generate a word cloud using the WordCloud library, specifying a Chinese font (simhei.ttf) and a white background:

from wordcloud import WordCloud
import matplotlib.pyplot as plt
wordcloud = WordCloud(font_path='simhei.ttf', background_color='white', max_font_size=80)
# Map each of the 1000 most frequent words to its count.
word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
# Recent wordcloud releases expect a dict of word -> frequency here
# (older releases took a list of tuples).
wordcloud = wordcloud.fit_words(word_frequence)
plt.imshow(wordcloud)
plt.axis('off')  # hide the axes around the image
plt.show()

The resulting image shows the most frequent words in the Wolf Warrior 2 short comments, with larger words occurring more often, which gives a quick impression of what reviewers talk about.
