Scraping Douban Movie Comments with Selenium and Generating a Word Cloud in Python
This tutorial demonstrates how to use Selenium to crawl short comments from a Douban movie page, extract them via XPath, paginate through results, and finally create a visual word cloud using Python's wordcloud library.
The article begins with a brief introduction, noting the popularity of the TV series "Squid Game" and the author's curiosity about its Douban reviews, which serves as a practice case for Selenium web scraping.
It then guides the reader to open Google Chrome, press F12 to access developer tools, and inspect the page structure. The short comments are identified inside <span class="short"> elements, and the author shows how to generate XPath expressions directly from the browser.
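Before driving a real browser, the XPath idea can be checked offline. The sketch below uses a hypothetical HTML fragment and the standard library's ElementTree (which supports a subset of XPath) instead of the live Douban page; the comment strings are invented for illustration.

```python
# A minimal offline check of the XPath pattern, using a hypothetical
# HTML fragment and stdlib ElementTree instead of the live page.
import xml.etree.ElementTree as ET

html = """
<div id="comments">
  <div class="comment"><span class="short">Great pacing and acting.</span></div>
  <div class="comment"><span class="votes">120</span></div>
  <div class="comment"><span class="short">The ending felt rushed.</span></div>
</div>
"""

root = ET.fromstring(html)
# Equivalent to //span[@class="short"] relative to the root element
shorts = [span.text for span in root.findall('.//span[@class="short"]')]
print(shorts)  # ['Great pacing and acting.', 'The ending felt rushed.']
```

Note that only the spans with class="short" are matched; sibling spans such as vote counts are skipped, which is exactly why the class predicate is part of the XPath.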
Key code snippets are provided to launch Chrome with Selenium, bypass automation detection, and navigate to the target URL:
from selenium import webdriver

# The page to open
url = 'https://movie.douban.com/subject/34812928/comments?limit=20&status=P&sort=new_score'
# Evade bot detection
option = webdriver.ChromeOptions()
# option.headless = True
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=option)
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
})
driver.get(url)

To retrieve the comment texts, the following XPath is used:
//span[@class="short"]

The corresponding Python code iterates over the matched elements and concatenates their text:

options = driver.find_elements(By.XPATH, '//span[@class="short"]')
for i in options:
    text = text + i.text

For pagination, the XPath of the "next page" button is identified as //*[@id="paginator"]/a, and the click action is performed:
nextpage = driver.find_element(By.XPATH, '//*[@id="paginator"]/a')
nextpage.click()

A complete script combines these steps, repeatedly fetching comments across multiple pages (limited to ten iterations) and then passes the aggregated text to a word-cloud utility.
The word‑cloud utility (wordcloudutil.py) defines two functions: trans_CN to segment Chinese text with jieba and insert spaces, and getWordCloud to generate and display a word cloud using a mask image and a Chinese font.
# -*- coding: utf-8 -*-
# @Time : 2021/10/9 20:54
# @Author : xiaow
# @File : wordcloudutil.py
from wordcloud import WordCloud
import PIL.Image as image
import numpy as np
import jieba

def trans_CN(text):
    # Segment the input string with jieba
    word_list = jieba.cut(text)
    # Join the tokens with spaces so WordCloud can split them apart
    result = " ".join(word_list)
    return result

def getWordCloud(text):
    # print(text)
    text = trans_CN(text)
    # Background mask image for the word cloud
    mask = np.array(image.open("E://file//pics//mask3.jpg"))
    wordcloud = WordCloud(
        mask=mask,
        # Font file that supports Chinese characters
        font_path="C:\\Windows\\Fonts\\STXINGKA.TTF",
        background_color='white',
    ).generate(text)
    image_produce = wordcloud.to_image()
    image_produce.show()

The main scraping script (test.py) puts everything together: it opens the Douban comments page, repeatedly extracts comment texts, clicks the next-page button, stops after ten pages, prints the collected text, and finally calls wordcloudutil.getWordCloud to visualize the data.
# -*- coding: utf-8 -*-
# @Time : 2021/6/27 22:29
# @Author : xiaow
# @File : test.py
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from api import wordcloudutil

if __name__ == '__main__':
    url = 'https://movie.douban.com/subject/34812928/comments?limit=20&status=P&sort=new_score'
    # Evade bot detection
    option = webdriver.ChromeOptions()
    # option.headless = True
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    option.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=option)
    driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
        'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
    })
    driver.get(url)
    text = ''
    j = 0
    while True:
        time.sleep(1)
        driver.switch_to.window(driver.window_handles[0])
        options = driver.find_elements(By.XPATH, '//span[@class="short"]')
        for i in options:
            text = text + i.text
        time.sleep(2)
        nextpage = driver.find_element(By.XPATH, '//*[@id="paginator"]/a')
        nextpage.click()
        j = j + 1
        if j > 10:
            break
    print(text)
    wordcloudutil.getWordCloud(text)

After execution, the collected comments are visualized as a word cloud image, demonstrating a simple yet effective pipeline from web scraping to data visualization.
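Since WordCloud tokenizes on whitespace and sizes words by frequency, the effect of the segmentation step can be sanity-checked with a plain frequency count. The sketch below uses a hypothetical, already-segmented string standing in for the output of trans_CN (jieba's exact segmentation depends on its dictionary, so it is not reproduced here).

```python
# A quick check of what WordCloud does with the segmented text: split
# on whitespace, count occurrences, size words by count. The input is
# a hypothetical pre-segmented string standing in for trans_CN output.
from collections import Counter

segmented = "剧情 紧凑 演技 在线 剧情 反转 结局 仓促 剧情 精彩"
counts = Counter(segmented.split())
print(counts.most_common(2))  # [('剧情', 3), ('紧凑', 1)]
```

Without the space-insertion step, the whole Chinese string would be treated as a handful of giant "words", which is why trans_CN runs before WordCloud.generate.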