Scraping Douban Movie Comments with Selenium and Generating a Word Cloud in Python
This tutorial demonstrates how to use Selenium to crawl short comments from a Douban movie page, extract them via XPath, paginate through results, and finally create a visual word cloud using Python's wordcloud library.
The article begins with a brief introduction, noting the popularity of the TV series "Squid Game" and the author's curiosity about its Douban reviews, which serves as a practice case for Selenium web scraping.
It then guides the reader to open Google Chrome, press F12 to access developer tools, and inspect the page structure. The short comments are identified inside <span class="short"> elements, and the author shows how to generate XPath expressions directly from the browser.
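Before driving a real browser, the XPath idea can be checked offline. The sketch below uses a hypothetical HTML fragment and the standard library's ElementTree (which supports a subset of XPath) instead of the live Douban page; the comment strings are invented for illustration.

```python
# A minimal offline check of the XPath pattern, using a hypothetical
# HTML fragment and stdlib ElementTree instead of the live page.
import xml.etree.ElementTree as ET

html = """
<div id="comments">
  <div class="comment"><span class="short">Great pacing and acting.</span></div>
  <div class="comment"><span class="votes">120</span></div>
  <div class="comment"><span class="short">The ending felt rushed.</span></div>
</div>
"""

root = ET.fromstring(html)
# Equivalent to //span[@class="short"] relative to the root element
shorts = [span.text for span in root.findall('.//span[@class="short"]')]
print(shorts)  # ['Great pacing and acting.', 'The ending felt rushed.']
```

Note that only the spans with class="short" are matched; sibling spans such as vote counts are skipped, which is exactly why the class predicate is part of the XPath.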
Key code snippets are provided to launch Chrome with Selenium, bypass automation detection, and navigate to the target URL:
from selenium import webdriver

# The page to open
url = 'https://movie.douban.com/subject/34812928/comments?limit=20&status=P&sort=new_score'
# Evade bot detection
option = webdriver.ChromeOptions()
# option.headless = True
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=option)
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
})
driver.get(url)

To retrieve the comment texts, the following XPath is used:
//span[@class="short"]

The corresponding Python code iterates over the matched elements and concatenates their text:

options = driver.find_elements(By.XPATH, '//span[@class="short"]')
for i in options:
    text = text + i.text

For pagination, the XPath of the "next page" button is identified as //*[@id="paginator"]/a, and the click action is performed:
nextpage = driver.find_element(By.XPATH, '//*[@id="paginator"]/a')
nextpage.click()

A complete script combines these steps, repeatedly fetching comments across multiple pages (limited to ten iterations) and then passes the aggregated text to a word-cloud utility.
The word‑cloud utility (wordcloudutil.py) defines two functions: trans_CN to segment Chinese text with jieba and insert spaces, and getWordCloud to generate and display a word cloud using a mask image and a Chinese font.
# -*- coding: utf-8 -*-
# @Time : 2021/10/9 20:54
# @Author : xiaow
# @File : wordcloudutil.py
from wordcloud import WordCloud
import PIL.Image as image
import numpy as np
import jieba

def trans_CN(text):
    # Segment the input string with jieba
    word_list = jieba.cut(text)
    # Join the tokens with spaces so WordCloud can split them apart
    result = " ".join(word_list)
    return result

def getWordCloud(text):
    # print(text)
    text = trans_CN(text)
    # Background mask image for the word cloud
    mask = np.array(image.open("E://file//pics//mask3.jpg"))
    wordcloud = WordCloud(
        mask=mask,
        # Font file that supports Chinese characters
        font_path="C:\\Windows\\Fonts\\STXINGKA.TTF",
        background_color='white',
    ).generate(text)
    image_produce = wordcloud.to_image()
    image_produce.show()

The main scraping script (test.py) puts everything together: it opens the Douban comments page, repeatedly extracts comment texts, clicks the next-page button, stops after ten pages, prints the collected text, and finally calls wordcloudutil.getWordCloud to visualize the data.
# -*- coding: utf-8 -*-
# @Time : 2021/6/27 22:29
# @Author : xiaow
# @File : test.py
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from api import wordcloudutil

if __name__ == '__main__':
    url = 'https://movie.douban.com/subject/34812928/comments?limit=20&status=P&sort=new_score'
    # Evade bot detection
    option = webdriver.ChromeOptions()
    # option.headless = True
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    option.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=option)
    driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
        'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
    })
    driver.get(url)
    text = ''
    j = 0
    while True:
        time.sleep(1)
        driver.switch_to.window(driver.window_handles[0])
        options = driver.find_elements(By.XPATH, '//span[@class="short"]')
        for i in options:
            text = text + i.text
        time.sleep(2)
        nextpage = driver.find_element(By.XPATH, '//*[@id="paginator"]/a')
        nextpage.click()
        j = j + 1
        if j > 10:
            break
    print(text)
    wordcloudutil.getWordCloud(text)

After execution, the collected comments are visualized as a word cloud image, demonstrating a simple yet effective pipeline from web scraping to data visualization.
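Since WordCloud tokenizes on whitespace and sizes words by frequency, the effect of the segmentation step can be sanity-checked with a plain frequency count. The sketch below uses a hypothetical, already-segmented string standing in for the output of trans_CN (jieba's exact segmentation depends on its dictionary, so it is not reproduced here).

```python
# A quick check of what WordCloud does with the segmented text: split
# on whitespace, count occurrences, size words by count. The input is
# a hypothetical pre-segmented string standing in for trans_CN output.
from collections import Counter

segmented = "剧情 紧凑 演技 在线 剧情 反转 结局 仓促 剧情 精彩"
counts = Counter(segmented.split())
print(counts.most_common(2))  # [('剧情', 3), ('紧凑', 1)]
```

Without the space-insertion step, the whole Chinese string would be treated as a handful of giant "words", which is why trans_CN runs before WordCloud.generate.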