Scraping QQ Music Hot Comments with Selenium and Visualizing with Word Cloud in Python
This tutorial demonstrates how to use Python Selenium to collect hot comments from a QQ Music song page: handle infinite scrolling, extract user avatar URLs, nicknames, comment times, and comment text, store the data in a CSV file, and visualize it with a Chinese word cloud.
1. Initial Test – Verify the Selenium environment by opening the target URL, maximizing the window, and pausing briefly.
from selenium import webdriver
import time

url = 'https://y.qq.com/n/ryqq/songDetail/0006wgUu1hHP0N'
driver = webdriver.Chrome()  # requires a matching ChromeDriver on the PATH
driver.get(url)
time.sleep(1)
driver.maximize_window()

2. Page Analysis – The comment section uses a waterfall (infinite-scroll) layout that loads more items as the scrollbar moves down; the page URL never changes, so Selenium must drive the scrolling to load additional comments. Each comment corresponds to one li element.
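To make the li-to-fields mapping concrete before automating it, here is a minimal offline sketch that parses a simplified stand-in for one comment li with the same relative XPaths used later. The sample markup is an assumption for illustration; the real QQ Music page structure may differ in detail.

```python
# Parse a simplified stand-in for one comment <li> using the same
# relative XPaths the Selenium code applies later. The markup below is
# illustrative only; the live page's structure may differ.
import xml.etree.ElementTree as ET

sample_li = """
<li class="comment__list_item c_b_normal">
  <div>
    <a href="#"><img src="https://example.com/avatar.jpg"/></a>
    <h4><a href="#">SomeUser</a></h4>
    <div>2021-05-20</div>
    <p><span>Great song!
second line</span></p>
  </div>
</li>
"""

li = ET.fromstring(sample_li)
record = {
    'headPor': li.find('./div[1]/a/img').get('src'),          # avatar URL
    'name':    li.find('./div[1]/h4/a').text,                 # nickname
    'time':    li.find('./div[1]/div[1]').text,               # comment time
    'cont':    li.find('./div[1]/p/span').text.replace('\n', ''),  # text, newlines stripped
}
print(record)
```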
3. Scroll Wheel Operation – Loop until the desired number of comments is reached, scrolling to the bottom and waiting for data to load.
num = int(input('Enter target comment count: '))
keep_scrolling = True
while keep_scrolling:
    # Selenium 3 style; Selenium 4 uses driver.find_elements(By.XPATH, ...)
    items = driver.find_elements_by_xpath("//li[@class='comment__list_item c_b_normal']")
    print(len(items))
    if len(items) < num:
        # Scroll to the bottom to trigger loading of the next batch
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
    else:
        keep_scrolling = False

4. Parse Page – Iterate over each li element, extract the avatar URL, nickname, time, and comment content, strip newline characters, and append each record as a dictionary to a list.
info_list = []
for index, item in enumerate(items):
    dic = {}
    try:
        head_portrait = item.find_element_by_xpath("./div[1]/a/img").get_attribute('src')
        name = item.find_element_by_xpath("./div[1]/h4/a").text
        # Avoid calling this variable `time`, which would shadow the time module
        comment_time = item.find_element_by_xpath("./div[1]/div[1]").text
        content = item.find_element_by_xpath("./div[1]/p/span").text.replace('\n', '')
        dic['headPor'] = head_portrait
        dic['name'] = name
        dic['time'] = comment_time
        dic['cont'] = content
        print(index + 1)
        print(dic)
        info_list.append(dic)
    except Exception as e:
        print(e)

5. Data Storage – Write the list of dictionaries to a CSV file using Python's csv module.
import csv

head = ['headPor', 'name', 'time', 'cont']
with open('bscxComment.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, head)
    writer.writeheader()
    writer.writerows(info_list)
    print('写入成功')  # "written successfully"

6. Run the Program – Execute the script, watch the scrolling and data collection, then open the generated CSV to verify that the expected number of comments has been captured.
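The word-cloud step below reads a plain-text data.txt file, but the tutorial does not show how that file is produced from the CSV. A minimal bridging sketch (the helper name comments_to_text is an assumption, not part of the original script) that pulls the cont column out of the CSV written above:

```python
# Bridge the CSV output to the word-cloud input: extract the 'cont'
# (comment text) column and write one comment per line to a text file.
# The helper name is illustrative; it is not part of the original tutorial.
import csv

def comments_to_text(csv_path, txt_path):
    """Concatenate the 'cont' column of the comment CSV into a text file."""
    with open(csv_path, encoding='utf-8', newline='') as f:
        comments = [row['cont'] for row in csv.DictReader(f)]
    with open(txt_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(comments))
    return len(comments)

# Usage: comments_to_text('bscxComment.csv', 'data.txt')
```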
7. Word Cloud Visualization – Import jieba, PIL, numpy, and wordcloud; load the comment text, perform Chinese word segmentation, and generate a word cloud shaped by a mask image.
# Import libraries
import jieba
from PIL import Image
import numpy as np
from wordcloud import WordCloud

# Load the comment text
text = open("./data.txt", encoding='utf-8').read()

# Chinese word segmentation; WordCloud expects space-separated tokens
text_cut = jieba.lcut(text)
text_cut = ' '.join(text_cut)

# Prepare the mask image that shapes the word cloud
mask_pic = np.array(Image.open("./cat.png"))

word = WordCloud(font_path='C:/Windows/Fonts/simfang.ttf',  # a Chinese font is required
                 mask=mask_pic,
                 background_color='white',
                 max_font_size=150,
                 max_words=2000,
                 stopwords={'的'}).generate(text_cut)
image = word.to_image()
word.to_file('bsx.png')
image.show()

8. Summary – Using Selenium to simulate human browsing can bypass some anti-scraping measures and efficiently gather large volumes of QQ Music comments; the collected data can then be analyzed and visualized, for example with a word cloud.
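Beyond the word cloud, a simple frequency count is another quick way to analyze the collected comments. A plain-Python sketch (the token list here is a hand-made sample standing in for jieba.lcut output on real comment text):

```python
# Top-words-by-frequency analysis with collections.Counter.
# The token list is a hand-made sample; in the pipeline above it would
# come from jieba.lcut() applied to the scraped comment text.
from collections import Counter

tokens = ['好听', '好听', '青春', '回忆', '青春', '好听']
stopwords = {'的'}
top = Counter(t for t in tokens if t not in stopwords).most_common(2)
print(top)  # [('好听', 3), ('青春', 2)]
```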