Scrape QQ Music Hot Comments with Python Selenium and Visualize Them
This tutorial demonstrates how to use Python and Selenium to collect thousands of hot comments from QQ Music, parse the data from dynamically loaded list items, store it in CSV files, and finally generate a word‑cloud visualization of the comment content.
Introduction
Today we present a Python script that uses Selenium to collect hot comments from QQ Music. The target song has over 10,000 comments, making it a good example for large‑scale data extraction.
1. Initial Test
First, we verify the environment with Selenium by opening the song page, waiting briefly, and maximizing the browser window.
from selenium import webdriver
import time
url = 'https://y.qq.com/n/ryqq/songDetail/0006wgUu1hHP0N'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(1)
driver.maximize_window()

Note: To avoid login prompts, manually log in to QQ Music beforehand so that the browser retains the necessary cookies.
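Manual login works for a one-off run; to avoid repeating it, one option is to persist the session cookies to disk and re-inject them on the next run. Below is a minimal sketch of the save/load helpers (the file name cookies.json and the sample cookie are our own illustrative choices; with a live driver you would pass driver.get_cookies() in, and call driver.add_cookie() on each loaded entry after opening the site):

```python
import json

def save_cookies(cookie_list, path='cookies.json'):
    """Persist a list of cookie dicts (e.g. from driver.get_cookies()) to disk."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(cookie_list, f)

def load_cookies(path='cookies.json'):
    """Load the cookie dicts back; feed each one to driver.add_cookie()."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)

# Demonstration with a sample cookie dict (a real run would use driver.get_cookies()):
sample = [{'name': 'uin', 'value': 'o0123456789', 'domain': '.qq.com'}]
save_cookies(sample)
print(load_cookies())
```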
2. Page Analysis
The comment list uses an infinite‑scroll (waterfall) layout; new comments load as the right‑hand scrollbar moves, while the URL remains unchanged. Each comment corresponds to an li element, so the scraping strategy is to scroll to the bottom repeatedly, monitor the number of li elements, and stop when the desired count is reached.
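The stop condition can be isolated in a small pure helper. The sketch below goes slightly beyond the article's simple count check by also stopping when the count stops growing for several consecutive checks (a guard of our own, useful when a song has fewer comments than requested); the Selenium loop in section 3 would call it with the current li count:

```python
def keep_scrolling(current_count, target, history, max_stalls=3):
    """Return True while more scrolling is needed.

    history is a mutable list of previous counts; scrolling stops when the
    target is reached or the count has not grown for max_stalls checks.
    """
    if current_count >= target:
        return False
    history.append(current_count)
    recent = history[-max_stalls:]
    # Stalled: the last max_stalls observations are all identical.
    if len(recent) == max_stalls and len(set(recent)) == 1:
        return False
    return True

history = []
print(keep_scrolling(100, 1000, history))  # True: below target, still growing
print(keep_scrolling(100, 1000, history))  # True
print(keep_scrolling(100, 1000, history))  # False: count stalled at 100
```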
3. Scroll Operation
We loop the scroll action until the number of comment items reaches the target, pausing briefly after each scroll to allow data to load.
from selenium.webdriver.common.by import By

num = int(input('Enter the target number of comments: '))
_single = True
while _single:
    items = driver.find_elements(By.XPATH, "//li[@class='comment__list_item c_b_normal']")
    print(len(items))
    if len(items) < num:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # pause so newly loaded comments can render
    else:
        _single = False

4. Parse Page
For each li element we extract the avatar URL, user name, comment time, and comment text, clean newline characters, and store the information in a dictionary.
info_list = []
for index, item in enumerate(items):
    dic = {}
    try:
        headPortraits = item.find_element(By.XPATH, "./div[1]/a/img").get_attribute('src')
        name = item.find_element(By.XPATH, "./div[1]/h4/a").text
        comment_time = item.find_element(By.XPATH, "./div[1]/div[1]").text  # renamed so it does not shadow the time module
        content = item.find_element(By.XPATH, "./div[1]/p/span").text.replace('\n', '')
        dic['headPor'] = headPortraits
        dic['name'] = name
        dic['time'] = comment_time
        dic['cont'] = content
        print(index + 1)
        print(dic)
        info_list.append(dic)
    except Exception as e:
        print(e)

5. Data Storage
The collected dictionaries are written to a CSV file using the csv module.
import csv
head = ['headPor', 'name', 'time', 'cont']
with open('bscxComment.csv', 'w', encoding='utf-8', newline='') as f:
writer = csv.DictWriter(f, head)
writer.writeheader()
writer.writerows(info_list)
print('Write complete')

6. Word-Cloud Visualization
After extracting the comment text, we clean it, perform Chinese word segmentation with jieba, and generate a word cloud shaped by a mask image.
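The word-cloud code below reads ./data.txt, so the comment text has to be exported first. A minimal bridge from the CSV written in section 5 (file names bscxComment.csv and data.txt match those used in this article; the two demo rows are purely illustrative):

```python
import csv

def export_comment_text(csv_path, txt_path):
    """Copy the 'cont' column of the scraped CSV into a plain-text file,
    one comment per line, ready for jieba segmentation."""
    with open(csv_path, encoding='utf-8', newline='') as f:
        comments = [row['cont'] for row in csv.DictReader(f)]
    with open(txt_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(comments))
    return len(comments)

# Demonstration with two illustrative rows (a real run uses the scraped CSV):
with open('bscxComment.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, ['headPor', 'name', 'time', 'cont'])
    writer.writeheader()
    writer.writerows([
        {'headPor': 'u1.png', 'name': 'A', 'time': '2023-01-01', 'cont': '好听'},
        {'headPor': 'u2.png', 'name': 'B', 'time': '2023-01-02', 'cont': '单曲循环'},
    ])
print(export_comment_text('bscxComment.csv', 'data.txt'))  # 2
```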
# Import libraries
import jieba
from PIL import Image
import numpy as np
from wordcloud import WordCloud
# Load comment text
with open("./data.txt", encoding='utf-8') as f:
    text = f.read()
text_cut = jieba.lcut(text)
text_cut = ' '.join(text_cut)
# Load mask image
mask_pic = np.array(Image.open("./cat.png"))
word = WordCloud(font_path='C:/Windows/Fonts/simfang.ttf',
mask=mask_pic,
background_color='white',
max_font_size=150,
max_words=2000,
stopwords={'的'}).generate(text_cut)
image = word.to_image()
word.to_file('bsx.png')
image.show()

Conclusion
Using Selenium to simulate human browsing allows us to bypass many anti‑scraping measures and efficiently collect large volumes of QQ Music comments. While the method works well, attention to login handling and error catching is essential for reliable data acquisition.