How to Scrape QQ Music Hot Comments with Python Selenium and Generate Word Clouds
This tutorial walks you through using Python Selenium to automate QQ Music, scroll through the infinite‑scroll comment section, extract user avatars, names, timestamps and comment text, save the data to CSV, and finally visualize the comments with a word‑cloud image.
1. Initial Test
First, verify the environment with Selenium:
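from selenium import webdriver
import time

url = 'https://y.qq.com/n/ryqq/songDetail/0006wgUu1hHP0N'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(1)  # give the page a moment to start loading
driver.maximize_window()
Note: QQ Music may show a login prompt. A plain webdriver.Chrome() session starts with a fresh temporary profile, so cookies from your everyday browser are not carried over. Either log in manually in the Selenium-controlled window before continuing, or start Chrome with an existing user profile so the necessary cookies are available.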
2. Page Analysis
The comment list is an infinite-scroll feed (a "waterfall" layout): new comments are appended as the page scrolls down, and the URL does not change. Each comment corresponds to an <li> element.
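To make the structure concrete, here is a minimal offline sketch. It assumes a hypothetical, heavily simplified fragment of the comment list (the real markup is more complex and may change) and counts the comment <li> elements with a class-based XPath, using the standard library's xml.etree.ElementTree, which supports only a subset of XPath. The names SAMPLE and count_comment_items are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the comment list markup.
SAMPLE = """
<ul>
  <li class="comment__list_item c_b_normal"><p>first comment</p></li>
  <li class="comment__list_item c_b_normal"><p>second comment</p></li>
  <li class="other"><p>not a comment</p></li>
</ul>
"""


def count_comment_items(markup: str) -> int:
    """Count <li> elements whose class attribute matches the comment class exactly."""
    root = ET.fromstring(markup.strip())
    return len(root.findall(".//li[@class='comment__list_item c_b_normal']"))


print(count_comment_items(SAMPLE))  # 2
```

Note that the match is on the exact class string, so an <li> with extra or reordered class names would not be counted.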
3. Scroll Operation
Continuously scroll to the bottom until the desired number of comments is loaded:
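from selenium.webdriver.common.by import By

num = int(input('Enter the target number of comments: '))  # target comment count
_single = True  # loop flag: keep scrolling while True
while _single:
    # Each loaded comment is an <li> with this exact class attribute
    items = driver.find_elements(By.XPATH, "//li[@class='comment__list_item c_b_normal']")
    print(len(items))
    if len(items) < num:
        # Not enough comments yet: jump to the bottom to trigger the next batch
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
    else:
        _single = False
4. Parse Page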
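Before parsing, one caveat about the scroll loop in step 3: it never exits if the song has fewer comments than the requested target. The sketch below shows one possible guard, stopping once the item count has failed to grow for several consecutive scrolls. The names scroll_until, count_items, scroll_once and max_stalls are hypothetical, and FakePage only simulates a page so the logic can be run without a browser.

```python
import time
from typing import Callable


def scroll_until(count_items: Callable[[], int],
                 scroll_once: Callable[[], None],
                 target: int,
                 max_stalls: int = 3,
                 pause: float = 2.0) -> int:
    """Scroll until `target` items are loaded or the count stops growing."""
    stalls = 0
    loaded = count_items()
    while loaded < target and stalls < max_stalls:
        scroll_once()
        time.sleep(pause)  # wait for the next batch to load
        current = count_items()
        # Count consecutive scrolls that loaded nothing new
        stalls = stalls + 1 if current <= loaded else 0
        loaded = max(loaded, current)
    return loaded


class FakePage:
    """Simulated page that loads `batch` items per scroll, up to `total`."""

    def __init__(self, total: int, batch: int) -> None:
        self.total, self.batch, self.loaded = total, batch, 0

    def count(self) -> int:
        return self.loaded

    def scroll(self) -> None:
        self.loaded = min(self.total, self.loaded + self.batch)


page = FakePage(total=45, batch=20)
print(scroll_until(page.count, page.scroll, target=100, pause=0))  # 45
```

With Selenium, count_items could wrap len(driver.find_elements(By.XPATH, ...)) and scroll_once could wrap the driver.execute_script(...) call from step 3.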
Extract avatar URL, nickname, comment time and content from each <li>:
from selenium.webdriver.common.by import By

info_list = []
for index, item in enumerate(items):
    try:
        headPortraits = item.find_element(By.XPATH, "./div[1]/a/img").get_attribute('src')
        name = item.find_element(By.XPATH, "./div[1]/h4/a").text
        # Named comment_time so it does not shadow the imported time module
        comment_time = item.find_element(By.XPATH, "./div[1]/div[1]").text
        content = item.find_element(By.XPATH, "./div[1]/p/span").text.replace('\n', '')
        dic = {'headPor': headPortraits, 'name': name, 'time': comment_time, 'cont': content}
        print(index + 1)
        print(dic)
        info_list.append(dic)
    except Exception as e:
        # Skip comments whose markup does not match the expected layout
        print(e)
5. Data Storage
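Before writing rows out, it may help to normalize the scraped text, since comments can contain line breaks and runs of whitespace that make a CSV harder to read. The sketch below is one possible approach; clean_comment is a hypothetical helper, and unlike the newline removal in step 4 it replaces each run of whitespace with a single space rather than deleting it.

```python
import re


def clean_comment(text: str) -> str:
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


print(clean_comment("  great song\n\nlove it  "))  # great song love it
```

For Chinese text, where words are not separated by spaces, deleting the newlines outright may be preferable; this variant suits mixed or English comments.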
Save the collected dictionaries to a CSV file using the csv module:
import csv

head = ['headPor', 'name', 'time', 'cont']  # column order; keys match the dictionaries from step 4
with open('bscxComment.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, head)
    writer.writeheader()
    writer.writerows(info_list)
print('Write succeeded')
6. Run Program
Execute the script, enter the target comment count when prompted, then open the generated bscxComment.csv to verify the result; in the original run, roughly 5,000 comments were captured.
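Instead of checking the file by eye, the verification can be scripted. The following is a minimal sketch, assuming the column names from step 5; summarize_comments is a hypothetical helper, and the demo writes a tiny throwaway file so the block runs on its own.

```python
import csv
import os
import tempfile


def summarize_comments(path: str) -> dict:
    """Return row, unique-comment and empty-comment counts for a comments CSV."""
    with open(path, encoding="utf-8", newline="") as f:
        rows = list(csv.DictReader(f))
    return {
        "rows": len(rows),
        "unique_comments": len({row["cont"] for row in rows}),
        "empty_comments": sum(1 for row in rows if not row["cont"].strip()),
    }


# Demo on a small temporary file with the same header as bscxComment.csv
fd, demo_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, ["headPor", "name", "time", "cont"])
    writer.writeheader()
    writer.writerows([
        {"headPor": "", "name": "a", "time": "t1", "cont": "nice"},
        {"headPor": "", "name": "b", "time": "t2", "cont": "nice"},
        {"headPor": "", "name": "c", "time": "t3", "cont": ""},
    ])
summary = summarize_comments(demo_path)
os.remove(demo_path)
print(summary)
```

Calling summarize_comments('bscxComment.csv') after a real run gives a quick check that the row count is close to the target and shows how many comments are duplicated or empty.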
7. Word‑Cloud Visualization
Import required libraries and generate a word cloud from the comment text:
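# Import libraries
import jieba
from PIL import Image
import numpy as np
import pandas as pd
from wordcloud import WordCloud
# Load comment data (ensure the 'cont' column is a string)
# Assumes the CSV written in step 5
data = pd.read_csv('bscxComment.csv', encoding='utf-8')
# Mode 'w' overwrites data.txt so that re-running does not duplicate comments
with open('data.txt', encoding='utf-8', mode='w') as f:
    for item in data['cont']:
        if isinstance(item, str):  # skip NaN values from empty cells
            f.write(item)
print('Write succeeded!')
# Prepare text for word cloud
with open('./data.txt', encoding='utf-8') as f:
    text = f.read()
text_cut = jieba.lcut(text)  # segment the Chinese text into words
space_word_list = ' '.join(text_cut)
mask_pic = np.array(Image.open('./cat.png'))  # mask image that shapes the cloud
# font_path must point to a font with Chinese glyphs (this is a Windows path)
word = WordCloud(font_path='C:/Windows/Fonts/simfang.ttf',
                 mask=mask_pic,
                 background_color='white',
                 max_font_size=150,
                 max_words=2000,
                 stopwords={'的'}).generate(space_word_list)
word.to_file('bsx.png')
word.to_image().show()
8. Conclusion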
Because Selenium drives a real browser, the page loads comments just as it would for a human visitor, which gets past many anti-scraping measures that block plain HTTP clients and makes it practical to collect QQ Music comment data. Be mindful of login requirements and occasional parsing errors, though. Feel free to experiment further.
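As one direction for further experimentation, counting token frequencies before rendering shows which words will dominate the cloud and which ones might be worth adding to the stopword set. This is a minimal sketch using only the standard library; top_tokens is a hypothetical helper, and the sample list stands in for the output of jieba.lcut.

```python
from collections import Counter


def top_tokens(tokens, stopwords, n=10):
    """Return the n most frequent tokens, ignoring stopwords and blank tokens."""
    counts = Counter(t for t in tokens if t.strip() and t not in stopwords)
    return counts.most_common(n)


# Stand-in for a segmented comment text
sample_tokens = ["好听", "的", "歌", "好听", " ", "的", "旋律"]
print(top_tokens(sample_tokens, {"的"}, n=2))
```

If a filler word shows up near the top of the list, adding it to the stopwords set passed to WordCloud keeps it out of the image.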
MaGe Linux Operations