
Scrape QQ Music Hot Comments with Python Selenium and Visualize Them

This tutorial demonstrates how to use Python and Selenium to collect thousands of hot comments from QQ Music, parse the data from dynamically loaded list items, store it in CSV files, and finally generate a word‑cloud visualization of the comment content.

Python Programming Learning Circle

Introduction

Today we present a Python script that uses Selenium to collect hot comments from QQ Music. The target song has over 10,000 comments, making it a good example for large‑scale data extraction.

1. Initial Test

First, we verify the environment with Selenium by opening the song page, waiting briefly, and maximizing the browser window.

from selenium import webdriver
import time
url = 'https://y.qq.com/n/ryqq/songDetail/0006wgUu1hHP0N'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(1)
driver.maximize_window()
Note: To avoid login prompts, manually log in to QQ Music beforehand so that the browser retains the necessary cookies.
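One way to avoid logging in manually on every run is to persist the session cookies to disk after the first login and restore them later. The sketch below illustrates this; `save_cookies`, `load_cookies`, and the `cookies.pkl` path are illustrative names, not part of the original script.

```python
import pickle

def save_cookies(driver, path='cookies.pkl'):
    # Selenium returns cookies as a list of dicts, which pickles cleanly.
    with open(path, 'wb') as f:
        pickle.dump(driver.get_cookies(), f)

def load_cookies(driver, path='cookies.pkl'):
    # The target domain must already be open in the browser before adding
    # cookies, or Chrome will reject them as cross-domain.
    with open(path, 'rb') as f:
        for cookie in pickle.load(f):
            driver.add_cookie(cookie)
```

Call `save_cookies(driver)` once after logging in by hand; on later runs, open `y.qq.com`, call `load_cookies(driver)`, and refresh the page.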

2. Page Analysis

The comment list uses an infinite‑scroll (waterfall) layout; new comments load as the right‑hand scrollbar moves, while the URL remains unchanged. Each comment corresponds to an li element, so the scraping strategy is to scroll to the bottom repeatedly, monitor the number of li elements, and stop when the desired count is reached.

3. Scroll Operation

We loop the scroll action until the number of comment items reaches the target, pausing briefly after each scroll to allow data to load.

from selenium.webdriver.common.by import By

num = int(input('Enter the target number of comments: '))
keep_going = True
while keep_going:
    items = driver.find_elements(By.XPATH, "//li[@class='comment__list_item c_b_normal']")
    print(len(items))
    if len(items) < num:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # give newly loaded comments time to render
    else:
        keep_going = False
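One weakness of this loop is that it spins forever if the song has fewer comments than requested. The stop condition can be factored into a small helper that also bails out when a scroll produces no new items; `keep_scrolling` is an illustrative name, not from the original script.

```python
def keep_scrolling(current_count, target, previous_count=None):
    """Return True if another scroll is needed."""
    if current_count >= target:
        return False  # collected enough comments
    if previous_count is not None and current_count == previous_count:
        return False  # the last scroll loaded nothing new; stop
    return True
```

In the loop above, remember the previous `len(items)` between iterations and break when `keep_scrolling(...)` returns False.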

4. Parse Page

For each li element we extract the avatar URL, user name, comment time, and comment text, clean newline characters, and store the information in a dictionary.

from selenium.webdriver.common.by import By

info_list = []
for index, item in enumerate(items):
    dic = {}
    try:
        head_portrait = item.find_element(By.XPATH, "./div[1]/a/img").get_attribute('src')
        name = item.find_element(By.XPATH, "./div[1]/h4/a").text
        comment_time = item.find_element(By.XPATH, "./div[1]/div[1]").text  # avoid shadowing the time module
        content = item.find_element(By.XPATH, "./div[1]/p/span").text.replace('\n', '')
        dic['headPor'] = head_portrait
        dic['name'] = name
        dic['time'] = comment_time
        dic['cont'] = content
        print(index + 1)
        print(dic)
        info_list.append(dic)
    except Exception as e:
        print(e)
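The `replace('\n', '')` call only strips newline characters. A slightly more thorough cleaner also collapses the runs of spaces that the page layout leaves behind; this is a sketch, and `clean_comment` is an illustrative name.

```python
import re

def clean_comment(text):
    # Collapse newlines and any run of whitespace into a single space.
    return re.sub(r'\s+', ' ', text).strip()
```

Swapping this in for the bare `replace` keeps multi-line comments readable in the CSV.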

5. Data Storage

The collected dictionaries are written to a CSV file using the csv module.

import csv

head = ['headPor', 'name', 'time', 'cont']
with open('bscxComment.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, head)
    writer.writeheader()
    writer.writerows(info_list)
    print('Write complete')
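A quick sanity check after writing is to read the file back and confirm the row count matches what was scraped. This sketch only assumes the `bscxComment.csv` layout from the step above; `count_rows` is an illustrative name.

```python
import csv

def count_rows(path='bscxComment.csv'):
    # DictReader consumes the header line, so this counts data rows only.
    with open(path, encoding='utf-8', newline='') as f:
        return sum(1 for _ in csv.DictReader(f))
```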

6. Word‑Cloud Visualization

After extracting the comment text, we clean it, perform Chinese word segmentation with jieba, and generate a word cloud shaped by a mask image.
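The word-cloud code below reads from `./data.txt`, a file the earlier steps never create. One way to bridge the gap is to dump the `cont` column of the CSV from step 5 into that file; this is a sketch assuming that CSV layout, and `export_comments` is an illustrative name.

```python
import csv

def export_comments(csv_path='bscxComment.csv', txt_path='data.txt'):
    # Concatenate the 'cont' column into a plain-text file,
    # one comment per line, for the word-cloud step.
    with open(csv_path, encoding='utf-8', newline='') as f:
        comments = [row['cont'] for row in csv.DictReader(f)]
    with open(txt_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(comments))
```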

# Import libraries
import jieba
from PIL import Image
import numpy as np
from wordcloud import WordCloud

# Load comment text
with open("./data.txt", encoding='utf-8') as f:
    text = f.read()
text_cut = ' '.join(jieba.lcut(text))

# Load mask image
mask_pic = np.array(Image.open("./cat.png"))

word = WordCloud(font_path='C:/Windows/Fonts/simfang.ttf',
                 mask=mask_pic,
                 background_color='white',
                 max_font_size=150,
                 max_words=2000,
                 stopwords={'的'}).generate(text_cut)
image = word.to_image()
word.to_file('bsx.png')
image.show()

Conclusion

Using Selenium to simulate human browsing allows us to bypass many anti‑scraping measures and efficiently collect large volumes of QQ Music comments. While the method works well, attention to login handling and error catching is essential for reliable data acquisition.

Tags: Data Extraction, CSV, QQ Music, web scraping, Selenium, word cloud
Written by Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
