How to Scrape QQ Music Hot Comments with Python Selenium and Generate Word Clouds

This tutorial walks you through using Python Selenium to automate QQ Music, scroll through the infinite‑scroll comment section, extract user avatars, names, timestamps and comment text, save the data to CSV, and finally visualize the comments with a word‑cloud image.

MaGe Linux Operations

Aug 20, 2021

1. Initial Test

First, verify the environment with Selenium:

from selenium import webdriver
import time
url = 'https://y.qq.com/n/ryqq/songDetail/0006wgUu1hHP0N'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(1)
driver.maximize_window()

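Note: Selenium starts Chrome with a fresh temporary profile by default, so a login made in your everyday browser is not carried over. To avoid login prompts, log into QQ Music manually in the Selenium-controlled window, or reuse a browser profile that is already logged in, so the session holds the necessary cookies.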
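
One way to keep a login across runs is to point Chrome at a dedicated profile directory, so that cookies from a manual login persist. The following is a minimal sketch under assumptions, not the original author's setup: build_profile_args and make_driver are hypothetical helper names, and ./qqmusic_profile is a placeholder path.

```python
from pathlib import Path


def build_profile_args(profile_dir):
    """Return the Chrome command-line flags for a persistent profile."""
    # --user-data-dir tells Chrome where to keep cookies and other state
    return [f"--user-data-dir={Path(profile_dir).resolve()}"]


def make_driver(profile_dir="./qqmusic_profile"):
    """Start Chrome with a persistent profile (requires selenium installed)."""
    # Imported lazily so the flag helper can be used without selenium
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in build_profile_args(profile_dir):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Log in once in the window opened by make_driver(). Later runs that reuse the same directory should start already logged in, provided the site's cookies have not expired.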

2. Page Analysis

The comment list is lazy-loaded (infinite scroll): new comments are appended as you scroll down the page, and the URL does not change. Each comment corresponds to an <li> element.
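
Matching an <li> on its full class attribute breaks as soon as the site adds or reorders a class name. As a hedged alternative, the sketch below builds an XPath that matches on a single class token. class_token_xpath is a hypothetical helper, and the token comment__list_item is taken from the class string used later in this tutorial.

```python
def class_token_xpath(tag, token):
    """Build an XPath that matches elements whose class list contains token."""
    # Padding with spaces makes the match token-exact: 'item' will not
    # match 'item_extra', and the order of class names no longer matters.
    return (
        f"//{tag}[contains(concat(' ', normalize-space(@class), ' '), "
        f"' {token} ')]"
    )


comment_xpath = class_token_xpath("li", "comment__list_item")
print(comment_xpath)
```

The resulting string can be passed to Selenium wherever an XPath locator is expected.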

3. Scroll Operation

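Repeatedly scroll to the bottom of the page until the desired number of comments has loaded:

from selenium.webdriver.common.by import By

num = int(input('Target number of comments: '))
# Note: this loop never ends if the song has fewer than num comments
while True:
    # Selenium 4 removed find_elements_by_xpath; use find_elements(By.XPATH, ...)
    items = driver.find_elements(By.XPATH, "//li[@class='comment__list_item c_b_normal']")
    print(len(items))
    if len(items) >= num:
        break
    # Scrolling to the bottom triggers loading of the next batch of comments
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)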
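
A scroll loop that only checks the target count can spin forever when the song has fewer comments than requested. The sketch below adds a stall counter that gives up after several scrolls load nothing new. scroll_until, max_stalls and pause are hypothetical names, and the two callables are injected so the logic can be tried without a browser.

```python
import time


def scroll_until(count_items, scroll_once, target, max_stalls=3, pause=2.0):
    """Scroll until target items are loaded or the count stops growing.

    count_items: callable returning the number of comments currently loaded
    scroll_once: callable that scrolls the page one step
    Returns the final item count.
    """
    stalls = 0
    count = count_items()
    while count < target and stalls < max_stalls:
        scroll_once()
        time.sleep(pause)  # give the page time to load the next batch
        new_count = count_items()
        # A scroll that loads nothing counts as a stall; progress resets it
        stalls = stalls + 1 if new_count == count else 0
        count = new_count
    return count


# With Selenium, the callables could look like this (assumes driver, By, xpath):
# count_items = lambda: len(driver.find_elements(By.XPATH, xpath))
# scroll_once = lambda: driver.execute_script(
#     "window.scrollTo(0, document.body.scrollHeight);")
```

The choice of three stalls is arbitrary. A slow connection may need a larger value or a longer pause.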

4. Parse Page

Extract avatar URL, nickname, comment time and content from each <li>:

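from selenium.webdriver.common.by import By

info_list = []
for index, item in enumerate(items):
    try:
        headPortraits = item.find_element(By.XPATH, "./div[1]/a/img").get_attribute('src')
        name = item.find_element(By.XPATH, "./div[1]/h4/a").text
        # Named comment_time so it does not shadow the imported time module
        comment_time = item.find_element(By.XPATH, "./div[1]/div[1]").text
        # Remove line breaks so each comment stays on one CSV line
        content = item.find_element(By.XPATH, "./div[1]/p/span").text.replace('\n', '')
        dic = {'headPor': headPortraits, 'name': name, 'time': comment_time, 'cont': content}
        print(index + 1)
        print(dic)
        info_list.append(dic)
    except Exception as e:
        # Skip comments whose markup does not match the expected structure
        print(e)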
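
Wrapping all four lookups in one try block drops the whole comment when a single field is missing. A possible refinement, sketched here with hypothetical helpers safe_field and parse_comment, is to guard each field separately and fall back to an empty string. The getters below are stand-ins with made-up values, not real page data.

```python
def safe_field(getter, default=""):
    """Run getter() and return its result, or default if it raises."""
    try:
        return getter()
    except Exception:
        return default


def parse_comment(fields):
    """Build one record from a dict of field name -> zero-argument getter."""
    return {key: safe_field(getter) for key, getter in fields.items()}


# Stand-in getters; with Selenium each would wrap a find_element call, e.g.
# lambda: item.find_element(By.XPATH, "./div[1]/h4/a").text
def missing_avatar():
    raise LookupError("no <img> in this comment")


record = parse_comment({
    "headPor": missing_avatar,
    "name": lambda: "listener_01",
    "cont": lambda: "great song",
})
print(record)
```

The trade-off is that partially filled rows reach the CSV, so downstream code has to tolerate empty cells.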

5. Data Storage

Save the collected dictionaries to a CSV file using the csv module:

import csv
head = ['headPor', 'name', 'time', 'cont']
with open('bscxComment.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, head)
    writer.writeheader()
    writer.writerows(info_list)
    print('Write succeeded')
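
If the CSV will be opened in Excel, plain UTF-8 can show Chinese text as garbled characters, because Excel relies on a byte-order mark to detect the encoding. The sketch below is one way to handle that, using Python's utf-8-sig codec. save_comments is a hypothetical helper, and the demo row and temporary path are made up for illustration.

```python
import csv
import os
import tempfile


def save_comments(rows, path, fieldnames):
    """Write comment dicts to CSV with a UTF-8 byte-order mark."""
    # utf-8-sig prepends a BOM, which Excel uses to detect UTF-8
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Demo with a made-up row written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "demo_comments.csv")
save_comments([{"name": "listener_01", "cont": "好听"}], demo_path, ["name", "cont"])
with open(demo_path, "rb") as f:
    has_bom = f.read(3) == b"\xef\xbb\xbf"
print(has_bom)
```

If you adopt this, open the file with encoding='utf-8-sig' when reading it back so the mark is not treated as part of the first column name.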

6. Run Program

Execute the script, then open the generated bscxComment.csv to verify that the expected number of comments has been captured (roughly 5,000 in the original run).
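
Instead of checking the file by eye, the row count can be confirmed with a few lines of code. This is a minimal sketch: summarize_comments is a hypothetical helper, and the sample lines are invented to mimic the CSV layout from step 5.

```python
import csv


def summarize_comments(lines, text_column="cont"):
    """Return (total_rows, rows_with_empty_text) for CSV lines with a header."""
    total = 0
    empty = 0
    for row in csv.DictReader(lines):
        total += 1
        if not (row.get(text_column) or "").strip():
            empty += 1
    return total, empty


# Made-up sample; the second data row has an empty comment
sample = ["headPor,name,time,cont", "u1,a,t1,nice", "u2,b,t2,"]
print(summarize_comments(sample))

# With the real file (an open file object is also an iterable of lines):
# with open('bscxComment.csv', encoding='utf-8', newline='') as f:
#     print(summarize_comments(f))
```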

7. Word‑Cloud Visualization

Import required libraries and generate a word cloud from the comment text:

# Import libraries
import jieba
from PIL import Image
import numpy as np
import pandas as pd
from wordcloud import WordCloud

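# Load the comment data saved in step 5
data = pd.read_csv('bscxComment.csv')

# Write every comment to a text file, skipping non-string cells such as NaN.
# Mode 'w' overwrites the file so reruns do not duplicate the text.
with open('data.txt', encoding='utf-8', mode='w') as f:
    for item in data['cont']:
        if isinstance(item, str):
            f.write(item + '\n')
    print('Write succeeded!')

# Prepare text for word cloud: segment with jieba, then join with spaces
with open('./data.txt', encoding='utf-8') as f:
    text = f.read()
text_cut = jieba.lcut(text)
space_word_list = ' '.join(text_cut)

# cat.png is the mask image that defines the shape of the cloud
mask_pic = np.array(Image.open('./cat.png'))
# font_path must point to a font with Chinese glyphs; this path is Windows-specific
word = WordCloud(font_path='C:/Windows/Fonts/simfang.ttf',
                mask=mask_pic,
                background_color='white',
                max_font_size=150,
                max_words=2000,
                stopwords={'的'}).generate(space_word_list)
word.to_file('bsx.png')
word.to_image().show()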
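
Before rendering, it can help to look at the most frequent tokens and decide which ones belong in the stopword set. The sketch below uses collections.Counter for that. top_words is a hypothetical helper, and sample_tokens is a made-up stand-in for the list returned by jieba.lcut.

```python
from collections import Counter


def top_words(tokens, stopwords=frozenset(), min_length=2, n=10):
    """Count tokens, skipping stopwords, whitespace and very short tokens."""
    cleaned = (t.strip() for t in tokens)
    counts = Counter(
        t for t in cleaned if len(t) >= min_length and t not in stopwords
    )
    return counts.most_common(n)


# Stand-in for the output of jieba.lcut(text); the tokens are made up
sample_tokens = ["好听", "的", "旋律", "好听", " ", "歌词", "好听", "旋律"]
print(top_words(sample_tokens, stopwords={"的"}))
```

Frequent but uninformative tokens that show up here are candidates for the stopwords argument of WordCloud.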

8. Conclusion

Because Selenium drives a real browser, it renders the dynamically loaded comments and sidesteps many simple anti-scraping measures, which makes collecting QQ Music comment data straightforward. Be mindful of login requirements and occasional parsing errors, and feel free to experiment further.
