
What Can NetEase Music Comments Reveal? A Python Scraping & Visualization Guide

This article walks through collecting NetEase Cloud Music hot comments from the site's API with Python, cleaning the data with pandas, and visualizing comment timing, user activity, frequent words, age distribution, regional distribution, and gender ratio with pyecharts and wordcloud.

Python Crawling & Data Mining

Preface

Recently a student asked for help with a visualization assignment that required scraping 10,000 hot comments from NetEase Cloud Music, followed by data analysis and a report. The project combines web crawling, data processing, and visualization.

Data Source

The comments are fetched through NetEase's public API, so the crawling itself is straightforward. The analysis below starts directly from the resulting raw CSV file (music_comments.csv).
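Every snippet in this article repeats the same read-and-clean step. As a minimal sketch, that step could be wrapped in one helper. The function name `load_comments` is hypothetical (it does not appear in the original code), the column names are the ones used throughout the article, and the sample rows are invented so the example runs without the real file.

```python
import io
import pandas as pd

# Column names as used in the article's read_csv calls.
COLUMNS = ['name', 'userid', 'age', 'gender', 'city',
           'text', 'comment', 'commentid', 'praise', 'date']

def load_comments(source):
    """Read the headerless comments CSV, then drop duplicate comment IDs and incomplete rows."""
    df = pd.read_csv(source, header=None, names=COLUMNS, encoding='utf-8-sig')
    return df.drop_duplicates('commentid').dropna().reset_index(drop=True)

# Invented sample: row 2 repeats commentid 1, row 4 has an empty comment field.
sample = io.BytesIO(
    b"a,101,20,1,110101,x,nice song,1,5,2018-10-01 21:15:00\n"
    b"a,101,20,1,110101,x,nice song,1,5,2018-10-01 21:15:00\n"
    b"b,102,25,2,440100,y,love it,2,3,2018-10-01 22:40:00\n"
    b"c,103,30,1,510100,z,,3,0,2018-10-02 09:05:00\n"
)
clean = load_comments(sample)
print(len(clean))
```

With the real data, `load_comments('music_comments.csv')` would replace the repeated `read_csv` / `drop_duplicates` / `dropna` lines. Note that `dropna()` discards any row with a missing value in any column, so a commenter with no age or city on file is excluded from every chart.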

Analysis Process

Time Distribution

The following code extracts the hour from the comment timestamp, groups by hour, and draws a line chart.

import pandas as pd
# Note: this import style is the pyecharts 0.5.x API; from 1.0 on, chart classes live in pyecharts.charts
from pyecharts import Line

# Read data (the CSV has no header row, so column names are supplied here)
df = pd.read_csv('music_comments.csv', header=None, names=['name','userid','age','gender','city','text','comment','commentid','praise','date'], encoding='utf-8-sig')
# Remove duplicate comments and rows with missing values
df = df.drop_duplicates('commentid')
df = df.dropna()
# Extract hour (assumes each date has a space before the time, e.g. "2018-10-01 21:15:00")
df['time'] = [int(i.split(' ')[1].split(':')[0]) for i in df['date']]
# Count comments per hour
date_com = df.groupby('time').size().reset_index(name='count')
# Plot line chart
attr = list(date_com['time'])
v1 = list(date_com['count'])
line = Line("Comment Time Distribution", title_pos='center', title_top='18', width=800, height=400)
line.add('', attr, v1, is_smooth=True, is_fill=True, area_color="#000", is_xaxislabel_align=True, xaxis_min="dataMin", area_opacity=0.3, mark_point=["max"], mark_point_symbol="pin", mark_point_symbolsize=55)
line.render("time_distribution.html")

The resulting chart shows that users tend to comment in the late afternoon and evening.
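The hourly counts can also be checked without rendering a chart. The sketch below swaps the string splitting for `pd.to_datetime`, which is a substitution and not the article's method, and assumes timestamps of the form `YYYY-MM-DD HH:MM:SS`. The sample dates are invented.

```python
import pandas as pd

# Invented timestamps standing in for the 'date' column.
dates = pd.Series([
    "2018-10-01 21:15:00",
    "2018-10-01 21:48:30",
    "2018-10-02 09:05:10",
    "2018-10-02 23:59:59",
])

hours = pd.to_datetime(dates).dt.hour
# Count comments per hour; reindex so hours with no comments appear as 0.
hourly = hours.value_counts().reindex(range(24), fill_value=0)

peak_hour = int(hourly.idxmax())
print(peak_hour, int(hourly.loc[peak_hour]))
```

Filling the empty hours with zeros keeps the x-axis evenly spaced, so a quiet period shows up as a dip in the line and is not silently skipped.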

User Comment Count

This snippet groups comments by user ID and lists the top 10 most active commenters.

import pandas as pd

# Read data
df = pd.read_csv('music_comments.csv', header=None, names=['name','userid','age','gender','city','text','comment','commentid','praise','date'], encoding='utf-8-sig')
df = df.drop_duplicates('commentid').dropna()
# Group by user
user_message = df.groupby(['userid'])
user_com = user_message['userid'].agg(['count']).reset_index()
user_com_last = user_com.sort_values('count', ascending=False).head(10)
print(user_com_last)

The output reveals a few super‑fans who have posted hundreds of comments.
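A ranking of bare user IDs is hard to read. One possible variation, sketched here on invented data, carries the nickname along with the count. It assumes each user ID maps to a single nickname and simply takes the first one seen.

```python
import pandas as pd

# Invented rows standing in for the 'userid' and 'name' columns.
df = pd.DataFrame({
    "userid": [101, 102, 101, 103, 101, 102],
    "name":   ["a", "b", "a", "c", "a", "b"],
})

top = (
    df.groupby("userid")
      .agg(name=("name", "first"), count=("userid", "size"))
      .sort_values("count", ascending=False)
      .head(10)
      .reset_index()
)
print(top)
```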

Word Cloud

A standard word‑cloud pipeline is used, with a custom color function and a background image.

from wordcloud import WordCloud
import matplotlib.pyplot as plt
import pandas as pd
import random
import jieba

# Random color function
def random_color_func(word=None, font_size=None, position=None, orientation=None, font_path=None, random_state=None):
    h, s, l = random.choice([(188, 72, 53), (253, 63, 56), (12, 78, 69)])
    return "hsl({}, {}%, {}%)".format(h, s, l)

# Load data
df = pd.read_csv('music_comments.csv', header=None, names=['name','userid','age','gender','city','text','comment','commentid','praise','date'], encoding='utf-8-sig')
df = df.drop_duplicates('commentid').dropna()
# Load stopwords
words = pd.read_csv('chineseStopWords.txt', encoding='gbk', sep='\t', names=['stopword'])
# Tokenize each comment; join with spaces so words from adjacent comments do not run together
text = ' '.join(' '.join(jieba.cut(str(line), cut_all=False)) for line in df['comment'])
# Stopwords
stopwords = set()
stopwords.update(words['stopword'])
# Background image
background = plt.imread('music.jpg')
wc = WordCloud(background_color='white', mask=background, font_path='FZSTK.TTF', max_words=2000, max_font_size=250, min_font_size=15, color_func=random_color_func, prefer_horizontal=1, random_state=50, stopwords=stopwords)
wc.generate_from_text(text)
plt.imshow(wc)
plt.axis('off')
wc.to_file('netease_wordcloud.jpg')
print('Word cloud generated!')

The generated word cloud highlights the most frequent terms in the comments.
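To see the numbers behind the picture, the token frequencies can be counted directly. This is a minimal sketch using only the standard library. The helper name `top_words` is hypothetical, the sample text is invented, and dropping single-character tokens is an assumption made here, not something the article's code does.

```python
from collections import Counter

def top_words(text, stopwords, n=10):
    """Count whitespace-separated tokens, skipping stopwords and single characters."""
    tokens = [t for t in text.split() if t not in stopwords and len(t) > 1]
    return Counter(tokens).most_common(n)

# Invented, already-tokenized text; in the article this string is built with jieba.cut.
text = "喜欢 这 首 歌 喜欢 青春 的 歌 青春 回忆 喜欢"
stopwords = {"的", "这"}
result = top_words(text, stopwords, n=2)
print(result)
```

Passing the article's space-joined `text` and its `stopwords` set would list the terms that should appear largest in the cloud.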

User Age Distribution

Using the same preprocessing, a bar chart of age groups is produced (image omitted for brevity).
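The article does not show the code for this chart. A rough sketch of the grouping step, assuming the `age` column holds ages in years, could use `pd.cut`. The bin edges, labels, and sample ages below are all illustrative choices, not values from the original analysis.

```python
import pandas as pd

# Invented ages standing in for the 'age' column.
ages = pd.Series([17, 19, 22, 24, 25, 31, 38, 45])

# Left-closed bins: [0, 18), [18, 25), [25, 30), [30, 40), [40, 120).
bins = [0, 18, 25, 30, 40, 120]
labels = ["<18", "18-24", "25-29", "30-39", "40+"]
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

# sort_index() orders the result by bin, not by frequency.
age_counts = groups.value_counts().sort_index()
print(age_counts.to_dict())
```

The labels and counts can then be passed to a bar chart as the x-axis categories and values.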

Regional Distribution

A map of China visualizes comment counts by province. The code maps city codes to province names and draws a choropleth map with pyecharts.

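import pandas as pd
# Note: this import style is the pyecharts 0.5.x API; from 1.0 on, chart classes live in pyecharts.charts
from pyecharts import Map

# First two digits of the city code -> province name
CITY_MAP = {
    '11':'北京','12':'天津','31':'上海','50':'重庆','81':'香港','82':'澳门',
    '13':'河北','14':'山西','15':'内蒙古','21':'辽宁','22':'吉林','23':'黑龙江',
    '32':'江苏','33':'浙江','34':'安徽','35':'福建','36':'江西','37':'山东',
    '41':'河南','42':'湖北','43':'湖南','44':'广东','45':'广西','46':'海南',
    '51':'四川','52':'贵州','53':'云南','54':'西藏','61':'陕西','62':'甘肃',
    '63':'青海','64':'宁夏','65':'新疆','71':'台湾','10':'其他'}

def city_group(cityCode):
    # Unknown prefixes fall back to '其他' (other) instead of raising KeyError
    return CITY_MAP.get(str(cityCode)[:2], '其他')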

# Load data
df = pd.read_csv('music_comments.csv', header=None, names=['name','userid','age','gender','city','text','comment','commentid','praise','date'], encoding='utf-8-sig')
df = df.drop_duplicates('commentid').dropna()
# Map province
df['location'] = df['city'].apply(city_group)
# Group by province
loc_message = df.groupby(['location'])
loc_com = loc_message['location'].agg(['count']).reset_index()
value = list(loc_com['count'])
attr = list(loc_com['location'])
# Named china_map so the built-in map() is not shadowed
china_map = Map("Commenter Regional Distribution", title_pos='center', title_top=0)
china_map.add('', attr, value, maptype='china', is_visualmap=True, visual_text_color='#000', is_map_symbol_show=False, visual_range=[0,60])
china_map.render('regional_distribution.html')

The map shows that Sichuan and Guangdong have the highest comment volumes.

Gender Ratio

A simple bar chart (image omitted) indicates that female fans dominate the comment pool.
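The code for this chart is not shown either. The underlying ratio could be computed along these lines. How the `gender` column is coded in the CSV is not stated in the article, so the string labels below are placeholders, and the sample values are invented.

```python
import pandas as pd

# Invented values standing in for the 'gender' column.
gender = pd.Series(["female", "male", "female", "female", "male"])

# normalize=True returns each label's share of the total instead of raw counts.
share = gender.value_counts(normalize=True).round(2)
print(share.to_dict())
```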

Conclusion

This tutorial demonstrates how to fetch NetEase Cloud Music hot comments via an API, clean and analyze the data with pandas, and create various visualizations with pyecharts, revealing patterns such as peak commenting times, active users, popular words, age and regional demographics, and gender distribution.
