What Can NetEase Music Comments Reveal? A Python Scraping & Visualization Guide
This article walks through collecting NetEase Cloud Music hot comments from NetEase's public API with Python, cleaning the data with pandas, and visualizing comment timing, user activity, word frequency, age distribution, regional distribution, and gender ratio with pyecharts and wordcloud.
Preface
Recently a student asked for help with a visualization assignment that required scraping 10,000 hot comments from NetEase Cloud Music, followed by data analysis and a report. The project combines web crawling, data processing, and visualization.
Data Source
The comments come from NetEase's public API, so the crawling itself is not difficult. This article skips the crawler and works directly from the CSV file it produced (music_comments.csv).
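Each analysis section repeats the same load-and-clean step, so it may help to see it in isolation. The following is a minimal sketch, assuming the CSV has no header row and uses the ten columns named in the article's `read_csv` calls. The `load_comments` helper and the in-memory sample rows are hypothetical and exist only for illustration.

```python
import io
import pandas as pd

# Column order taken from the read_csv calls in this article
COLUMNS = ['name', 'userid', 'age', 'gender', 'city',
           'text', 'comment', 'commentid', 'praise', 'date']

def load_comments(source, **read_kwargs):
    """Read the headerless comment CSV, then drop duplicate and incomplete rows.

    `source` may be a file path or a file-like object. Extra keyword
    arguments (for example encoding='utf-8-sig') go to pd.read_csv.
    """
    df = pd.read_csv(source, header=None, names=COLUMNS, **read_kwargs)
    df = df.drop_duplicates('commentid')
    return df.dropna()

# Hypothetical sample: row 2 repeats a commentid, row 3 has no age
sample = io.StringIO(
    "a,1,20,1,110101,x,hello,100,5,2019-01-01 20:15:00\n"
    "a,1,20,1,110101,x,hello,100,5,2019-01-01 20:15:00\n"
    "b,2,,2,440100,y,hi,101,3,2019-01-02 08:30:00\n"
    "c,3,25,2,510100,z,nice,102,9,2019-01-02 21:05:00\n"
)
clean = load_comments(sample)
print(len(clean))
```

For the file on disk, the equivalent call would be `load_comments('music_comments.csv', encoding='utf-8-sig')`.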
Analysis Process
Time Distribution
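The following code extracts the hour from the comment timestamp, groups by hour, and draws a line chart. The pyecharts imports in this article follow the 0.5.x API (`from pyecharts import Line`); from version 1.0 on, chart classes live in `pyecharts.charts` and the options interface is different.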
import pandas as pd
from pyecharts import Line
# Read data
df = pd.read_csv('music_comments.csv', header=None, names=['name','userid','age','gender','city','text','comment','commentid','praise','date'], encoding='utf-8-sig')
# Remove duplicate comments
df = df.drop_duplicates('commentid')
df = df.dropna()
# Extract hour
df['time'] = [int(i.split(' ')[1].split(':')[0]) for i in df['date']]
# Group by hour
date_message = df.groupby(['time'])
date_com = date_message['time'].agg(['count']).reset_index()
# Plot line chart
attr = date_com['time']
v1 = date_com['count']
line = Line("Comment Time Distribution", title_pos='center', title_top='18', width=800, height=400)
line.add('', attr, v1, is_smooth=True, is_fill=True, area_color="#000", is_xaxislabel_align=True, xaxis_min="dataMin", area_opacity=0.3, mark_point=["max"], mark_point_symbol="pin", mark_point_symbolsize=55)
line.render("time_distribution.html")
The resulting chart shows that users tend to comment in the late afternoon and evening.
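The hour extraction above relies on string splitting, which breaks if the timestamp layout changes. As an alternative sketch (not the article's method), pandas can parse the timestamps and count comments per hour, filling in hours that have no comments so the x-axis always covers 0 to 23. The sample timestamps are made up and assume a `date HH:MM:SS` layout like the one the original code implies.

```python
import pandas as pd

# Made-up timestamps standing in for the 'date' column
dates = pd.Series([
    '2019-01-01 20:15:00',
    '2019-01-01 20:45:10',
    '2019-01-02 08:30:00',
    '2019-01-02 21:05:30',
])

# Parse the strings and pull out the hour of day (0-23)
hours = pd.to_datetime(dates).dt.hour

# Count comments per hour; reindex so hours with no comments appear as 0
hourly = hours.value_counts().reindex(range(24), fill_value=0)

print(hourly[20])        # comments posted between 20:00 and 20:59
print(hourly.idxmax())   # busiest hour
```

The index and values of `hourly` can then be passed to the chart in place of `attr` and `v1`.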
User Comment Count
This snippet groups comments by user ID and lists the top 10 most active commenters.
import pandas as pd
# Read data
df = pd.read_csv('music_comments.csv', header=None, names=['name','userid','age','gender','city','text','comment','commentid','praise','date'], encoding='utf-8-sig')
df = df.drop_duplicates('commentid').dropna()
# Group by user
user_message = df.groupby(['userid'])
user_com = user_message['userid'].agg(['count']).reset_index()
user_com_last = user_com.sort_values('count', ascending=False).head(10)
print(user_com_last)
The output reveals a few super-fans who have posted hundreds of comments.
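The group, count, and sort steps can also be collapsed into a single `value_counts` call. This is a shorter equivalent offered as a sketch, run here on made-up user IDs rather than the real data.

```python
import pandas as pd

# Made-up user IDs standing in for the 'userid' column
userid = pd.Series([1, 2, 1, 3, 1, 2])

# value_counts() groups, counts, and sorts in descending order in one call
top_users = userid.value_counts().head(10)
print(top_users)
```

The result is a Series indexed by user ID, so `top_users.index` gives the IDs and `top_users.values` the comment counts.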
Word Cloud
A standard word‑cloud pipeline is used, with a custom color function and a background image.
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import pandas as pd
import random
import jieba
# Random color function
def random_color_func(word=None, font_size=None, position=None, orientation=None, font_path=None, random_state=None):
    # Pick one of three fixed HSL colors for each word
    h, s, l = random.choice([(188, 72, 53), (253, 63, 56), (12, 78, 69)])
    return "hsl({}, {}%, {}%)".format(h, s, l)
# Load data
df = pd.read_csv('music_comments.csv', header=None, names=['name','userid','age','gender','city','text','comment','commentid','praise','date'], encoding='utf-8-sig')
df = df.drop_duplicates('commentid').dropna()
# Load stopwords
words = pd.read_csv('chineseStopWords.txt', encoding='gbk', sep='\t', names=['stopword'])
# Tokenize
text = ''
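for line in df['comment']:
    # The trailing space keeps the last token of one comment from fusing with the first token of the next
    text += ' '.join(jieba.cut(str(line), cut_all=False)) + ' '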
# Stopwords
stopwords = set()
stopwords.update(words['stopword'])
# Background image
background = plt.imread('music.jpg')
wc = WordCloud(background_color='white', mask=background, font_path='FZSTK.TTF', max_words=2000, max_font_size=250, min_font_size=15, color_func=random_color_func, prefer_horizontal=1, random_state=50, stopwords=stopwords)
wc.generate_from_text(text)
plt.imshow(wc)
plt.axis('off')
wc.to_file('netease_wordcloud.jpg')
print('Word cloud generated!')
The generated word cloud highlights the most frequent terms in the comments.
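A word cloud sizes each term by how often it appears once stopwords are removed. The counting step can be sketched with the standard library alone. Here the tokens are pre-split English words rather than `jieba` output, and both the token list and the stopword set are invented for illustration.

```python
from collections import Counter

# Invented tokens standing in for the output of jieba.cut over all comments
tokens = ['love', 'this', 'song', 'the', 'song', 'love', 'love', 'the']

# Invented stopword set standing in for chineseStopWords.txt
stopwords = {'the', 'this'}

# Keep only tokens that are not stopwords, then tally them
freq = Counter(t for t in tokens if t not in stopwords)

# The most frequent terms are the ones a word cloud would draw largest
print(freq.most_common(2))
```

A mapping like `freq` can be passed to `WordCloud.generate_from_frequencies` instead of raw text, which makes the filtering step explicit and easy to inspect.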
User Age Distribution
Using the same preprocessing, a bar chart of age groups is produced (image omitted for brevity).
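The article does not include the age-group code. One way to build the groups, offered as a sketch rather than the author's implementation, is to bin the `age` column with `pd.cut`. The bin edges, labels, and sample ages below are all assumptions chosen for illustration.

```python
import pandas as pd

# Made-up ages standing in for the 'age' column
ages = pd.Series([17, 19, 22, 24, 24, 28, 33, 41])

# Assumed bin edges and labels; right=False makes each bin include its lower edge
edges = [0, 18, 25, 30, 40, 120]
labels = ['under 18', '18-24', '25-29', '30-39', '40+']
groups = pd.cut(ages, bins=edges, labels=labels, right=False)

# Count per group and sort by the category order so the bars plot youngest first
age_counts = groups.value_counts().sort_index()
print(age_counts)
```

The labels and counts can then be fed to a pyecharts bar chart in the same way `attr` and `v1` are passed to the line chart in the time distribution section.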
Regional Distribution
A map of China visualizes comment counts by province. The code maps city codes to province names and draws a choropleth map with pyecharts.
import pandas as pd
from pyecharts import Map
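def city_group(cityCode):
    # The first two digits of the city code identify the province.
    # Province names stay in Chinese because the pyecharts 'china' map matches regions by these names.
    city_map = {
        '11':'北京','12':'天津','31':'上海','50':'重庆','81':'香港','82':'澳门','13':'河北','14':'山西','15':'内蒙古','21':'辽宁','22':'吉林','23':'黑龙江','32':'江苏','33':'浙江','34':'安徽','35':'福建','36':'江西','37':'山东','41':'河南','42':'湖北','43':'湖南','44':'广东','45':'广西','46':'海南','51':'四川','52':'贵州','53':'云南','54':'西藏','61':'陕西','62':'甘肃','63':'青海','64':'宁夏','65':'新疆','71':'台湾','10':'其他'}
    # Fall back to '其他' (other) instead of raising KeyError on an unrecognized prefix
    return city_map.get(str(cityCode)[:2], '其他')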
# Load data
df = pd.read_csv('music_comments.csv', header=None, names=['name','userid','age','gender','city','text','comment','commentid','praise','date'], encoding='utf-8-sig')
df = df.drop_duplicates('commentid').dropna()
# Map province
df['location'] = df['city'].apply(city_group)
# Group by province
loc_message = df.groupby(['location'])
loc_com = loc_message['location'].agg(['count']).reset_index()
value = list(loc_com['count'])
attr = list(loc_com['location'])
# Named china_map rather than map to avoid shadowing Python's built-in map()
china_map = Map("Commenter Regional Distribution", title_pos='center', title_top=0)
china_map.add('', attr, value, maptype='china', is_visualmap=True, visual_text_color='#000', is_map_symbol_show=False, visual_range=[0,60])
china_map.render('regional_distribution.html')
The map shows that Sichuan and Guangdong have the highest comment volumes.
Gender Ratio
A simple bar chart (image omitted) indicates that female fans dominate the comment pool.
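The gender chart's code is also omitted. A ratio like this can be computed with `value_counts(normalize=True)`, as in the sketch below. The numeric encoding of the `gender` column (0 for unknown, 1 for male, 2 for female) is an assumption the article does not confirm, and the sample values are made up.

```python
import pandas as pd

# Made-up values standing in for the 'gender' column
gender = pd.Series([2, 2, 1, 0, 2, 1, 2, 2])

# Assumed encoding (not confirmed by the article): 0 = unknown, 1 = male, 2 = female
labels = {0: 'unknown', 1: 'male', 2: 'female'}

# normalize=True returns each group's share of the total instead of raw counts
share = gender.map(labels).value_counts(normalize=True)
print(share.round(3))
```

Keeping the unknown group visible, rather than dropping it, shows how much of the ratio rests on users who actually disclosed a gender.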
Conclusion
This tutorial demonstrates how to fetch NetEase Cloud Music hot comments via an API, clean and analyze the data with pandas, and create various visualizations with pyecharts, revealing patterns such as peak commenting times, active users, popular words, age and regional demographics, and gender distribution.
