Python Data Preprocessing and Visualization of Jay Chou Lyrics: From JSON to Word Cloud
This tutorial demonstrates how to convert a JSON lyric database into Excel, filter Jay Chou songs, perform Chinese word segmentation with Jieba, compute word frequencies, and create visualizations such as word clouds using Python code and online tools.
The example uses a Chinese lyric database stored in JSON format. First, the JSON file can be converted to a CSV or XLSX file with an online converter, then filtered in Excel to keep only songs by Jay Chou and export the lyrics to plain text.
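For readers who prefer to skip the online converter, the same convert-and-filter step can be scripted with pandas. A minimal sketch, assuming records with the singer/song/lyric fields used later in the tutorial (the sample data and output file name are illustrative):

```python
import pandas as pd

# Illustrative records mirroring the lyric database's fields (not the real data)
records = [
    {'singer': '周杰伦', 'song': '晴天', 'lyric': '（歌词文本）'},
    {'singer': '其他歌手', 'song': '某首歌', 'lyric': '（歌词文本）'},
]

df = pd.DataFrame(records)
zjl = df[df['singer'] == '周杰伦']        # same filter as the Excel step
zjl.to_csv('zjl_songs.csv', index=False)  # .to_excel() also works, with openpyxl installed
```

This reproduces the Excel filter in one line and writes the Jay Chou subset out for later steps.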
Alternatively, the entire preprocessing can be scripted in Python. The json module is imported first:
```python
import json
```

The JSON file is loaded (UTF-8 is assumed for the Chinese text):

```python
with open('lyrics.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
```

Jay Chou's entries are extracted by filtering on the singer field:

```python
data_zjl = [item for item in data if item['singer'] == '周杰伦']
print(len(data_zjl))  # number of Jay Chou songs found
```

All lyrics are collected into a list and written to a text file:
```python
zjl_lyrics = []
for song in data_zjl:
    zjl_lyrics.append(song['lyric'])  # append each song's lyric as one item

with open('zjl_lyrics.txt', 'w', encoding='utf-8') as outfile:
    outfile.write('\n'.join(zjl_lyrics))
```

For word segmentation, Jieba, pandas, and Counter are used. A stop-word list is loaded first:
```python
import jieba
import pandas as pd
from collections import Counter

with open('chinese_stop_words.txt', encoding='utf-8') as f:
    stopwords = [line.strip() for line in f]
```

The lyric file is read, segmented, cleaned, and the word frequencies are counted:
```python
text = open('zjl_lyrics.txt', encoding='utf-8').read()
# Precise mode (cut_all=False) is the default; paddle mode (use_paddle=True)
# additionally requires the paddlepaddle package
words = jieba.lcut(text)
words = [w.strip() for w in words]
words = [w for w in words if w and w not in stopwords]
words_filter = [w for w in words if len(w) > 1]  # drop single-character tokens

df = pd.DataFrame.from_dict(Counter(words_filter), orient='index').reset_index()
df = df.rename(columns={'index': 'words', 0: 'count'})
df.to_excel('周杰伦分词结果.xlsx', index=False)
```

The resulting Excel file lists each word with its frequency, ready for visualization. A word cloud is generated with the WordCloud library:
```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# A font with Chinese glyphs is required, or the words render as boxes
wc = WordCloud(font_path='Alibaba-PuHuiTi-Regular.ttf',
               background_color='white', max_words=2000)
wc.generate(' '.join(words_filter))

plt.figure(figsize=(12, 10), dpi=300)  # create the figure before drawing into it
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
```

Alternatively, online tools such as 微词云, 易词云, or 图悦 can import the Excel file (with columns named "单词" (word) and "词频" (frequency)) to produce customizable word clouds, letting users filter by part of speech and adjust font size, colors, and mask shape (e.g., a portrait of Jay Chou).
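The exported spreadsheet can be reshaped to match the headers those tools expect. A minimal sketch, assuming a small stand-in token list in place of the real segmentation output:

```python
import pandas as pd
from collections import Counter

# Stand-in tokens in place of the real segmentation output
words_filter = ['晴天', '晴天', '龙卷风', '回忆', '回忆', '回忆']

df = pd.DataFrame.from_dict(Counter(words_filter), orient='index').reset_index()
df = df.rename(columns={'index': '单词', 0: '词频'})  # headers the online tools expect
df = df.sort_values('词频', ascending=False).reset_index(drop=True)
```

Calling df.to_excel('wordcloud_input.xlsx', index=False) then writes a sheet that can be uploaded directly; the file name is arbitrary.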
Other visualizations, such as proportional-circle (bubble) charts, can be created by importing the frequency data into charting tools and generating pie-style diagrams.
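Proportional circles can also be drawn locally with matplotlib. A minimal sketch, assuming made-up frequencies for a few words; a CJK-capable font must be configured (e.g., via matplotlib.rcParams) for the Chinese labels to render:

```python
import matplotlib
matplotlib.use('Agg')                      # render off-screen, no display needed
import matplotlib.pyplot as plt

# Made-up frequencies standing in for the real word counts
top_words = {'回忆': 120, '离开': 95, '微笑': 80, '雨': 60}

labels = list(top_words)
counts = list(top_words.values())
sizes = [c * 30 for c in counts]           # circle area proportional to frequency

fig, ax = plt.subplots(figsize=(6, 3))
ax.scatter(range(len(labels)), [0] * len(labels), s=sizes, alpha=0.5)
for x, (word, c) in enumerate(top_words.items()):
    ax.annotate(f'{word} ({c})', (x, 0), ha='center', va='center')
ax.axis('off')
fig.savefig('word_bubbles.png', dpi=150)
```

Scaling the marker size by the count is what makes each circle's area track the word's frequency; the multiplier (30 here) only controls the overall scale of the plot.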
The guide emphasizes that different segmentation methods may yield varying results, so readers should compare approaches to choose the most suitable one for their analysis.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.