
Python Data Preprocessing and Visualization of Jay Chou Lyrics: From JSON to Word Cloud

This tutorial demonstrates how to convert a JSON lyric database into Excel, filter Jay Chou songs, perform Chinese word segmentation with Jieba, compute word frequencies, and create visualizations such as word clouds using Python code and online tools.

The example uses a Chinese lyric database stored in JSON format. First, the JSON file can be converted to a CSV or XLSX file with an online converter, then filtered in Excel to keep only songs by Jay Chou and export the lyrics to plain text.
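The online-converter route can also be sketched directly in pandas. The record structure below ('singer' and 'lyric' fields) mirrors the JSON used later in this tutorial; the sample data and file name are placeholders, not the real database:

```python
import pandas as pd

# Placeholder records mirroring the JSON fields ('singer', 'lyric')
# used later in this tutorial; the real database is much larger.
records = [
    {'singer': '周杰伦', 'lyric': '...'},
    {'singer': '其他歌手', 'lyric': '...'},
]
df = pd.DataFrame(records)

# Keep only Jay Chou's songs, then export for inspection in Excel.
df_zjl = df[df['singer'] == '周杰伦']
df_zjl.to_csv('zjl_songs.csv', index=False)
```

This replaces the manual Excel filtering step; `to_excel` could be used instead of `to_csv` if the openpyxl package is installed.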

Alternatively, the entire preprocessing can be scripted in Python. The required libraries are imported with:

import json
import jieba
import pandas as pd
from collections import Counter

The JSON file is loaded:

with open('lyrics.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

Jay Chou entries are extracted:

data_zjl = [item for item in data if item['singer'] == '周杰伦']
print(len(data_zjl))

All lyrics are collected into a list and written to a text file:

Zjl_lyrics = []
for song in data_zjl:
    Zjl_lyrics.append(song['lyric'])

with open('zjl_lyrics.txt', 'w', encoding='utf-8') as outfile:
    outfile.write('\n'.join(Zjl_lyrics))

For word segmentation, Jieba, pandas, and Counter are used. A stop‑word list is loaded first:

with open('chinese_stop_words.txt', encoding='utf-8') as f:
    stopwords = [line.strip() for line in f]

The lyric file is read, segmented, cleaned, and frequency counted:

text = open('zjl_lyrics.txt', encoding='utf-8').read()
words = jieba.lcut(text, cut_all=False, use_paddle=True)
words = [w.strip() for w in words]
words = [w for w in words if w and w not in stopwords]
words_filter = [w for w in words if len(w) > 1]
df = pd.DataFrame.from_dict(Counter(words_filter), orient='index').reset_index()
df = df.rename(columns={'index': 'words', 0: 'count'})
df.to_excel('周杰伦分词结果.xlsx', index=False)

The resulting Excel file contains each word and its frequency, which can be visualized. A word cloud is generated with the WordCloud library:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(font_path='Alibaba-PuHuiTi-Regular.ttf', background_color='white', max_words=2000)
wc.generate(' '.join(words_filter))
plt.figure(figsize=(12, 10), dpi=300)
plt.imshow(wc)
plt.axis('off')
plt.show()

Alternatively, online tools such as 微词云, 易词云, or 图悦 can import the Excel file (with columns named "单词" (word) and "词频" (frequency)) to produce customizable word clouds, allowing users to filter by part of speech and adjust font sizes, colors, and mask shapes (e.g., a portrait of Jay Chou).
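Before uploading, the frequency table can be renamed to match the headers these tools expect. A minimal sketch, assuming the 'words'/'count' columns produced earlier (the sample rows here are stand-ins for the real lyric data):

```python
import pandas as pd

# Stand-in for the frequency table produced earlier,
# which has columns 'words' and 'count'.
df = pd.DataFrame({'words': ['晴天', '回忆', '微笑'], 'count': [12, 8, 5]})

# Online word-cloud tools expect the headers 单词 (word) and
# 词频 (frequency), so rename the columns before exporting.
df_upload = df.rename(columns={'words': '单词', 'count': '词频'})
df_upload.to_csv('upload.csv', index=False)
```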

Other visualizations like proportional circles can be created by importing the frequency data into chart tools and generating pie‑style diagrams.
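A pie-style chart of the top words can also be produced locally with matplotlib. The frequencies below are hypothetical; in practice, the 周杰伦分词结果.xlsx file produced earlier would be loaded and the largest counts taken:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render off-screen; no display window needed
import matplotlib.pyplot as plt

# Hypothetical word frequencies standing in for the real results.
df = pd.DataFrame({'words': ['离开', '微笑', '世界', '回忆'],
                   'count': [30, 22, 18, 10]})
top = df.nlargest(3, 'count')

plt.figure(figsize=(6, 6))
plt.pie(top['count'], labels=top['words'], autopct='%1.1f%%')
plt.title('Top word frequencies')
plt.savefig('top_words_pie.png', dpi=150)
```

Note that a Chinese-capable font (set via matplotlib's font configuration) is needed for the labels to render correctly.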

The guide emphasizes that different segmentation methods may yield varying results, so readers should compare approaches to choose the most suitable one for their analysis.

Tags: data preprocessing, visualization, pandas, jieba, text analysis, WordCloud
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
