
Processing Chinese Lyrics Data with Python: From JSON Extraction to Word Cloud Visualization

This tutorial demonstrates how to preprocess a Chinese lyrics JSON dataset, extract Jay Chou's songs using Python, perform word segmentation with Jieba, compute word frequencies, and create visualizations such as word clouds both programmatically and with online tools.

Python Programming Learning Circle

The case study uses a Chinese lyrics database stored in JSON format, focusing on extracting songs by singer Jay Chou and preparing the data for visualization.

Data preprocessing can be done in two ways: convert the JSON file to CSV/Excel with an online converter, or use Python to filter the JSON, collect the lyrics, and save them to a plain text file.

Python code for the extraction process:

```python
import json

# Load the full lyrics database and keep only Jay Chou's songs.
with open('lyrics.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

data_zjl = [item for item in data if item['singer'] == '周杰伦']
print(len(data_zjl))

# Each song's 'lyric' field is a list of lines; concatenate them all.
zjl_lyrics = []
for song in data_zjl:
    zjl_lyrics += song['lyric']

with open('zjl_lyrics.txt', 'w', encoding='utf-8') as outfile:
    outfile.write('\n'.join(zjl_lyrics))
```

After obtaining the lyrics text, word segmentation is performed with the jieba library. Stop-words, whitespace tokens, and single-character tokens are filtered out, then frequencies are counted with `Counter` and the results are saved to an Excel file.

```python
import jieba
import pandas as pd
from collections import Counter

# Load the stop-word list (one word per line).
with open('chinese_stop_words.txt', encoding='utf-8') as f:
    stopwords = [line.strip() for line in f]

text = open('zjl_lyrics.txt', encoding='utf-8').read()

# Precise-mode segmentation; use_paddle=True requires the paddlepaddle package.
words = jieba.lcut(text, cut_all=False, use_paddle=True)
# Drop stop-words, whitespace-only tokens, and single-character tokens.
words = [w for w in words if w not in stopwords and w.strip() and len(w) > 1]

# Count frequencies and save to Excel (writing .xlsx requires openpyxl).
df = pd.DataFrame.from_dict(Counter(words), orient='index').reset_index()
df = df.rename(columns={'index': 'words', 0: 'count'})
df.to_excel('周杰伦分词结果.xlsx')
```

The resulting word frequencies can be visualized with a word cloud using the wordcloud library:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Chinese text needs a font file that contains CJK glyphs.
wc = WordCloud(font_path='Alibaba-PuHuiTi-Regular.ttf',
               background_color='white', max_words=2000)
wc.generate(' '.join(words))  # 'words' is the filtered token list from above

plt.imshow(wc)
plt.axis('off')
plt.show()
```

Alternatively, online tools such as 微词云 (an online word-cloud generator) allow uploading the Excel file to produce customizable word clouds, with options for shape masks, word categories, and frequency-based sizing.

Different segmentation methods may yield varying results, so users should compare approaches to select the most suitable one for their analysis.

Tags: NLP, data preprocessing, visualization, jieba, WordCloud
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
