Turn Chinese Lyrics JSON into Word Clouds with Python

This guide walks you through extracting Jay Chou's lyrics from a Chinese lyrics JSON database, preprocessing the text with Python and Jieba, generating frequency tables, and visualizing the results as word clouds using both code and online tools.

Python Crawling & Data Mining

Apr 16, 2023

Data Source and Goal

The lyrics data come from a Chinese lyrics database stored in JSON format. The objective is to preprocess the data, extract Jay Chou's songs, perform word segmentation, and create visualizations such as word clouds.

Method 1: Convert JSON to Excel

Use an online JSON‑to‑CSV/Excel converter, open the file in Excel, filter the singer column for “周杰伦”, copy the lyrics, and save them as a plain text (.txt) file.
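If uploading the database to an online converter is not an option, the same conversion can be sketched locally with Python's standard library. This is a minimal sketch rather than the article's method: json_to_csv is a hypothetical helper, the name field is assumed for illustration (only singer and lyric appear in the code in this guide), and list-valued lyrics are joined into a single cell.

```python
import csv
import io


def json_to_csv(records, fieldnames, out):
    """Write a list of song dicts to `out` as CSV with the given columns."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    for record in records:
        row = dict(record)
        # Assumption: 'lyric' may be a list of lines; join it into one cell.
        if isinstance(row.get('lyric'), list):
            row['lyric'] = '\n'.join(row['lyric'])
        writer.writerow(row)


# Hypothetical records standing in for the output of json.load()
songs = [
    {'name': '示例歌曲一', 'singer': '周杰伦', 'lyric': ['第一行', '第二行']},
    {'name': '示例歌曲二', 'singer': '其他歌手', 'lyric': ['第三行']},
]
buffer = io.StringIO()
json_to_csv(songs, ['name', 'singer', 'lyric'], buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open('lyrics.csv', 'w', newline='', encoding='utf-8-sig') instead of the in-memory buffer; the utf-8-sig encoding adds a byte order mark that helps Excel detect UTF-8.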

Method 2: Python Pre‑processing

Below is the Python code that reads the JSON, filters for Jay Chou, and writes all lyrics to a text file.

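import json

# Read the lyrics database; the file is assumed to be UTF-8 encoded
with open('lyrics.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Keep only Jay Chou's songs
data_zjl = [item for item in data if item['singer'] == '周杰伦']
print(len(data_zjl))

# Each song's 'lyric' field is treated as a list of lines; merge them all
zjl_lyrics = []
for song in data_zjl:
    zjl_lyrics.extend(song['lyric'])

# Write one lyric line per row
with open('zjl_lyrics.txt', 'w', encoding='utf-8') as outfile:
    outfile.write('\n'.join(zjl_lyrics))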

The resulting zjl_lyrics.txt contains all of Jay Chou's lyrics (see Figure 1).
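The loop above works only if every song's lyric field is a list of lines. A minimal sketch that also tolerates a single multi-line string is shown below; collect_lyrics is a hypothetical helper, the sample records are placeholders, and the singer and lyric keys follow the code above.

```python
def collect_lyrics(records, singer):
    """Return all lyric lines for one singer as a flat list of strings."""
    lines = []
    for song in records:
        if song.get('singer') != singer:
            continue
        lyric = song.get('lyric', [])
        # Assumption: 'lyric' is either a list of lines or one string.
        if isinstance(lyric, str):
            lyric = lyric.splitlines()
        # Skip blank lines and trim surrounding whitespace
        lines.extend(line.strip() for line in lyric if line.strip())
    return lines


# Placeholder records, not real lyrics
sample = [
    {'singer': '周杰伦', 'lyric': ['第一行歌词', '第二行歌词']},
    {'singer': '周杰伦', 'lyric': '第三行歌词\n\n第四行歌词'},
    {'singer': '其他歌手', 'lyric': ['第五行歌词']},
]
zjl_lines = collect_lyrics(sample, '周杰伦')
print(zjl_lines)
```

Using .get() also means a record with a missing field is skipped quietly instead of raising KeyError, which may or may not be what you want when auditing the database.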

Word Segmentation with Jieba

Install the required libraries (jieba and pandas, plus openpyxl so that pandas can write .xlsx files), then import them and load a Chinese stop‑words list.

import jieba
import jieba.analyse
import pandas as pd
from collections import Counter

# One stop word per line; the file is assumed to be UTF-8 encoded
with open('chinese_stop_words.txt', encoding='utf-8') as f:
    stopwords = [line.strip() for line in f]
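Because the stop‑word check runs once per token, a set is usually a better container than a list for large lyric files. The sketch below illustrates the idea on an in-memory file; load_stopwords and remove_stopwords are hypothetical helper names, and the one-word-per-line layout is an assumption about the stop‑words file.

```python
import io


def load_stopwords(stream):
    """Read one stop word per line into a set, skipping blank lines."""
    return {line.strip() for line in stream if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t not in stopwords]


# In-memory stand-in for chinese_stop_words.txt
demo_file = io.StringIO('的\n了\n\n在\n')
stop = load_stopwords(demo_file)
print(remove_stopwords(['我', '的', '歌', '在', '唱'], stop))
```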

Read the lyrics file, segment the text, remove stop‑words and symbols, and count word frequencies.

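# Read the lyrics written in the previous step
with open('zjl_lyrics.txt', encoding='utf-8') as f:
    text = f.read()

# Precise mode. Paddle mode needs paddlepaddle and a prior jieba.enable_paddle()
# call; without them jieba uses its default segmenter.
words = jieba.lcut(text, cut_all=False, use_paddle=True)
words = [w.strip() for w in words]  # trim whitespace before filtering
words = [w for w in words if w and w not in stopwords]  # drop empties and stop words
words_fifilter = [w for w in words if len(w) > 1]  # drop single-character tokens
df = pd.DataFrame.from_dict(Counter(words_fifilter), orient='index').reset_index()
df = df.rename(columns={'index': 'words', 0: 'count'})
df.to_excel('周杰伦分词结果.xlsx')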

The frequency table (Figure 2) can be used for further visualization.
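To sanity-check the counts before opening the spreadsheet, the cleaning and counting steps can be reproduced with the standard library alone. This is a sketch on a hand-made token list rather than real jieba output; count_words and its min_len parameter are hypothetical names.

```python
from collections import Counter


def count_words(tokens, stopwords=frozenset(), min_len=2):
    """Strip tokens, drop stop words and short tokens, then count the rest."""
    cleaned = (t.strip() for t in tokens)
    kept = [t for t in cleaned if len(t) >= min_len and t not in stopwords]
    return Counter(kept)


# Hand-made tokens standing in for the output of jieba.lcut()
tokens = ['天空', '的', '天空', ' ', '微笑', '我们', '微笑', '天空']
freq = count_words(tokens, stopwords={'我们'})
for word, count in freq.most_common(2):
    print(word, count)
```

Counter.most_common(n) returns the n highest counts in descending order, which is a quick way to confirm that the top of the frequency table looks plausible.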

Word Cloud Generation (Python)

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# A font with Chinese glyphs is required; otherwise characters render as empty boxes
wc = WordCloud(font_path='Alibaba-PuHuiTi-Regular.ttf', background_color='white', max_words=2000)
wc.generate(' '.join(words_fifilter))

# Create the figure before drawing so the size and DPI apply to the image
plt.figure(figsize=(12, 10), dpi=300)
plt.imshow(wc)
plt.axis('off')
plt.show()

The resulting word cloud is shown in Figure 3.
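WordCloud.generate() tokenizes the joined string again with its own rules. When the words have already been segmented and counted, generate_from_frequencies() accepts a word-to-count mapping directly. The sketch below shows one way to wire that up; top_frequencies and render_cloud are hypothetical helpers, and the font path you pass in must point to a font with Chinese glyphs.

```python
from collections import Counter


def top_frequencies(tokens, max_words=200):
    """Return the max_words most frequent tokens as a {word: count} dict."""
    return dict(Counter(tokens).most_common(max_words))


def render_cloud(frequencies, font_path, out_path):
    """Draw a word cloud from precomputed counts and save it as an image."""
    # Imported lazily so the counting helper works without wordcloud installed
    from wordcloud import WordCloud
    wc = WordCloud(font_path=font_path, background_color='white')
    wc.generate_from_frequencies(frequencies)
    wc.to_file(out_path)


# Hand-made tokens standing in for the segmented lyrics
tokens = ['天空', '微笑', '天空', '回忆', '天空', '微笑']
freqs = top_frequencies(tokens, max_words=2)
print(freqs)
```

Calling render_cloud(freqs, 'Alibaba-PuHuiTi-Regular.ttf', 'zjl_cloud.png') would then save the image; the output filename is a placeholder.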

Online Word‑Cloud Tools

Web tools such as 微词云, 易词云, and 图悦 also support Chinese word‑cloud creation. Using 微词云 as an example, upload the Excel file with “单词” and “词频” columns, choose “分词筛词后导入”, and generate the cloud (Figure 4).

Further customization options, including selecting word categories, adjusting the font size proportion, setting the number of words, and choosing a mask shape, allow fine‑tuned visual output (Figures 5 and 6).

Alternative Visualizations

Beyond word clouds, other charts such as proportional circles can display high‑frequency terms (Figure 7).
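For a proportional-circle chart, one common convention is to make each circle's area, not its radius, proportional to the word count, so the radius grows with the square root of the count. The sketch below computes radii under that assumption; circle_radii is a hypothetical helper, counts are assumed to be positive, and the plotting step is left to whichever charting tool you use.

```python
import math


def circle_radii(frequencies, max_radius=1.0):
    """Scale radii so that each circle's area is proportional to its count."""
    if not frequencies:
        return {}
    top = max(frequencies.values())
    # area ~ count, and area = pi * r^2, so r ~ sqrt(count)
    return {word: max_radius * math.sqrt(count / top)
            for word, count in frequencies.items()}


radii = circle_radii({'天空': 100, '微笑': 25}, max_radius=2.0)
print(radii)
```

Scaling by radius instead would make a word with four times the count look sixteen times larger by area, which exaggerates the differences between high‑frequency terms.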

Both code‑based and online approaches have trade‑offs; users should compare results to choose the best method for their needs.
