Creating Chinese Word Clouds with Python: Using Jieba and WordCloud
This tutorial explains how to install and use the Jieba segmentation library and the WordCloud package in Python to process Chinese text, customize dictionaries and stopwords, and generate visually appealing word cloud images based on a mask picture.
WordCloud is a Python library that visualizes text by arranging words into an image, making it easy to grasp the main themes of a document.
First, install the Jieba segmentation library:
<code>pip install jieba</code>
Jieba offers three segmentation modes: full mode (cuts all possible words, fast but ambiguous), precise mode (most accurate, suitable for analysis), and search engine mode (precise mode plus additional cuts for better recall).
Example code demonstrating the three modes:
<code>import jieba
text = "哈利波特是非常优秀的文学作品"
# Full mode
seg_list = jieba.cut(text, cut_all=True)
print("[全模式]: ", "/ ".join(seg_list))
# Precise mode
seg_list = jieba.cut(text, cut_all=False)
print("[精确模式]: ", "/ ".join(seg_list))
# Default (precise) mode
seg_list = jieba.cut(text)
print("[默认模式]: ", "/ ".join(seg_list))
# Search engine mode
seg_list = jieba.cut_for_search(text)
print("[搜索引擎模式]: ", "/ ".join(seg_list))</code>
Because Jieba’s dictionary may not contain proper nouns like “哈利波特”, you can add a custom dictionary:
<code>jieba.load_userdict("/home/jmhao/anaconda3/lib/python3.7/site-packages/jieba/mydict.txt")</code>
After loading the custom dictionary, the term “哈利波特” is correctly recognized.
You can also define stopwords to filter out unwanted terms:
<code>stopwords = {}.fromkeys(['优秀', '文学作品'])
seg_list = jieba.cut(text)
final = ''
for seg in seg_list:
    if seg not in stopwords:
        final += seg
seg_list_new = jieba.cut(final)
print("[切割之后]: ", "/ ".join(seg_list_new))</code>
Install the WordCloud package (Anaconda already includes most dependencies; otherwise install numpy and Pillow as well):
<code>pip install wordcloud</code>
Prepare the text file (e.g., xiaoshuo.txt) containing the Chinese novel excerpt, read it, and segment it using Jieba:
<code># Read and segment text
with open("xiaoshuo.txt", encoding="utf-8") as fp:
    text = fp.read()
text = cut(text)  # cut() is a helper function using jieba, defined in the full code below</code>
Load a mask image (white‑background picture) to shape the word cloud:
<code>mask = np.array(image.open("monkey.jpeg"))</code>
Full code to generate the word cloud:
<code># Import libraries
from wordcloud import WordCloud
import PIL.Image as image
import numpy as np
import jieba

# Helper function: segment the text with jieba (full mode) and join with spaces
def cut(text):
    word_list = jieba.cut(text, cut_all=True)
    return " ".join(word_list)

# Read source text
with open("xiaoshuo.txt", encoding="utf-8") as fp:
    text = fp.read()
text = cut(text)

# Set mask image
mask = np.array(image.open("monkey.jpeg"))

# Create WordCloud object
wordcloud = WordCloud(
    mask=mask,
    background_color='#FFFFFF',
    font_path="/usr/share/fonts/bb5828/逐浪雅宋体.otf"
).generate(text)

# Save and display
image_produce = wordcloud.to_image()
wordcloud.to_file("new_wordcloud.jpg")
image_produce.show()</code>
Note: The mask image should have a white background (or be processed to white); otherwise the cloud fills the entire rectangle.
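One way to do that processing, assuming the stray background pixels are merely near-white rather than pure white, is to push them all the way to 255 before handing the array to WordCloud, which treats pure-white mask pixels as masked out. A rough sketch; the function name and the threshold of 230 are arbitrary choices:

```python
import numpy as np
import PIL.Image as image

def whiten_background(path, threshold=230):
    """Force near-white pixels to pure white (255) so WordCloud masks them out."""
    mask = np.array(image.open(path).convert("L"))
    mask[mask > threshold] = 255
    return mask

# mask = whiten_background("monkey.jpeg")  # then pass mask=mask to WordCloud(...)
```

For masks with colored or dark backgrounds you would instead select the background by its actual color range before setting it to 255.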
Original novel page (example image):
Resulting word cloud: