Creating Chinese Word Clouds with Python: Using Jieba and WordCloud
This tutorial explains how to install and use the Jieba segmentation library and the WordCloud package in Python to process Chinese text, customize dictionaries and stopwords, and generate visually appealing word cloud images based on a mask picture.
WordCloud is a Python library that visualizes text by arranging words into an image, making it easy to grasp the main themes of a document.
First, install the Jieba segmentation library:
<code>pip install jieba</code>
Jieba offers three segmentation modes: full mode (cuts all possible words, fast but ambiguous), precise mode (most accurate, suitable for analysis), and search engine mode (precise mode plus additional cuts for better recall).
Example code demonstrating the three modes:
<code>import jieba
text = "哈利波特是非常优秀的文学作品"
# Full mode
seg_list = jieba.cut(text, cut_all=True)
print("[全模式]: ", "/ ".join(seg_list))
# Precise mode
seg_list = jieba.cut(text, cut_all=False)
print("[精确模式]: ", "/ ".join(seg_list))
# Default (precise) mode
seg_list = jieba.cut(text)
print("[默认模式]: ", "/ ".join(seg_list))
# Search engine mode
seg_list = jieba.cut_for_search(text)
print("[搜索引擎模式]: ", "/ ".join(seg_list))</code>
Because Jieba’s dictionary may not contain proper nouns like “哈利波特”, you can add a custom dictionary:
<code>jieba.load_userdict("/home/jmhao/anaconda3/lib/python3.7/site-packages/jieba/mydict.txt")</code>
After loading the custom dictionary, the term “哈利波特” is correctly recognized.
You can also define stopwords to filter out unwanted terms:
<code>stopwords = {}.fromkeys(['优秀', '文学作品'])
seg_list = jieba.cut(text)
final = ''
for seg in seg_list:
    if seg not in stopwords:
        final += seg
seg_list_new = jieba.cut(final)
print("[切割之后]: ", "/ ".join(seg_list_new))</code>
Install the WordCloud package (Anaconda already includes most dependencies; otherwise install numpy and Pillow as well):
<code>pip install wordcloud</code>
Prepare the text file (e.g., xiaoshuo.txt) containing the Chinese novel excerpt, read it, and segment it using Jieba:
<code># Read and segment text
with open("xiaoshuo.txt", encoding="utf-8") as fp:
    text = fp.read()
text = cut(text)  # cut() is a helper function using jieba, defined in the full code below</code>
Load a mask image (white‑background picture) to shape the word cloud:
<code>mask = np.array(image.open("monkey.jpeg"))</code>
Full code to generate the word cloud:
<code># Import libraries
from wordcloud import WordCloud
import PIL.Image as image
import numpy as np
import jieba

# Helper function: segment the text with jieba (full mode) and join with spaces
def cut(text):
    word_list = jieba.cut(text, cut_all=True)
    return " ".join(word_list)

# Read source text
with open("xiaoshuo.txt", encoding="utf-8") as fp:
    text = fp.read()
text = cut(text)

# Set mask image
mask = np.array(image.open("monkey.jpeg"))

# Create WordCloud object
wordcloud = WordCloud(
    mask=mask,
    background_color='#FFFFFF',
    font_path="/usr/share/fonts/bb5828/逐浪雅宋体.otf"
).generate(text)

# Save and display
image_produce = wordcloud.to_image()
wordcloud.to_file("new_wordcloud.jpg")
image_produce.show()</code>
Note: The mask image should have a white background (or be processed to white); otherwise the cloud fills the entire rectangle.
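One way to do that processing, assuming the stray background pixels are merely near-white rather than pure white, is to push them all the way to 255 before handing the array to WordCloud, which treats pure-white mask pixels as masked out. A rough sketch; the function name and the threshold of 230 are arbitrary choices:

```python
import numpy as np
import PIL.Image as image

def whiten_background(path, threshold=230):
    """Force near-white pixels to pure white (255) so WordCloud masks them out."""
    mask = np.array(image.open(path).convert("L"))
    mask[mask > threshold] = 255
    return mask

# mask = whiten_background("monkey.jpeg")  # then pass mask=mask to WordCloud(...)
```

For masks with colored or dark backgrounds you would instead select the background by its actual color range before setting it to 255.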
Original novel page (example image):
Resulting word cloud: