
Master Chinese Text Segmentation with jieba: Installation, Modes, and Advanced Tricks

This tutorial walks you through installing the jieba Python library, explains its three segmentation modes—precise, full, and search—demonstrates how to add or delete words, manage custom dictionaries, handle stop words, perform weight analysis, adjust word frequencies, and retrieve token positions, all with clear code examples and visual output.

Python Crawling & Data Mining

Jun 16, 2021

Preface

Hello, I am Huang Wei. The previous article covered word clouds. This one explores word segmentation with the jieba library, which supplies the Chinese tokenization that the wordcloud package lacks.

1. Using jieba

Installation

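To install jieba from a downloaded source package, extract it, open a command window in the package directory, and run: python setup.py install. The package is also published on PyPI, so pip install jieba works as well. After installation, the version information confirms that the library is available.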

2. jieba Segmentation Modes

Precise Mode

Precise mode, the default, cuts the text into the most likely sequence of non-overlapping words, with no redundant tokens. Two functions are available: cut(str) returns a generator, and lcut(str) returns a list.

Example:

import jieba
aa = jieba.cut('任性的90后boy')
print('/'.join(aa))

The generator aa yields the segmented result.

Full Mode

Full mode lists every dictionary word that can be found in the text, so the tokens may overlap. It is fast but does not resolve ambiguity. Functions:

lcut(str, cut_all=True)
cut(str, cut_all=True)

Search Engine Mode

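Search engine mode starts from the precise-mode result and then splits long words again into the shorter dictionary words they contain, which improves recall when building a search index. Functions: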

lcut_for_search(str)
cut_for_search(str)

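Because lcut returns an ordinary list, you can count how often a specific word occurs with the list's count method. Here ab is assumed to hold the lcut result for a text that mentions 武汉 once:

print(ab.count('武汉'))  # prints 1 for such a text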

3. Other Applications of jieba

1) Adding New Words

Custom words can be added with jieba.add_word(word, freq=None, tag=None) to improve the segmentation of names or phrases that the default dictionary does not contain.

2) Adding a Dictionary

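Load a user-defined dictionary with jieba.load_userdict(file), where file is a path or a file-like object. A file given by path must be UTF-8 encoded and contain one entry per line: the word, an optional frequency, and an optional part-of-speech tag, in that order and separated by spaces.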

# Example line: 新词 10 n
jieba.load_userdict('mydict.txt')

3) Deleting Words

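Unwanted words can be removed with jieba.del_word(word). jieba then stops treating the word as a single token, which restores the segmentation that applied before it was added.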

4) Handling Stop Words

Filter out common, non‑informative words (e.g., 的, 了, 哈哈) by maintaining a stop‑word list and skipping tokens that appear in it.

5) Weight Analysis

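Rank the keywords of a text with jieba.analyse.extract_tags, which requires import jieba.analyse. The ranking is based on TF-IDF weight rather than raw frequency, so a word that is common in everyday Chinese scores lower than an equally frequent but rarer word.

Parameters: topK specifies how many keywords to return (20 by default); withWeight=True returns each keyword together with its TF-IDF weight as a (word, weight) pair.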

6) Adjusting Word Frequency

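Frequencies that jieba computes automatically may not take effect when HMM new-word discovery is enabled, so pass HMM=False when you tune them. jieba.suggest_freq(segment, tune=True) adjusts the frequency of a single word so that it is either kept whole (pass a string) or split apart (pass a tuple of the pieces):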

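import jieba

text = '我再也回不到童年美好的时光了，哈哈，想想都觉得伤心了'
# Baseline segmentation with HMM new-word discovery turned off
print('/'.join(jieba.lcut(text, HMM=False)))
# Passing a tuple asks jieba to split 美好 into 美 and 好
jieba.suggest_freq(('美', '好'), tune=True)
print('/'.join(jieba.lcut(text, HMM=False)))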

7) Tokenizing to Get Start/End Positions

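Use jieba.tokenize(text) to obtain each token's start and end indices. It yields (word, start, end) tuples, where start is inclusive and end is exclusive, so text[start:end] equals the word.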

8) Changing Dictionary Path

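If the default dictionary does not meet your needs, point jieba at a different main dictionary file with jieba.set_dictionary. The call marks jieba as uninitialized, so the new file is loaded on the next segmentation call, or immediately if you call jieba.initialize() afterwards:

import jieba
jieba.set_dictionary('OSI.txt')  # path to the replacement dictionary file
jieba.initialize()               # optional: load it now instead of on first use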

Conclusion

jieba is a powerful Chinese segmentation tool that serves as a useful building block for data analysis, allowing you to extract, filter, and manipulate text efficiently, though it is only one part of a larger NLP workflow.
