Master Chinese Text Segmentation with jieba: Installation, Modes, and Advanced Tricks
This tutorial walks you through installing the jieba Python library, explains its three segmentation modes—precise, full, and search—demonstrates how to add or delete words, manage custom dictionaries, handle stop words, perform weight analysis, adjust word frequencies, and retrieve token positions, all with clear code examples and visual output.
Preface
Hello, I am Huang Wei. In the previous article we covered word clouds; now we explore word segmentation with the jieba library, which provides the Chinese tokenization that the wordcloud package lacks.
1. Using jieba
Installation
Install jieba by extracting the downloaded package, opening a command window in the package directory, and running:
python setup.py install
Alternatively, pip install jieba installs it from PyPI. After installation you can check the version information.
2. jieba Segmentation Modes
Precise Mode
This mode splits the text into the most accurate, non-overlapping tokens, which suits text analysis. Common functions: lcut(str), which returns a list, and cut(str), which returns a generator.
Example:
import jieba
aa = jieba.cut('任性的90后boy')
print('/'.join(aa))
The generator aa yields the segmented result.
Full Mode
Full mode lists every word the dictionary can find in the text, including overlapping ones. It is fast but does not resolve ambiguity. Functions:
lcut(str, cut_all=True)
cut(str, cut_all=True)
Search Engine Mode
Search engine mode starts from the precise-mode result and splits long words a second time, which improves recall when building a search index. Functions:
lcut_for_search(str)
cut_for_search(str)
You can also count how often a specific word appears in a segmented list (here ab is a list returned by one of the lcut functions):
print(ab.count('武汉'))  # returns 1
3. Other Applications of jieba
1) Adding New Words
Custom words can be added with jieba.add_word(word) to improve segmentation of names or phrases.
2) Adding a Dictionary
Load a user-defined dictionary with jieba.load_userdict(file). The file should contain one entry per line: word, optional frequency, optional part‑of‑speech, separated by spaces.
# Example line: 新词 10 n
jieba.load_userdict('mydict.txt')
3) Deleting Words
Unwanted words can be removed with jieba.del_word(word), restoring the original segmentation.
4) Handling Stop Words
Filter out common, non‑informative words (e.g., 的, 了, 哈哈) by maintaining a stop‑word list and skipping tokens that appear in it.
5) Weight Analysis
Rank keywords by TF-IDF weight using jieba.analyse.extract_tags. The weight reflects how distinctive a word is in the text, not its raw count.
Parameters: topK specifies how many keywords to return (default 20); withWeight=True returns each keyword together with its weight.
6) Adjusting Word Frequency
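jieba.suggest_freq(segment, tune=True) adjusts a word's frequency so that it is (or is not) split out. Automatically computed frequencies may have no effect while HMM new-word discovery is enabled, so pass HMM=False when segmenting.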
import jieba
# Segment once with the default frequencies
aa = jieba.lcut('我再也回不到童年美好的时光了,哈哈,想想都觉得伤心了', HMM=False)
print('/'.join(aa))
# Passing a tuple asks jieba to split 美好 into 美 and 好
jieba.suggest_freq(('美','好'), tune=True)
aa = jieba.lcut('我再也回不到童年美好的时光了,哈哈,想想都觉得伤心了', HMM=False)
print('/'.join(aa))
7) Tokenizing to Get Start/End Positions
Use jieba.tokenize(text) to obtain each token’s start and end indices.
8) Changing Dictionary Path
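If the default dictionary does not meet your needs, point jieba at a different dictionary file with jieba.set_dictionary. Call it before the first segmentation; initialize() then loads the new dictionary immediately instead of lazily on first use:
import jieba
jieba.set_dictionary('OSI.txt')  # path to the replacement dictionary
jieba.initialize()  # optional: load the dictionary now rather than on first use
Conclusion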
jieba is a powerful Chinese segmentation tool that serves as a useful building block for data analysis, allowing you to extract, filter, and manipulate text efficiently, though it is only one part of a larger NLP workflow.
Python Crawling & Data Mining
