Boost Chinese Sentiment Analysis: Master Jieba Segmentation and SnowNLP
This tutorial walks through Chinese text tokenization with Jieba, optimizes the token list using stop‑words and part‑of‑speech filtering, visualises word frequencies, and applies SnowNLP to perform sentiment analysis on Weibo comments, complete with code examples and result charts.
1. Word Segmentation
Chinese sentence tokenization, called "分词" (word segmentation), can be performed statistically or with a dictionary‑based approach. The Python jieba library combines both: it uses a built‑in dictionary to build a DAG of candidate segmentations, applies dynamic programming to find the most probable word sequence, and falls back on an HMM model for unknown words.
Segmentation principle
Jieba loads a dictionary of common Chinese words and part‑of‑speech tags, builds a directed acyclic graph (DAG) of possible word breaks, and selects the path with the highest probability using dynamic programming. An HMM model handles out‑of‑vocabulary terms.
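To make the principle concrete, here is a small demo (not from the original article) contrasting jieba's cut modes; the sample sentence is the one used in jieba's own README:

import jieba

sentence = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"

# Precise mode (default): DAG + dynamic programming picks one best path
print("/".join(jieba.cut(sentence)))  # 我/来到/北京/清华大学

# Full mode: emits every dictionary word found in the DAG
print("/".join(jieba.cut(sentence, cut_all=True)))

# Disabling the HMM shows its effect on out-of-vocabulary words
print("/".join(jieba.cut(sentence, HMM=False)))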
Segmentation code
import jieba  # Chinese word segmentation

with open('text.txt', 'r', encoding='utf-8') as f:
    read = f.read()

word = jieba.cut(read)  # returns a generator of tokens
print(list(word))

Printing the result shows many punctuation marks and irrelevant tokens.
2. Optimizing Segmentation
Two steps are applied: (1) remove punctuation and line breaks; (2) filter out words unrelated to sentiment using a stop‑word list.
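The article's code only shows step (2), so here is a minimal sketch of step (1), assuming a regular-expression filter is acceptable; the character classes are an illustration, not the original approach:

import re

def clean(text):
    # Keep CJK ideographs (\u4e00-\u9fa5), letters, and digits;
    # punctuation, line breaks, and other symbols are dropped
    return re.sub(r'[^\u4e00-\u9fa5A-Za-z0-9]', '', text)

read = clean(read)  # 'read' is the raw file content loaded in section 1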
Using a stop‑word list
import jieba

with open('text.txt', 'r', encoding='utf-8') as f:
    read = f.read()

# Load the stop-word list into a set, one entry per line. Testing
# membership against the raw file string would also match substrings;
# a set makes the check exact (and O(1)).
with open('停用词表.txt', 'r', encoding='utf-8') as f:
    stop_words = set(f.read().splitlines())

words = []
for i in jieba.cut(read):
    if i not in stop_words:
        words.append(i)

After filtering, only meaningful words remain.
Extracting keywords by part of speech
import jieba.posseg as psg

# Segment and tag each token with its part of speech
words = []
for i in psg.cut(read):
    words.append((i.word, i.flag))  # (token, POS tag)

save = ['a']  # keep adjectives; add more POS tags as needed, e.g. 'v', 'd'
for i in words:
    if i[1] in save:
        print(i)
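Jieba also bundles a TF-IDF keyword extractor that can restrict output by part of speech directly; the following one-call alternative is a sketch, not the article's original method:

import jieba.analyse

# Top 10 keywords by TF-IDF weight, restricted to adjectives ('a')
for word, weight in jieba.analyse.extract_tags(read, topK=10, withWeight=True, allowPOS=('a',)):
    print(word, weight)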
3. Result Presentation
A word‑frequency bar chart of the top 10 tokens is generated with pyecharts:
from collections import Counter
from pyecharts.charts import Bar
from pyecharts import options as opts

# 'words' here is the stop-word-filtered token list from section 2
columns = []
data = []
for k, v in Counter(words).most_common(10):
    columns.append(k)
    data.append(v)

bar = (Bar()
       .add_xaxis(columns)
       .add_yaxis("词频", data)
       .set_global_opts(title_opts=opts.TitleOpts(title="词频top10")))
bar.render("词频.html")

The chart shows "头发" (hair) as the most frequent term, followed by "考研" (the postgraduate entrance exam) and "图书馆" (library).
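pyecharts can also render the same counts as a word cloud; a minimal sketch (not in the original article), with the output file name chosen arbitrarily:

from collections import Counter
from pyecharts.charts import WordCloud

pairs = Counter(words).most_common(50)  # (word, count) pairs
wc = WordCloud()
wc.add("", pairs, word_size_range=[20, 80])
wc.render("词云.html")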
4. Sentiment Analysis
Sentiment is evaluated with the snownlp library. Each token's sentiments attribute returns the probability (between 0 and 1) that it is positive; scores above 0.7 are counted as positive, below 0.3 as negative, and the rest as neutral.
from snownlp import SnowNLP

positive = negative = neutral = 0
for i in words:
    pingfen = SnowNLP(i)  # pingfen holds the per-token SnowNLP object
    if pingfen.sentiments > 0.7:
        positive += 1
    elif pingfen.sentiments < 0.3:
        negative += 1
    else:
        neutral += 1

The resulting distribution is roughly 32% positive, 60% neutral, and 8% negative.
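To turn the raw counts into the percentages quoted above, a short follow-up (not part of the original code):

total = positive + negative + neutral
for label, count in (("positive", positive), ("neutral", neutral), ("negative", negative)):
    print(f"{label}: {count / total:.0%}")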
5. Summary
The article demonstrates how to use Jieba for Chinese word segmentation, improve the token list with stop‑words and POS filtering, visualise word frequencies, and finally apply SnowNLP for sentiment analysis of Weibo comments.
