Boost Chinese Sentiment Analysis: Master Jieba Segmentation and SnowNLP
This tutorial walks through Chinese text tokenization with Jieba, optimizes the token list using stop‑words and part‑of‑speech filtering, visualises word frequencies, and applies SnowNLP to perform sentiment analysis on Weibo comments, complete with code examples and result charts.
1. Word Segmentation
Chinese sentence tokenization, called "分词" (word segmentation), can be performed statistically or with a dictionary‑based approach. The Python jieba library combines both: it uses a built‑in dictionary to build a DAG of candidate segmentations, applies dynamic programming to find the most probable word sequence, and falls back on an HMM model for unknown words.
Segmentation principle
Jieba loads a dictionary of common Chinese words and part‑of‑speech tags, builds a directed acyclic graph (DAG) of possible word breaks, and selects the path with the highest probability using dynamic programming. An HMM model handles out‑of‑vocabulary terms.
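To make the principle concrete, here is a small demo (not from the original article) contrasting jieba's cut modes; the sample sentence is the one used in jieba's own README:

import jieba

sentence = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"

# Precise mode (default): DAG + dynamic programming picks one best path
print("/".join(jieba.cut(sentence)))  # 我/来到/北京/清华大学

# Full mode: emits every dictionary word found in the DAG
print("/".join(jieba.cut(sentence, cut_all=True)))

# Disabling the HMM shows its effect on out-of-vocabulary words
print("/".join(jieba.cut(sentence, HMM=False)))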
Segmentation code
import jieba  # Chinese word segmentation

with open('text.txt', 'r', encoding='utf-8') as f:
    read = f.read()

word = jieba.cut(read)  # returns a generator of tokens
print(list(word))

Printing the result shows many punctuation marks and irrelevant tokens.
2. Optimizing Segmentation
Two steps are applied: (1) remove punctuation and line breaks; (2) filter out words unrelated to sentiment using a stop‑word list.
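The article's code only shows step (2), so here is a minimal sketch of step (1), assuming a regular-expression filter is acceptable; the character classes are an illustration, not the original approach:

import re

def clean(text):
    # Keep CJK ideographs (\u4e00-\u9fa5), letters, and digits;
    # punctuation, line breaks, and other symbols are dropped
    return re.sub(r'[^\u4e00-\u9fa5A-Za-z0-9]', '', text)

read = clean(read)  # 'read' is the raw file content loaded in section 1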
Using a stop‑word list
import jieba

with open('text.txt', 'r', encoding='utf-8') as f:
    read = f.read()

# Load the stop-word list into a set, one entry per line. Testing
# membership against the raw file string would also match substrings;
# a set makes the check exact (and O(1)).
with open('停用词表.txt', 'r', encoding='utf-8') as f:
    stop_words = set(f.read().splitlines())

words = []
for i in jieba.cut(read):
    if i not in stop_words:
        words.append(i)

After filtering, only meaningful words remain.
Extracting keywords by part of speech
import jieba.posseg as psg

# Segment and tag each token with its part of speech
words = []
for i in psg.cut(read):
    words.append((i.word, i.flag))  # (token, POS tag)

save = ['a']  # keep adjectives; add more POS tags as needed, e.g. 'v', 'd'
for i in words:
    if i[1] in save:
        print(i)
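Jieba also bundles a TF-IDF keyword extractor that can restrict output by part of speech directly; the following one-call alternative is a sketch, not the article's original method:

import jieba.analyse

# Top 10 keywords by TF-IDF weight, restricted to adjectives ('a')
for word, weight in jieba.analyse.extract_tags(read, topK=10, withWeight=True, allowPOS=('a',)):
    print(word, weight)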
3. Result Presentation
A word‑frequency bar chart of the top 10 tokens is generated with pyecharts:
from collections import Counter
from pyecharts.charts import Bar
from pyecharts import options as opts

# 'words' here is the stop-word-filtered token list from section 2
columns = []
data = []
for k, v in Counter(words).most_common(10):
    columns.append(k)
    data.append(v)

bar = (Bar()
       .add_xaxis(columns)
       .add_yaxis("词频", data)
       .set_global_opts(title_opts=opts.TitleOpts(title="词频top10")))
bar.render("词频.html")

The chart shows "头发" (hair) as the most frequent term, followed by "考研" (the postgraduate entrance exam) and "图书馆" (library).
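pyecharts can also render the same counts as a word cloud; a minimal sketch (not in the original article), with the output file name chosen arbitrarily:

from collections import Counter
from pyecharts.charts import WordCloud

pairs = Counter(words).most_common(50)  # (word, count) pairs
wc = WordCloud()
wc.add("", pairs, word_size_range=[20, 80])
wc.render("词云.html")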
4. Sentiment Analysis
Sentiment is evaluated with the snownlp library. Each token's sentiments attribute returns the probability (between 0 and 1) that it is positive; scores above 0.7 are counted as positive, below 0.3 as negative, and the rest as neutral.
from snownlp import SnowNLP

positive = negative = neutral = 0
for i in words:
    pingfen = SnowNLP(i)  # pingfen holds the per-token SnowNLP object
    if pingfen.sentiments > 0.7:
        positive += 1
    elif pingfen.sentiments < 0.3:
        negative += 1
    else:
        neutral += 1

The resulting distribution is roughly 32% positive, 60% neutral, and 8% negative.
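To turn the raw counts into the percentages quoted above, a short follow-up (not part of the original code):

total = positive + negative + neutral
for label, count in (("positive", positive), ("neutral", neutral), ("negative", negative)):
    print(f"{label}: {count / total:.0%}")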
5. Summary
The article demonstrates how to use Jieba for Chinese word segmentation, improve the token list with stop‑words and POS filtering, visualise word frequencies, and finally apply SnowNLP for sentiment analysis of Weibo comments.
