How to Analyze Chinese Sentiment Text Data: From Stats to Word Clouds

This article guides Java developers through a complete Chinese sentiment‑analysis dataset exploration, covering label distribution, sentence length statistics, vocabulary counts, adjective extraction, and visual word‑cloud generation using Python libraries such as pandas, seaborn, jieba, and wordcloud.

JavaEdge

Purpose

The article demonstrates how to explore a Chinese sentiment‑analysis corpus, identify potential data quality issues, and obtain quantitative guidance for hyper‑parameter choices in downstream model training.

Common Text‑Data Analysis Methods

Label distribution statistics.

Sentence‑length distribution.

Word‑frequency analysis and adjective word‑cloud visualization.

Dataset Description

The dataset consists of two tab‑separated files for binary sentiment classification:

train.tsv – training set.

dev.tsv – validation set (same format as the training set).

Each line contains a comment text and a label (0 = negative, 1 = positive).
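Before any analysis it is worth confirming that the files parse as expected. The sketch below loads a couple of hypothetical rows in the same layout (column names `sentence` and `label` match the code later in the article; the example comments are invented):

```python
import io
import pandas as pd

# Two hypothetical rows illustrating the expected tab-separated layout
sample = (
    "sentence\tlabel\n"
    "房间很干净，服务也好\t1\n"      # "The room is clean, good service" -> positive
    "设施太旧，隔音很差\t0\n"        # "Facilities are old, poor soundproofing" -> negative
)
df = pd.read_csv(io.StringIO(sample), sep="\t")
print(df.shape)
print(df.columns.tolist())
```

Running `df.head()` on the real `train.tsv` should show the same two-column structure.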

Data Analysis

Label Distribution

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
train_data = pd.read_csv("./cn_data/train.tsv", sep="\t")
valid_data = pd.read_csv("./cn_data/dev.tsv", sep="\t")

sns.countplot(x="label", data=train_data)
plt.title("Training set label distribution")
plt.show()

sns.countplot(x="label", data=valid_data)
plt.title("Validation set label distribution")
plt.show()

Both sets are roughly balanced (≈ 1:1), so a majority‑class baseline scores only about 50 % and training is not skewed toward either label.
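The baseline claim can be checked numerically: the accuracy of always predicting the majority class equals the frequency of the most common label. A minimal sketch, using a hypothetical label column in place of `train_data["label"]`:

```python
import pandas as pd

# Hypothetical label column standing in for train_data["label"]
labels = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# A majority-class baseline scores the share of the most common label,
# so a ~1:1 split pins the baseline near 50 %
baseline = labels.value_counts(normalize=True).max()
print(f"majority-class baseline accuracy: {baseline:.2%}")
```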

Sentence‑Length Distribution

# Add a sentence-length column (length in characters)
train_data["sentence_length"] = train_data["sentence"].str.len()
valid_data["sentence_length"] = valid_data["sentence"].str.len()

# Count plot (focus on the y-axis); the column must be passed as a keyword argument
sns.countplot(x="sentence_length", data=train_data)
plt.xticks([])
plt.show()

# Density plot (focus on the x-axis); histplot replaces the deprecated distplot
sns.histplot(train_data["sentence_length"], kde=True)
plt.yticks([])
plt.show()

# Repeat for the validation data
sns.countplot(x="sentence_length", data=valid_data)
plt.xticks([])
plt.show()

sns.histplot(valid_data["sentence_length"], kde=True)
plt.yticks([])
plt.show()

Most sentences fall in the 20‑250 character range, which informs the choice of a fixed input length (padding/truncation) for models.

Length vs. Label Scatter Plot

sns.stripplot(x='label', y='sentence_length', data=train_data)
plt.title("Training set length by label")
plt.show()

sns.stripplot(x='label', y='sentence_length', data=valid_data)
plt.title("Validation set length by label")
plt.show()

Scatter plots expose outliers (e.g., a positive sample with length ≈ 3500) that should be inspected manually.
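Once an outlier is spotted on the plot, a boolean filter pulls the offending rows for manual inspection. A minimal sketch with an invented frame standing in for `train_data` (the 1000‑character threshold is an assumption, chosen from the plot):

```python
import pandas as pd

# Hypothetical frame standing in for train_data
df = pd.DataFrame({
    "sentence": ["短评", "正常长度的评论" * 10, "超长评论" * 900],
    "label": [0, 1, 1],
})
df["sentence_length"] = df["sentence"].str.len()

# Pull samples far beyond the bulk of the distribution for manual review
outliers = df[df["sentence_length"] > 1000]
print(outliers[["label", "sentence_length"]])
```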

Vocabulary Size

import jieba
from itertools import chain

train_vocab = set(chain(*map(jieba.lcut, train_data["sentence"])))
print("Training vocab size:", len(train_vocab))

valid_vocab = set(chain(*map(jieba.lcut, valid_data["sentence"])))
print("Validation vocab size:", len(valid_vocab))

The training set contains 12 147 unique tokens; the validation set contains 6 857.
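Beyond the raw counts, it can be useful to measure how many validation tokens never appear in training (the out‑of‑vocabulary rate), since those tokens map to an unknown symbol at inference time. A sketch with tiny hypothetical vocabularies standing in for `train_vocab` and `valid_vocab`:

```python
# Hypothetical vocabularies standing in for train_vocab and valid_vocab
train_vocab = {"房间", "干净", "服务", "不错"}
valid_vocab = {"房间", "服务", "一般", "糟糕"}

# Tokens seen at validation time but never in training are out-of-vocabulary
oov = valid_vocab - train_vocab
oov_rate = len(oov) / len(valid_vocab)
print(f"OOV tokens: {sorted(oov)}, rate: {oov_rate:.0%}")
```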

High‑Frequency Adjective Word Cloud (Training Set)

import jieba.posseg as pseg
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Extract adjectives (jieba POS flag "a" marks adjectives)
def get_a_list(text):
    return [g.word for g in pseg.lcut(text) if g.flag == "a"]

# Word‑cloud generator
def get_word_cloud(keywords_list):
    wc = WordCloud(font_path="./SimHei.ttf", max_words=100, background_color="white")
    wc.generate(" ".join(keywords_list))
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()

# Positive samples
p_train = train_data[train_data["label"] == 1]["sentence"]
train_p_a = chain(*map(get_a_list, p_train))

# Negative samples
n_train = train_data[train_data["label"] == 0]["sentence"]
train_n_a = chain(*map(get_a_list, n_train))

get_word_cloud(train_p_a)
get_word_cloud(train_n_a)

The positive‑sample cloud is dominated by positive adjectives; the negative‑sample cloud shows mainly negative adjectives, with a few unexpected positive words (e.g., “便利”, “convenient”) that merit manual review.
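When a cloud surfaces a surprising word, a substring filter on the opposite‑label slice retrieves the sentences behind it for review. A minimal sketch with invented negative samples standing in for `n_train`:

```python
import pandas as pd

# Hypothetical negative-label samples standing in for n_train
df = pd.DataFrame({
    "sentence": ["交通便利但房间太脏", "服务态度恶劣", "位置便利，可惜设施老旧"],
    "label": [0, 0, 0],
})

# Surface negative samples containing the unexpected positive adjective "便利"
suspects = df[df["sentence"].str.contains("便利")]
print(len(suspects), "samples to review")
```

In this toy example both hits are legitimate negatives (“convenient location, but…”), the typical explanation for positive adjectives inside negative reviews.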

Adjective Word Cloud (Validation Set)

# Positive samples
p_valid = valid_data[valid_data["label"] == 1]["sentence"]
valid_p_a = chain(*map(get_a_list, p_valid))

# Negative samples
n_valid = valid_data[valid_data["label"] == 0]["sentence"]
valid_n_a = chain(*map(get_a_list, n_valid))

get_word_cloud(valid_p_a)
get_word_cloud(valid_n_a)

The same polarity pattern holds for the validation set, confirming overall corpus quality.

Summary

The workflow provides a reproducible pipeline for label statistics, length analysis, vocabulary counting, and adjective word‑cloud visualization, enabling Java developers to assess corpus quality and set appropriate hyper‑parameters before model training.

Source code and examples are available at https://github.com/Java-Interview-Tutorial

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Python, Sentiment Analysis, NLP, Data Visualization, jieba, text analysis, wordcloud
Written by JavaEdge

Hands‑on development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.
