NLP Study Notes: 4 Essential Steps for Preprocessing Chinese Text Corpora
This article walks through the four core steps of preparing a Chinese NLP corpus: collecting data, cleaning it with regular expressions and encoding detection, tokenizing it with dictionary-based or statistical methods (HMM, CRF) and tools such as jieba, and standardizing it through stop-word removal, vocabulary building, and one-hot encoding. Each step is illustrated with code snippets and practical examples.
Natural Language Processing (NLP) aims to enable computers to understand, process, and generate human language. When working with Chinese text, a high‑quality corpus is essential, and the preparation process can be broken down into four practical steps.
Step 1 – Collect Data
Sources include open-source corpora, web crawlers for domain-specific content, and internal data such as product reviews, social-media posts, and support tickets. The article emphasizes treating data as a valuable asset and archiving useful corpora wherever that is legally permitted.
Step 2 – Clean Data
Cleaning removes useless symbols and normalizes formats. The author follows the rule “Your model is only as good as your data.” Python’s lxml library handles HTML/XML, while regular expressions process plain text. Example code:
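import re

# Match runs of Chinese characters in the range U+4E00 to U+9FD5.
re_han_default = re.compile("([\u4E00-\u9FD5]+)", re.U)
sentence = "我/爱/自/然/语/言/处/理"
# Splitting on a capturing group keeps the matched runs in the result.
blocks = re_han_default.split(sentence)
for blk in blocks:
    # Skip empty strings and the "/" separators; print only Chinese blocks.
    if blk and re_han_default.match(blk):
        print(blk)
Output: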
我
爱
自
然
语
言
处
理
Encoding differences are highlighted: Chinese-locale Windows has traditionally defaulted to GBK (code page 936, a superset of GB2312), while Linux typically uses UTF-8. The chardet library is recommended for automatic detection, either from the command line (chardetect somefile) or programmatically.
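As a rough illustration of handling mixed encodings programmatically, the sketch below tries a short list of candidate encodings in order and keeps the first one that decodes cleanly. It is a dependency-free stand-in, not chardet's statistical detection; the function name decode_with_fallback and the candidate list are assumptions made for this example. With chardet installed, chardet.detect(raw_bytes) returns a dict containing an encoding guess and a confidence score, which could be used to choose the codec instead.

```python
def decode_with_fallback(raw, candidates=("utf-8", "gb18030")):
    """Return (text, encoding) using the first candidate that decodes cleanly.

    Order matters: UTF-8 decoding is strict and rejects most GBK byte
    sequences, while gb18030 (a superset of GBK and GB2312) is tried second.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


if __name__ == "__main__":
    sample = "我爱自然语言处理"
    for codec in ("utf-8", "gbk"):
        text, guessed = decode_with_fallback(sample.encode(codec))
        print(codec, "->", guessed, text)
```

Trial decoding only works when the candidate list is short and ordered from strictest to most permissive; for files of unknown origin, a statistical detector such as chardet is the more robust choice.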
Step 3 – Tokenization
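Chinese tokenization is harder than English tokenization because Chinese text has no natural word delimiters and contains many ambiguous character sequences. Three typical differences between the two languages are identified:
Segmentation method: English words are already separated by spaces, whereas Chinese words must be found by a segmenter, which is the harder problem.
Morphology: English words have many morphological forms, so English pipelines need lemmatization and stemming.
Granularity: Chinese segmentation must choose a granularity, e.g., “中国科学技术大学” can be kept as one word or split into smaller units.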
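The granularity problem can be made concrete with a small sketch that enumerates every way to split a string into dictionary words. The dictionary here is a hypothetical toy vocabulary chosen only for illustration, and all_segmentations is a name invented for this example, not part of any library.

```python
def all_segmentations(text, vocab):
    """List every way to split `text` into words that all appear in `vocab`."""
    if not text:
        return [[]]  # the empty string has exactly one segmentation: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            # Combine this first word with every segmentation of the remainder.
            for rest in all_segmentations(text[end:], vocab):
                results.append([head] + rest)
    return results


# Hypothetical toy dictionary, chosen only to illustrate granularity.
TOY_VOCAB = {"中国", "科学", "技术", "大学", "科学技术", "中国科学技术大学"}

if __name__ == "__main__":
    for words in all_segmentations("中国科学技术大学", TOY_VOCAB):
        print(" / ".join(words))
```

With this toy dictionary the string has three valid segmentations, from four short words down to a single eight-character word. Exhaustive enumeration grows exponentially with sentence length, so real tokenizers pick one segmentation with a matching rule or a statistical model rather than listing them all.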
Two main families of tokenizers are described:
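Dictionary matching (forward, reverse, and bidirectional maximum matching): fast and low-cost, but it adapts poorly to new words and ambiguous contexts.
Statistical / machine-learning approaches (HMM, CRF). Tools such as the Stanford segmenter, HanLP, and especially jieba combine statistical models with dictionaries to handle ambiguities and out-of-vocabulary words. Note that jieba pairs a prefix dictionary with an HMM for unseen words rather than a CRF.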
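The dictionary-matching family can be sketched in a few lines. The code below is a minimal illustration of forward, reverse, and bidirectional maximum matching, not the implementation used by any of the tools named above. The toy vocabulary, the function names, and the tie-breaking rule in the bidirectional variant (fewer words, then fewer single-character words, otherwise the reverse result) are assumptions made for this example.

```python
def forward_max_match(text, vocab, max_len):
    """Greedy left-to-right: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len):
    """Greedy right-to-left: take the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]  # words were collected from the end, so reverse them


def bidirectional_max_match(text, vocab, max_len):
    """Run both directions and keep the result a simple heuristic prefers."""
    def score(words):
        # Lower is better: fewer words, then fewer single-character words.
        return (len(words), sum(len(w) == 1 for w in words))

    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    return fwd if score(fwd) < score(bwd) else bwd


# Hypothetical toy dictionary for illustration only.
TOY_VOCAB = {"研究", "研究生", "生命", "起源"}

if __name__ == "__main__":
    sentence = "研究生命的起源"
    print("forward: ", " / ".join(forward_max_match(sentence, TOY_VOCAB, 3)))
    print("backward:", " / ".join(backward_max_match(sentence, TOY_VOCAB, 3)))
    print("chosen:  ", " / ".join(bidirectional_max_match(sentence, TOY_VOCAB, 3)))
```

On this sentence the forward scan greedily takes “研究生” and leaves “命” stranded, while the reverse scan yields 研究 / 生命 / 的 / 起源. This is the kind of ambiguity that pure dictionary matching cannot resolve on its own and that statistical models are meant to address.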
Step 4 – Standardization
Standardization prepares data for downstream tasks: removing stop‑words (e.g., “其中”, “况且”, “什么”), building a vocabulary, and converting tokens to numeric vectors. One‑hot encoding is illustrated with a small vocabulary:
我
爱
自然
语言
处理
Resulting vectors:
我: [1, 0, 0, 0, 0]
爱: [0, 1, 0, 0, 0]
自然: [0, 0, 1, 0, 0]
语言: [0, 0, 0, 1, 0]
处理: [0, 0, 0, 0, 1]
The article notes that one-hot vectors grow as long as the vocabulary itself, are almost entirely zeros, and carry no information about how similar two words are, which motivates the use of embeddings such as Word2vec and BERT.
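The three standardization operations described in this step (stop-word removal, vocabulary building, and one-hot encoding) can be sketched as follows. The helper names and the tokenized input are hypothetical; the stop words are the three examples the article lists, and the vocabulary order is simply first appearance, which is one reasonable convention among several.

```python
def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]


def build_vocab(tokens):
    """Map each distinct token to an index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def one_hot(token, vocab):
    """Return a vector of zeros with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


STOP_WORDS = {"其中", "况且", "什么"}
# Hypothetical tokenizer output, padded with stop words for the demo.
TOKENS = ["我", "爱", "自然", "语言", "处理", "况且", "什么"]

if __name__ == "__main__":
    kept = remove_stop_words(TOKENS, STOP_WORDS)
    vocab = build_vocab(kept)
    for tok in kept:
        print(f"{tok}: {one_hot(tok, vocab)}")
```

Each vector has one entry per vocabulary word, so a vocabulary of tens of thousands of words produces vectors of the same length with a single non-zero entry.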
Feature Extraction and Learning Paradigms
Features can be extracted manually (statistical) or via embeddings. The author distinguishes supervised learning (labeled data, e.g., sentiment analysis) from unsupervised learning (discovering hidden structures). An example of labeled training data is shown, where each line ends with __label__0 or __label__1 indicating the class.
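A minimal sketch of reading such labeled lines is shown below. The sample sentences and their labels are invented for illustration, and the article does not say which class 0 and 1 stand for; parse_labeled_line is a hypothetical helper, not part of any library.

```python
LABEL_PREFIX = "__label__"


def parse_labeled_line(line):
    """Split 'some text __label__1' into ('some text', 1)."""
    text, sep, label = line.strip().rpartition(LABEL_PREFIX)
    if not sep or not label.isdigit():
        raise ValueError(f"no trailing {LABEL_PREFIX}<int> in line: {line!r}")
    return text.strip(), int(label)


# Hypothetical, already-tokenized samples; the labels are invented.
SAMPLES = [
    "这个 产品 很 好用 __label__1",
    "物流 太 慢 了 __label__0",
]

if __name__ == "__main__":
    for raw in SAMPLES:
        text, label = parse_labeled_line(raw)
        print(label, text)
```

Separating text from label this way yields the (input, target) pairs that a supervised classifier is trained on; unsupervised methods would use only the text part.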
Overall, the article provides a step‑by‑step guide to preparing Chinese corpora for NLP tasks, covering data acquisition, cleaning, tokenization, standardization, and feature extraction, and it supplies concrete code snippets, examples, and references to popular tools.
Lisa Notes
Lisa's notes: musings on daily life, work, study, personal growth, and casual reflections.
