Top Python and Java NLP Tools for Chinese Text Processing

This article surveys a wide range of natural language processing libraries—including Python packages like NLTK and spaCy, Java frameworks such as OpenNLP and StanfordNLP, and specialized Chinese tokenizers like IKAnalyzer, ICTCLAS, and FudanNLP—detailing their features, usage, and setup steps for Chinese text analysis.

MaGe Linux Operations

Apr 25, 2017

1 Python NLP Tools

NLTK: a leading Python library offering WordNet access, classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

Pattern: provides part‑of‑speech tagging, n‑gram search, sentiment analysis, WordNet, vector‑space models, clustering, and SVM support.

TextBlob: simple APIs for POS tagging, noun‑phrase extraction, sentiment analysis, classification, translation, and more.

Gensim: topic modeling, document indexing, similarity retrieval for large corpora, handling data larger than RAM.

PyNLPl: a collection of modules for common NLP tasks, including n-gram search, frequency tables, language modeling, and advanced data structures such as priority queues and beam search.

spaCy: a commercial open-source library written in Python and Cython, delivering industrial-strength speed and accuracy.

Polyglot: massively multilingual processing, with tokenization for 165 languages, language identification for 196, named-entity recognition for 40, POS tagging for 16, sentiment analysis for 136, embeddings for 137, morphological analysis for 135, and transliteration.

MontyLingua: an end‑to‑end English processing tool extracting semantic tuples, adjectives, nouns, verbs, named entities, dates, and times.

BLLIP Parser (Charniak-Johnson parser): a statistical constituency parser paired with a maximum-entropy reranker, offering both command-line and Python interfaces.

Quepy: a Python framework that converts natural language queries into database query languages with minimal code changes.

HanLP: a Java toolkit (also callable from Python) providing comprehensive lexical, syntactic, and semantic analysis for production environments.

2 OpenNLP: Chinese Named Entity Recognition

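Apache OpenNLP is a Java machine-learning toolkit for common NLP tasks such as tokenization, sentence detection, POS tagging, named entity recognition, chunking, and parsing.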

Pre‑processing involves tokenizing text and inserting spaces so OpenNLP can treat Chinese input similarly to English.
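The pre-processing step can be sketched as follows. This is a minimal illustration, not the original project's code: it uses a toy forward-maximum-matching segmenter and a hypothetical vocabulary in place of a real Chinese segmenter, and shows only how segmented tokens are joined with spaces.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy longest-match segmentation (toy stand-in for a real segmenter)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def to_space_delimited(text, vocab):
    """Join tokens with single spaces, as a whitespace tokenizer expects."""
    return " ".join(forward_max_match(text, vocab))


# Hypothetical vocabulary, for illustration only.
VOCAB = {"我们", "喜欢", "自然", "语言", "自然语言", "处理"}
print(to_space_delimited("我们喜欢自然语言处理", VOCAB))
# 我们 喜欢 自然语言 处理
```

In practice the tokens would come from one of the segmenters covered later in this article rather than from a hand-written word list.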

Entity dictionaries are stored as plain‑text files named after the entity type; two functions load the dictionary words and the entity categories.
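A possible shape for those two loader functions is sketched below. The function names, the .txt extension, and the one-word-per-line UTF-8 layout are assumptions made for illustration; the original project may organize its files differently.

```python
from pathlib import Path


def load_entity_types(dict_dir):
    """Entity categories are the dictionary file names without the extension."""
    return sorted(p.stem for p in Path(dict_dir).glob("*.txt"))


def load_entity_words(dict_dir):
    """Map each dictionary word to the entity type of the file it came from."""
    word_to_type = {}
    for path in sorted(Path(dict_dir).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            word = line.strip()
            if word:  # skip blank lines
                word_to_type[word] = path.stem
    return word_to_type
```

With files named Person.txt and Action.txt in the directory, load_entity_types() would return ["Action", "Person"], and load_entity_words() would map each listed word to one of those two types.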

Training data must be annotated with <START:type> and <END> tags indicating entity boundaries and types, e.g.:

XXXXXX<START:Person>????<END>XXXXXXXXX<START:Action>????<END>XXXXXXX
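One way to produce such lines from segmented text and an entity dictionary is sketched below. The helper name, the sample tokens, and the dictionary are hypothetical, and the sketch handles only single-token entities. Tags and tokens are separated by spaces, as OpenNLP's name finder training format expects.

```python
def annotate(tokens, word_to_type):
    """Wrap every token found in the entity dictionary in <START:type> ... <END>."""
    out = []
    for token in tokens:
        entity_type = word_to_type.get(token)
        if entity_type is None:
            out.append(token)
        else:
            out.append(f"<START:{entity_type}> {token} <END>")
    return " ".join(out)


# Hypothetical tokens and dictionary, for illustration only.
tokens = ["张三", "昨天", "访问", "了", "北京"]
word_to_type = {"张三": "Person", "访问": "Action"}
print(annotate(tokens, word_to_type))
# <START:Person> 张三 <END> 昨天 <START:Action> 访问 <END> 了 北京
```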

Training the NER model takes custom feature generators plus three parameters: the iteration count, the cutoff, and the language code.

iterations: number of training iterations (too few → underfitting, too many → overfitting).

cutoff: minimum number of times a feature must occur in the training data to be kept in the model (default 5).

langCode: language code stored with the model; for Chinese use the generic code.
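In OpenNLP's training parameters the cutoff acts as a frequency threshold on features. The sketch below illustrates that idea conceptually; it is not OpenNLP's implementation, and the feature strings are made up for the example.

```python
from collections import Counter


def apply_cutoff(feature_events, cutoff=5):
    """Keep only features observed at least `cutoff` times across all events."""
    counts = Counter(f for event in feature_events for f in event)
    kept = {f for f, n in counts.items() if n >= cutoff}
    return [[f for f in event if f in kept] for event in feature_events]


# Hypothetical feature lists, one per training event.
events = [["w=张三", "prev=BOS"], ["w=张三"], ["w=李四"]]
print(apply_cutoff(events, cutoff=2))
# [['w=张三'], ['w=张三'], []]
```

A higher cutoff discards rare features, which shrinks the model and reduces noise. On a small corpus it can also remove most of the useful evidence.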

The core training method trainNameEntitySamples() reads the annotated strings, creates a character stream, and calls NameFinderME.train() with the configured parameters.
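To show what the trainer has to recover from each annotated string, the sketch below parses one line into tokens and typed spans. It is an illustrative reimplementation in Python under the space-separated tag format, not the code that trainNameEntitySamples() or OpenNLP runs.

```python
import re

TAG = re.compile(r"<START:([^>]+)>|<END>")


def parse_sample(line):
    """Split an annotated line into tokens and (start, end, type) spans, end exclusive."""
    tokens, spans = [], []
    open_type, open_start = None, None
    for piece in line.split():
        match = TAG.fullmatch(piece)
        if match is None:                      # ordinary token
            tokens.append(piece)
        elif match.group(1) is not None:       # <START:type>
            if open_type is not None:
                raise ValueError("nested <START> tag")
            open_type, open_start = match.group(1), len(tokens)
        else:                                  # <END>
            if open_type is None:
                raise ValueError("<END> without <START>")
            spans.append((open_start, len(tokens), open_type))
            open_type = None
    if open_type is not None:
        raise ValueError("missing <END>")
    return tokens, spans


sample = "<START:Person> 张三 <END> 昨天 <START:Action> 访问 <END> 了 北京"
print(parse_sample(sample))
# (['张三', '昨天', '访问', '了', '北京'], [(0, 1, 'Person'), (2, 3, 'Action')])
```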

Source code is available at https://github.com/Ailab403/ailab-mltk4j, with demos and pre-trained models in the test package.

3 StanfordNLP

The Stanford NLP Group provides several Java tools:

Stanford CoreNLP: tokenization, POS tagging, NER, and parsing, primarily for English, with models available for other languages including Chinese.

Stanford Word Segmenter: CRF‑based tokenizer supporting Chinese and Arabic.

Other standalone tools include the Stanford POS Tagger, Named Entity Recognizer (a CRF model), Parser, and Classifier.

Implementing Chinese NER

Download the Stanford segmenter and NER packages, extract them, and place the data files (e.g., ctb.gz, pku.gz) in a data directory.

Configure the classpath with stanford-ner.jar, stanford-segmenter.jar, and related libraries, then run the demo using JUnit.

4 IKAnalyzer

IK Analyzer is an open-source Java Chinese tokenizer offering fine-grained and smart segmentation modes, supporting letters, numbers, Chinese, Korean, and Japanese characters. Custom dictionaries are registered in IKAnalyzer.cfg.xml; each dictionary file is UTF-8 encoded with one word per line.
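The dictionary setup can be scripted, as in the sketch below. The helper name and the file names ext.dic and stopword.dic are assumptions. The entry keys ext_dict and ext_stopwords follow the sample configuration distributed with IK Analyzer and should be checked against the version in use.

```python
from pathlib import Path

CONFIG_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">{ext_dict};</entry>
    <entry key="ext_stopwords">{ext_stopwords};</entry>
</properties>
"""


def write_ik_config(target_dir, words, stopwords):
    """Write two UTF-8 dictionaries (one word per line) and a config naming them."""
    target = Path(target_dir)
    (target / "ext.dic").write_text("\n".join(words) + "\n", encoding="utf-8")
    (target / "stopword.dic").write_text("\n".join(stopwords) + "\n", encoding="utf-8")
    config_path = target / "IKAnalyzer.cfg.xml"
    config_path.write_text(
        CONFIG_TEMPLATE.format(ext_dict="ext.dic", ext_stopwords="stopword.dic"),
        encoding="utf-8",
    )
    return config_path
```

The generated files then need to be placed where IK Analyzer looks for its configuration, typically the root of the classpath.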

5 ICTCLAS

ICTCLAS (also known as NLPIR) is a C++ Chinese segmentation system from the Chinese Academy of Sciences, supporting word segmentation, POS tagging, NER, user dictionaries, and multiple encodings (GBK, UTF‑8, BIG5). Recent versions add micro‑blog segmentation, new‑word discovery, and keyword extraction.

6 FudanNLP

FudanNLP is a Java library for Chinese NLP, released under LGPL‑3.0. It provides information retrieval (text classification, news clustering), Chinese processing (segmentation, POS tagging, NER, keyword extraction, dependency parsing, temporal phrase recognition), and structured learning (online learning, hierarchical classification, clustering, exact inference).

Deploy fudannlp.jar and dependent JARs in the project’s lib directory.

Models for segmentation, POS tagging, and NER reside in the models folder.

Example code and documentation are available in the example and java-docs directories.
