
Modeling Chinese Word Segmentation with Hidden Markov Models

This article explains how Hidden Markov Models can be used to model Chinese word segmentation, covering the underlying Markov process, model parameters, basic HMM problems, and both supervised and unsupervised training methods.

Hulu Beijing

Scene Description

Sequence labeling assigns a label to each element in a sequence and is applied in many NLP tasks such as Chinese word segmentation, POS tagging, semantic role labeling, NER, and speech recognition.

Problem Description

Describe how to model Chinese word segmentation with a Hidden Markov Model (HMM) and how to train the model given a corpus.

Answer and Analysis

Background: An HMM is a classic generative model that assumes a hidden Markov chain generates observable sequences. It is widely used for sequence labeling in NLP and speech.

In a Markov process, the state at time t_n depends only on the state at the previous time t_{n-1}. Extending this, an HMM introduces hidden states x_i that are not directly observable; each hidden state emits an observable output y_i. The model parameters are the transition probabilities between hidden states, the emission probabilities from hidden states to observations, the state space of x, the observation space of y, and the initial state distribution.

Example: imagine three gourds (hidden states) each containing good or bad medicine (observations). We randomly pick a gourd, draw a medicine, record its type, then possibly transition to another gourd. The hidden state sequence is the gourd identity; the observation sequence is the medicine type.

Using an HMM, the hidden state space is {gourd1, gourd2, gourd3} and the observation space is {good, bad}. The initial distribution reflects the random first pick, transition probabilities model moving between gourds, and emission probabilities model the chance of drawing good or bad medicine from each gourd.
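As a concrete sketch, the gourd model's parameters can be written out directly. The probability values below are illustrative assumptions, not numbers given in the text:

```python
import numpy as np

# Hidden states: the three gourds; observations: good or bad medicine.
states = ["gourd1", "gourd2", "gourd3"]
observations = ["good", "bad"]

# Initial state distribution pi (assumed uniform random first pick).
pi = np.array([1 / 3, 1 / 3, 1 / 3])

# Transition matrix A: A[i, j] = P(next gourd is j | current gourd is i).
A = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.3, 0.2, 0.5],
])

# Emission matrix B: B[i, k] = P(observation k | gourd i).
B = np.array([
    [0.8, 0.2],  # gourd1 mostly holds good medicine (assumed)
    [0.5, 0.5],
    [0.1, 0.9],  # gourd3 mostly holds bad medicine (assumed)
])

# Each row of A and B is a probability distribution, so rows sum to 1.
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```

Any set of row-stochastic matrices of these shapes, plus a valid pi, fully specifies an HMM over this state and observation space.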

HMMs involve three fundamental problems:

Probability computation: given model parameters, compute the probability of an observation sequence Y (solved by the forward algorithm, or equivalently the backward algorithm).

Decoding: given parameters and Y, find the most likely hidden state sequence X (solved by Viterbi algorithm).

Learning: given Y, estimate parameters that maximize its probability (solved by Baum‑Welch/EM algorithm).
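Decoding is the problem that word segmentation ultimately relies on. A minimal Viterbi sketch in log space (the two-state toy parameters at the bottom are hypothetical, chosen so the expected path is obvious):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden-state path for a sequence of observation indices.

    pi: (N,) initial distribution; A: (N, N) transitions;
    B: (N, M) emissions; obs: list of observation indices.
    Log probabilities avoid numerical underflow on long sequences.
    """
    N, T = len(pi), len(obs)
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta = np.zeros((T, N))            # best log-prob of a path ending in each state
    psi = np.zeros((T, N), dtype=int)   # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[from, to]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Toy check: state 0 strongly emits observation 0, state 1 emits observation 1.
pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
print(viterbi(pi, A, B, [0, 0, 1, 1]))  # → [0, 0, 1, 1]
```

The dynamic program runs in O(T·N²) time, which is what makes decoding tractable compared with enumerating all N^T paths.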

Applying this to Chinese word segmentation, each character is an observation. We label characters with B (begin), E (end), M (middle), S (single). The hidden state space is {B, E, M, S}. Transition constraints can be encoded (e.g., B/M can be followed only by M/E, S/E only by B/S). The observation space consists of all Chinese characters in the corpus.
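The tag constraints above can be written down as an allowed-successor table (disallowed pairs get transition probability 0), and a decoded B/M/E/S sequence maps back to words deterministically. A small sketch (the helper name and example sentence are illustrative):

```python
TAGS = ["B", "M", "E", "S"]

# Allowed successors: B/M can be followed only by M/E; E/S only by B/S.
ALLOWED = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S"},
    "S": {"B", "S"},
}

def tags_to_words(chars, tags):
    """Cut a character sequence into words according to its B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word always ends on E or S
            words.append(current)
            current = ""
    if current:                # tolerate a truncated final word
        words.append(current)
    return words

print(tags_to_words(list("我爱北京"), ["S", "S", "B", "E"]))  # → ['我', '爱', '北京']
```

In practice the constraints are imposed by zeroing the forbidden entries of the transition matrix before (or instead of) estimating them, so Viterbi can never produce an illegal tag sequence.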

Training can be supervised—using a labeled corpus to count transitions and emissions for maximum‑likelihood estimates—or unsupervised—applying Baum‑Welch to learn parameters from raw text.
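The supervised case reduces to counting and normalizing. A minimal sketch over a tiny hand-tagged corpus (the two sentences below are made up for illustration):

```python
from collections import Counter, defaultdict

# Tiny pre-tagged corpus: each sentence is a list of (character, BEMS-tag) pairs.
# Illustrative data, not from a real corpus.
corpus = [
    [("我", "S"), ("爱", "S"), ("北", "B"), ("京", "E")],
    [("北", "B"), ("京", "E"), ("很", "S"), ("大", "S")],
]

trans = defaultdict(Counter)  # trans[prev_tag][next_tag]
emit = defaultdict(Counter)   # emit[tag][char]
init = Counter()              # first tag of each sentence

for sent in corpus:
    init[sent[0][1]] += 1
    for ch, tag in sent:
        emit[tag][ch] += 1
    for (_, prev), (_, nxt) in zip(sent, sent[1:]):
        trans[prev][nxt] += 1

def normalize(counter):
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

# Maximum-likelihood estimates of the HMM parameters.
pi_hat = normalize(init)
A_hat = {tag: normalize(c) for tag, c in trans.items()}
B_hat = {tag: normalize(c) for tag, c in emit.items()}

print(A_hat["B"])  # → {'E': 1.0}: in this toy corpus, B is always followed by E
```

With raw (unlabeled) text instead, the same parameters would be fitted iteratively with Baum-Welch, treating the tags as latent variables; real systems also smooth these counts so unseen characters do not get zero emission probability.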

Machine Learning · Natural Language Processing · Chinese Word Segmentation · Sequence Labeling · Hidden Markov Model
Written by Hulu Beijing

Follow Hulu's official WeChat account for the latest company updates and recruitment information.