
Sequence Labeling in Natural Language Processing: Definitions, Tag Schemes, Model Choices, and Practical Implementation

This article provides a comprehensive overview of sequence labeling tasks in NLP, covering their definition, common tag schemes (BIO, BIEO, BIESO), comparisons with other NLP tasks, major modeling approaches such as HMM, CRF, RNN and BERT, real‑world applications like POS tagging, NER, event extraction and gene analysis, and a step‑by‑step PyTorch implementation with dataset preparation, training pipeline, and evaluation metrics.


Sequence labeling is a core NLP task that assigns a label to each element of an input token sequence, producing an output label sequence of the same length. It is essential for applications such as named entity recognition (NER), part‑of‑speech (POS) tagging, event element extraction, and even gene sequence analysis.

Tag Schemes – The most widely used labeling schemes are BIO (Begin, Inside, Outside), BIEO (Begin, Inside, End, Outside), and BIESO (Begin, Inside, End, Single, Outside). BIO marks only the start and continuation of an entity, BIEO adds an explicit end tag, and BIESO further introduces a Single tag for entities consisting of one token, improving the model's ability to discriminate short entities.
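The mapping between these schemes is mechanical. The sketch below (tokens and tag inventory are invented for illustration) converts a BIO sequence into BIESO by looking one tag ahead:

```python
# "Barack Obama" is a two-token PER entity; "Hawaii" is a one-token LOC entity.
tokens = ["Barack", "Obama", "visited", "Hawaii", "."]
bio = ["B-PER", "I-PER", "O", "B-LOC", "O"]

def bio_to_bieso(tags):
    """Convert a BIO tag sequence to BIESO by peeking at the next tag."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("B-"):
            # B- becomes S- when the entity ends immediately (no I- follows).
            out.append(("B-" if nxt.startswith("I-") else "S-") + tag[2:])
        elif tag.startswith("I-"):
            # I- becomes E- on the entity's last token.
            out.append(("I-" if nxt.startswith("I-") else "E-") + tag[2:])
        else:
            out.append(tag)
    return out

print(bio_to_bieso(bio))  # ['B-PER', 'E-PER', 'O', 'S-LOC', 'O']
```

Note that the two-token entity keeps its B- tag and gains an E-, while the single-token entity collapses to S-, which is exactly the extra discrimination BIESO provides.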

Model Choices – Traditional statistical models include Hidden Markov Models (HMM) and Conditional Random Fields (CRF). An HMM models the joint probability of observations and hidden states, while a CRF is a discriminative model that scores feature functions over the whole sequence. Deep learning approaches use recurrent neural networks (RNNs) such as LSTM/GRU, which capture long-range context, and BERT, whose pre-trained contextual embeddings transfer well to labeling tasks. Combining BERT with a BiLSTM-CRF head yields state-of-the-art performance on NER.
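To make the RNN option concrete, here is a minimal BiLSTM tagger sketch in PyTorch. All hyperparameters are arbitrary, and the CRF layer discussed above is omitted for brevity; it would normally be stacked on top of the per-token emission scores this model produces:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal BiLSTM sequence tagger: embed -> BiLSTM -> per-token logits."""
    def __init__(self, vocab_size, tagset_size, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # Bidirectional LSTM concatenates forward and backward states.
        self.fc = nn.Linear(2 * hidden_dim, tagset_size)

    def forward(self, token_ids):            # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))
        return self.fc(h)                    # (batch, seq_len, tagset_size)

model = BiLSTMTagger(vocab_size=100, tagset_size=7)
logits = model(torch.randint(1, 100, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 7])
```

The output gives one score per tag per token, so a cross-entropy loss (or a CRF negative log-likelihood) can be applied position-wise across the whole sequence.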

Comparison with Other NLP Tasks – Unlike text classification (single output) and machine translation or dialogue generation (variable‑length output), sequence labeling requires a label for every input token, making it more sensitive to local context and less dependent on external knowledge.

Applications – POS tagging assigns grammatical categories to each word; NER identifies person, location, and organization names; event extraction extracts structured event arguments; gene sequence analysis labels biological sequences for downstream analysis.

Practical Implementation – The article demonstrates a full PyTorch workflow using a Chinese NER dataset derived from the 1998 People’s Daily corpus. Data preprocessing converts the original segmentation into BIO tags, handling special cases for person and organization names. The dataset is split into training (70%), validation (21%), and test (9%) sets, with the vocabulary built only on the training split. Training uses TorchText’s BucketIterator to create length-sorted batches, reducing padding overhead. The training loop prints the average loss every 10 epochs and moves data to the GPU when available. Evaluation computes token-level accuracy, recall, and F1, with safeguards against division by zero.
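The 70/21/9 proportions arise from a 70/30 train/holdout split followed by a second 70/30 split of the holdout into validation and test. A sketch in plain Python (the helper name and seed are illustrative, not from the original code):

```python
import random

def split_dataset(examples, seed=42):
    """Shuffle and split into 70% train / 21% validation / 9% test.

    The 21/9 comes from splitting the remaining 30% again at 70/30,
    matching the proportions described in the article.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.7)
    n_val = int(n * 0.21)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 70 21 9
```

Building the vocabulary only on the training split, as the article does, avoids leaking validation and test tokens into the model.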
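A token-level evaluation with the mentioned division-by-zero safeguards might look like the following sketch. The function name and the choice to score only non-O tags are assumptions; the article's exact metric code is not reproduced here:

```python
def token_prf(gold, pred, outside="O"):
    """Token-level precision/recall/F1 over non-O tags,
    returning 0.0 instead of dividing by zero on empty counts."""
    assert len(gold) == len(pred)
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != outside)
    n_pred = sum(1 for p in pred if p != outside)  # predicted entity tokens
    n_gold = sum(1 for g in gold if g != outside)  # gold entity tokens
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "O",     "O", "B-LOC", "O"]
p, r, f = token_prf(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 1.0 0.67 0.8
```

Token-level scores are optimistic compared with entity-level (span) scores, since partially recognized entities still earn credit for their matching tokens.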

Key Takeaways – Selecting appropriate tag schemes and models depends on data size, computational resources, and latency requirements. While BERT offers superior accuracy, it demands more training data and inference time; simpler models like CRF or BiLSTM‑CRF may be preferable in resource‑constrained settings.

Tags: NLP, BERT, Named entity recognition, RNN, HMM, CRF, Sequence labeling
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
