
AIGC Tutorial: Tokenization, POS Tagging, and Named Entity Recognition with Transformers, NLTK, and spaCy

This tutorial introduces AIGC concepts and walks through practical implementations of tokenization, part‑of‑speech tagging, and named entity recognition using the Transformers library, NLTK, and spaCy on Google Colab, complete with code snippets and visual results.

Rare Earth Juejin Tech Community

The article begins with a brief introduction to AIGC (Artificial Intelligence Generated Content) and highlights three fundamental NLP tasks—tokenization, part‑of‑speech (POS) tagging, and named entity recognition (NER)—that form the basis for more advanced language processing.

Preparation: Readers are instructed to sign in to Google Colab (https://colab.research.google.com/) to run all examples; Colab provides free GPU resources for small‑scale computations.

1. Tokenization with the Transformers library

The tutorial demonstrates how to install the transformers package, import BertTokenizer, and use the pre‑trained "bert-base-uncased" BERT model to tokenize a sample sentence.

!pip install transformers
from transformers import BertTokenizer

# Load the pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Transformers make natural language processing tasks easy."
# Encode the text into PyTorch tensors of token IDs
encoding = tokenizer(text, return_tensors='pt', padding=True, truncation=True)

# Map the token IDs back to their subword tokens for inspection
token_ids = encoding['input_ids']
decoded_tokens = tokenizer.convert_ids_to_tokens(token_ids[0].tolist())
print("Original text:", text)
print("Tokenization result:", decoded_tokens)

The resulting token IDs and their corresponding words are displayed, illustrating how raw text is transformed into model‑compatible inputs.

2. POS Tagging with NLTK

After tokenization, the tutorial shows how to perform POS tagging using NLTK. It covers importing the library, downloading required data, tokenizing the sentence, and applying pos_tag to obtain grammatical tags.

import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Download the sentence/word tokenizer models and the POS tagger data
# (newer NLTK releases may additionally require 'punkt_tab' and
# 'averaged_perceptron_tagger_eng')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "Natural language processing is a fascinating field."
words = word_tokenize(text)   # split the sentence into word tokens
pos_tags = pos_tag(words)     # tag each token with its part of speech
print("Original text:", text)
print("POS tagging result:", pos_tags)

An optional helper function simplify_pos_tag maps the detailed NLTK tags to more readable categories such as Noun, Verb, and Adjective.
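The article names this helper but does not reproduce its implementation. A minimal sketch of what such a mapping could look like, assuming Penn Treebank tags as returned by pos_tag:

```python
# Hypothetical sketch of the simplify_pos_tag helper (implementation not
# shown in the article): collapse detailed Penn Treebank tags from
# nltk.pos_tag into broad, human-readable categories.
def simplify_pos_tag(tag):
    """Map a Penn Treebank tag to a coarse category name."""
    if tag.startswith('NN'):
        return 'Noun'
    if tag.startswith('VB'):
        return 'Verb'
    if tag.startswith('JJ'):
        return 'Adjective'
    if tag.startswith('RB'):
        return 'Adverb'
    return 'Other'

# Applied to (word, tag) pairs such as those produced by pos_tag:
pairs = [('Natural', 'JJ'), ('language', 'NN'), ('is', 'VBZ')]
print([(word, simplify_pos_tag(tag)) for word, tag in pairs])
```

Prefix matching works here because Penn Treebank uses a shared stem per word class (NN/NNS/NNP for nouns, VB/VBD/VBZ for verbs, and so on).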

3. Named Entity Recognition with spaCy

The final section introduces NER using spaCy. It includes installation commands, loading the English model, processing a sample paragraph about Apple Inc., and printing detected entities with their labels.

!pip install spacy
!python -m spacy download en_core_web_sm

import spacy

# Load the small English pipeline downloaded above
nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. was founded by Steve Jobs and Steve Wozniak in 1976. It is headquartered in Cupertino, California."
doc = nlp(text)
print("Original text:", text)
print("NER results:")
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")

A conversion function convert_label translates spaCy's entity labels (ORG, PERSON, DATE, GPE) into Chinese equivalents (组织, 人名, 日期, 地点) for easier interpretation.
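The exact convert_label implementation is not reproduced in the article; a minimal sketch consistent with the mapping it describes:

```python
# Sketch of the convert_label helper described above (implementation not
# shown in the article): translate spaCy entity labels into Chinese names.
LABEL_MAP = {
    'ORG': '组织',     # organization
    'PERSON': '人名',  # person name
    'DATE': '日期',    # date
    'GPE': '地点',     # geopolitical entity / location
}

def convert_label(label):
    """Return the Chinese equivalent of a spaCy label, or the label itself."""
    return LABEL_MAP.get(label, label)

print(convert_label('ORG'))    # 组织
print(convert_label('MONEY'))  # unmapped labels pass through unchanged
```

Falling back to the original label keeps the output usable for entity types outside the four the article lists.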

Throughout the tutorial, screenshots are provided to illustrate the output of each step.

Conclusion

The guide wraps up by encouraging readers to experiment with the code, discuss results in the comments, and look forward to future AIGC series posts.

Tags: AIGC, NLP, Transformers, NLTK, spaCy, tokenization, POS tagging, named entity recognition
Written by Rare Earth Juejin Tech Community. Juejin is a tech community that helps developers grow.