
AIGC Tutorial: Tokenization, POS Tagging, and Named Entity Recognition with Transformers, NLTK, and spaCy

This tutorial introduces AIGC concepts and walks through practical implementations of tokenization, part‑of‑speech tagging, and named entity recognition using the Transformers library, NLTK, and spaCy on Google Colab, complete with code snippets and visual results.

Rare Earth Juejin Tech Community

The article begins with a brief introduction to AIGC (Artificial Intelligence Generated Content) and highlights three fundamental NLP tasks—tokenization, part‑of‑speech (POS) tagging, and named entity recognition (NER)—that form the basis for more advanced language processing.

Preparation: Readers are instructed to sign in to Google Colab (https://colab.research.google.com/) to run all examples; Colab provides free GPU resources for small‑scale computations.

1. Tokenization with the Transformers library

The tutorial demonstrates how to install the transformers package, import BertTokenizer, and use the pre‑trained "bert-base-uncased" BERT model to tokenize a sample sentence.

!pip install transformers
from transformers import BertTokenizer

# Load the pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Transformers make natural language processing tasks easy."
# Encode the text into PyTorch tensors of token IDs
encoding = tokenizer(text, return_tensors='pt', padding=True, truncation=True)

# Map the token IDs back to their subword tokens for inspection
token_ids = encoding['input_ids']
decoded_tokens = tokenizer.convert_ids_to_tokens(token_ids[0].tolist())
print("Original text:", text)
print("Tokenization result:", decoded_tokens)

The resulting token IDs and their corresponding words are displayed, illustrating how raw text is transformed into model‑compatible inputs.

2. POS Tagging with NLTK

After tokenization, the tutorial shows how to perform POS tagging using NLTK. It covers importing the library, downloading required data, tokenizing the sentence, and applying pos_tag to obtain grammatical tags.

import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Download the sentence/word tokenizer models and the POS tagger data
# (newer NLTK releases may additionally require 'punkt_tab' and
# 'averaged_perceptron_tagger_eng')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "Natural language processing is a fascinating field."
words = word_tokenize(text)   # split the sentence into word tokens
pos_tags = pos_tag(words)     # tag each token with its part of speech
print("Original text:", text)
print("POS tagging result:", pos_tags)

An optional helper function simplify_pos_tag maps the detailed NLTK tags to more readable categories such as Noun, Verb, and Adjective.
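The article names this helper but does not reproduce its implementation. A minimal sketch of what such a mapping could look like, assuming Penn Treebank tags as returned by pos_tag:

```python
# Hypothetical sketch of the simplify_pos_tag helper (implementation not
# shown in the article): collapse detailed Penn Treebank tags from
# nltk.pos_tag into broad, human-readable categories.
def simplify_pos_tag(tag):
    """Map a Penn Treebank tag to a coarse category name."""
    if tag.startswith('NN'):
        return 'Noun'
    if tag.startswith('VB'):
        return 'Verb'
    if tag.startswith('JJ'):
        return 'Adjective'
    if tag.startswith('RB'):
        return 'Adverb'
    return 'Other'

# Applied to (word, tag) pairs such as those produced by pos_tag:
pairs = [('Natural', 'JJ'), ('language', 'NN'), ('is', 'VBZ')]
print([(word, simplify_pos_tag(tag)) for word, tag in pairs])
```

Prefix matching works here because Penn Treebank uses a shared stem per word class (NN/NNS/NNP for nouns, VB/VBD/VBZ for verbs, and so on).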

3. Named Entity Recognition with spaCy

The final section introduces NER using spaCy. It includes installation commands, loading the English model, processing a sample paragraph about Apple Inc., and printing detected entities with their labels.

!pip install spacy
!python -m spacy download en_core_web_sm

import spacy

# Load the small English pipeline downloaded above
nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. was founded by Steve Jobs and Steve Wozniak in 1976. It is headquartered in Cupertino, California."
doc = nlp(text)
print("Original text:", text)
print("NER results:")
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")

A conversion function convert_label translates spaCy's entity labels (ORG, PERSON, DATE, GPE) into Chinese equivalents (组织, 人名, 日期, 地点) for easier interpretation.
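The exact convert_label implementation is not reproduced in the article; a minimal sketch consistent with the mapping it describes:

```python
# Sketch of the convert_label helper described above (implementation not
# shown in the article): translate spaCy entity labels into Chinese names.
LABEL_MAP = {
    'ORG': '组织',     # organization
    'PERSON': '人名',  # person name
    'DATE': '日期',    # date
    'GPE': '地点',     # geopolitical entity / location
}

def convert_label(label):
    """Return the Chinese equivalent of a spaCy label, or the label itself."""
    return LABEL_MAP.get(label, label)

print(convert_label('ORG'))    # 组织
print(convert_label('MONEY'))  # unmapped labels pass through unchanged
```

Falling back to the original label keeps the output usable for entity types outside the four the article lists.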

Throughout the tutorial, screenshots are provided to illustrate the output of each step.

Conclusion

The guide wraps up by encouraging readers to experiment with the code, discuss results in the comments, and look forward to future AIGC series posts.

Tags: AIGC, NLP, Transformers, NLTK, spaCy, tokenization, POS tagging, named entity recognition
Written by Rare Earth Juejin Tech Community. Juejin is a tech community that helps developers grow.