AIGC Tutorial: Tokenization, POS Tagging, and Named Entity Recognition with Transformers, NLTK, and spaCy
This tutorial introduces AIGC concepts and walks through practical implementations of tokenization, part‑of‑speech tagging, and named entity recognition using the Transformers library, NLTK, and spaCy on Google Colab, complete with code snippets and visual results.
The article begins with a brief introduction to AIGC (Artificial Intelligence Generated Content) and highlights three fundamental NLP tasks—tokenization, part‑of‑speech (POS) tagging, and named entity recognition (NER)—that form the basis for more advanced language processing.
Preparation: Readers are instructed to sign in to Google Colab (https://colab.research.google.com/), which provides free GPU resources for small‑scale computations, and to run all examples there.
1. Tokenization with the Transformers library
The tutorial demonstrates how to install the transformers package, import BertTokenizer, and use the tokenizer of a pre‑trained BERT model ("bert-base-uncased") to tokenize a sample sentence.
!pip install transformers
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Transformers make natural language processing tasks easy."
encoding = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
tokens = encoding['input_ids']
decoded_tokens = tokenizer.convert_ids_to_tokens(tokens[0].tolist())
print("Original text:", text)
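print("Tokenization result:", decoded_tokens)
The token IDs are converted back to their token strings and printed, illustrating how raw text is transformed into model‑compatible inputs. Because the tokenizer adds special tokens, the printed list starts with [CLS] and ends with [SEP].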
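To make the subword step more concrete, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers such as BERT's perform. The vocabulary TOY_VOCAB and the helpers wordpiece_split and toy_tokenize are hypothetical and exist only for illustration; the real bert-base-uncased vocabulary is far larger, so its output for the same sentence may differ.

```python
# Toy illustration of WordPiece-style greedy longest-match-first splitting.
# TOY_VOCAB is a hypothetical vocabulary, NOT the real bert-base-uncased one.
TOY_VOCAB = {"[UNK]", "transform", "##ers", "make", "token", "##ization", "easy", "."}


def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into subword pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text, vocab):
    """Lowercase, split on whitespace, peel off a trailing period, then split each word."""
    tokens = ["[CLS]"]
    for raw in text.lower().split():
        word, dot = (raw[:-1], ".") if raw.endswith(".") else (raw, "")
        if word:
            tokens.extend(wordpiece_split(word, vocab))
        if dot:
            tokens.append(dot)
    tokens.append("[SEP]")
    return tokens


print(toy_tokenize("Transformers make tokenization easy.", TOY_VOCAB))
```

The sketch deliberately ignores details such as full punctuation handling and maximum word length; it is meant only to show why a single word can become several tokens.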
2. POS Tagging with NLTK
After tokenization, the tutorial shows how to perform POS tagging using NLTK. It covers importing the library, downloading required data, tokenizing the sentence, and applying pos_tag to obtain grammatical tags.
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
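nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Note: NLTK 3.9 and later look for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead;
# download those resources if a LookupError names them.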
text = "Natural language processing is a fascinating field."
words = word_tokenize(text)
pos_tags = pos_tag(words)
print("Original text:", text)
print("POS tagging result:", pos_tags)
An optional helper function simplify_pos_tag maps the detailed NLTK tags to more readable categories such as Noun, Verb, and Adjective.
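The body of simplify_pos_tag is not reproduced here, so the following is only a sketch of what such a helper might look like. It assumes the Penn Treebank tags that NLTK's default English tagger returns; the prefix table PREFIX_CATEGORIES, the "Other" fallback, and the hand-written sample pairs are all assumptions made for illustration.

```python
# Hypothetical sketch of simplify_pos_tag; the original implementation is not shown.
# Maps Penn Treebank tag prefixes to coarse, readable categories.
PREFIX_CATEGORIES = [
    ("NN", "Noun"),        # NN, NNS, NNP, NNPS
    ("VB", "Verb"),        # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "Adjective"),   # JJ, JJR, JJS
    ("RB", "Adverb"),      # RB, RBR, RBS
    ("PRP", "Pronoun"),    # PRP, PRP$
    ("DT", "Determiner"),
    ("IN", "Preposition"),
]


def simplify_pos_tag(tag):
    """Return a coarse category for a Penn Treebank tag, or 'Other' if unmapped."""
    for prefix, category in PREFIX_CATEGORIES:
        if tag.startswith(prefix):
            return category
    return "Other"


# Hand-written (word, tag) pairs in the format pos_tag returns;
# the actual tagger output for a given sentence may differ.
sample_tags = [("language", "NN"), ("is", "VBZ"), ("a", "DT"), ("fascinating", "JJ"), (".", ".")]
for word, tag in sample_tags:
    print(f"{word}: {tag} -> {simplify_pos_tag(tag)}")
```

In the tutorial's flow, the same function could be applied to each pair in pos_tags in place of the hand-written sample.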
3. Named Entity Recognition with spaCy
The final section introduces NER using spaCy. It includes installation commands, loading the English model, processing a sample paragraph about Apple Inc., and printing detected entities with their labels.
!pip install spacy
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. was founded by Steve Jobs and Steve Wozniak in 1976. It is headquartered in Cupertino, California."
doc = nlp(text)
print("Original text:", text)
print("NER results:")
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")
A conversion function convert_label translates spaCy's entity labels (ORG, PERSON, DATE, GPE) into Chinese equivalents (组织, 人名, 日期, 地点) for easier interpretation.
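The code for convert_label is likewise not reproduced, so here is a minimal sketch built from the four label pairs named above. The dictionary name LABEL_TRANSLATIONS, the choice to return unmapped labels unchanged, and the hand-written sample entities are assumptions; spaCy's actual output for the sample paragraph may differ.

```python
# Hypothetical sketch of convert_label; only the four label pairs come from the tutorial.
LABEL_TRANSLATIONS = {
    "ORG": "组织",     # organization
    "PERSON": "人名",  # person name
    "DATE": "日期",    # date
    "GPE": "地点",     # geopolitical entity, rendered here as "place"
}


def convert_label(label):
    """Translate a spaCy entity label; unmapped labels are returned unchanged (an assumption)."""
    return LABEL_TRANSLATIONS.get(label, label)


# Hand-written (text, label) pairs standing in for (ent.text, ent.label_).
sample_entities = [("Apple Inc.", "ORG"), ("Steve Jobs", "PERSON"), ("1976", "DATE"), ("Cupertino", "GPE")]
for entity_text, label in sample_entities:
    print(f"{entity_text} - {convert_label(label)}")
```

With a loaded spaCy pipeline, the same call would replace ent.label_ in the loop, for example convert_label(ent.label_).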
Throughout the tutorial, screenshots are provided to illustrate the output of each step.
Conclusion
The guide wraps up by encouraging readers to experiment with the code, discuss results in the comments, and look forward to future AIGC series posts.
Rare Earth Juejin Tech Community
Juejin, a tech community that helps developers grow.