Fundamentals of NLP: Core Tasks, Tool Setup, and Hands‑On Projects
This article introduces the basics of Natural Language Processing (NLP). It covers the core tasks of language understanding and generation, common applications, the main levels of linguistic analysis, and environment setup with Python libraries. Hands‑on code examples follow for preprocessing, POS tagging, NER, sentiment analysis with both classical and transformer models, and text generation with GPT‑2. It closes with common challenges and Rust‑centric integration strategies.
What is NLP?
Natural Language Processing (NLP) is the discipline that studies how to enable computers to understand, process, and generate human language.
Core Tasks
Language Understanding: sentiment detection, named‑entity extraction, etc.
Language Generation: machine translation, summarisation, conversational agents.
Typical Applications
Voice assistants and translation apps on smartphones.
Spam filtering and keyword matching in search engines.
Intelligent customer‑service chatbots.
NLP Foundations
Basic Concepts
Lexical analysis – splitting text into words or morphemes.
Syntactic analysis – parsing the grammatical structure of sentences.
Semantic analysis – interpreting the meaning of text.
Pragmatic analysis – understanding meaning in specific contexts.
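As a rough illustration of the first level, lexical analysis, the sketch below splits text into sentences and tokens using only regular expressions. The function names and the splitting rules are assumptions chosen for brevity; real pipelines normally rely on a library tokenizer such as the NLTK and SpaCy ones used later in this article, which handle abbreviations, contractions and other edge cases that this sketch ignores.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Hello, world! This is a sample text."
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

print(sentences)
print(tokens)
```

Keeping punctuation as separate tokens mirrors what most library tokenizers do, which makes it easy to filter it out or keep it depending on the downstream task.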
Application Scenarios
Text classification – sentiment analysis, spam detection, news categorisation.
Named‑entity recognition – identifying persons, locations, organisations.
Machine translation – converting text from one language to another.
Question answering – responding to user queries.
Text generation – producing natural‑language sentences.
Environment Setup
Install the required Python libraries:
# Install NLTK
pip install nltk
# Install SpaCy
pip install spacy
# Download SpaCy model
python -m spacy download en_core_web_sm
# Install Transformers
pip install transformers
# Install other utilities (scikit-learn and PyTorch are used by the sentiment-analysis examples)
pip install numpy pandas matplotlib scikit-learn torch
Basic Operations
1. Text Pre‑processing with NLTK
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by word_tokenize in recent NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')
text = "Hello, world! This is a sample text for natural language processing."
words = word_tokenize(text)
print("Tokenisation result:", words)
sentences = sent_tokenize(text)
print("Sentence split result:", sentences)
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in words if w.lower() not in stop_words]
print("After stop‑word removal:", filtered_words)
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(w) for w in filtered_words]
print("Stemmed words:", stemmed_words)
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(w) for w in filtered_words]
print("Lemmatized words:", lemmatized_words)
2. POS Tagging with SpaCy
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("Hello, world! This is a sample text for natural language processing.")
for token in doc:
    print(f"{token.text} - {token.pos_} - {token.dep_}")
3. Named‑Entity Recognition (NER)
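For Portuguese text, one transformer option is BERTimbau Base (also known as bert-base-portuguese-cased), a Portuguese BERT model reported to achieve state‑of‑the‑art results on NER, sentence similarity and textual entailment. For English text, SpaCy's built‑in NER is a simpler starting point.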
Example with SpaCy:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")
Practical Project – Sentiment Analysis
Data Preparation (IMDB dataset)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
# Assumes a local imdb.csv with a 'review' text column and a 'sentiment' label column
data = pd.read_csv('imdb.csv')
X_train, X_test, y_train, y_test = train_test_split(data['review'], data['sentiment'], test_size=0.2, random_state=42)
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
Model Training – Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
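# Raise max_iter above the default of 100 to reduce the chance of a convergence warning
model = LogisticRegression(max_iter=1000)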
model.fit(X_train_vectorized, y_train)
y_pred = model.predict(X_test_vectorized)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))
Deep‑Learning Approach – BERT
from transformers import BertTokenizer, BertForSequenceClassification
import torch
from torch.utils.data import DataLoader, Dataset
class SentimentDataset(Dataset):
    """Tokenises one review on access and returns its tensors."""
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        text = self.texts[idx]
        # Labels must be integer class ids (e.g. 0 or 1); map string labels to ints beforehand
        label = self.labels[idx]
        encoding = self.tokenizer(text, truncation=True, padding='max_length', max_length=self.max_length, return_tensors='pt')
        # squeeze(0) drops the batch dimension added by return_tensors='pt'
        return {'input_ids': encoding['input_ids'].squeeze(0),
                'attention_mask': encoding['attention_mask'].squeeze(0),
                'label': torch.tensor(label)}
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
train_dataset = SentimentDataset(X_train.tolist(), y_train.tolist(), tokenizer, max_length=128)
test_dataset = SentimentDataset(X_test.tolist(), y_test.tolist(), tokenizer, max_length=128)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
# No separate loss function is needed: the model returns a cross-entropy loss when labels are passed
epochs = 3
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f'Epoch {epoch+1}, Loss: {running_loss/len(train_loader):.3f}')
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        _, predicted = torch.max(outputs.logits, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f'Test accuracy: {100 * correct / total:.2f}%')
Practical Project – Text Generation with GPT‑2
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors='pt')
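# Pass the attention mask along with the input ids, and set pad_token_id explicitly because GPT-2 has no pad token
outputs = model.generate(**inputs, max_length=100, num_return_sequences=1, no_repeat_ngram_size=2, do_sample=True, temperature=0.7, pad_token_id=tokenizer.eos_token_id)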
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Opportunities and Challenges
Ambiguity – the same word or sentence can have multiple meanings.
Data sparsity – some linguistic phenomena have very few examples.
Computational complexity – processing long texts is resource‑intensive.
Multilingual support – large grammatical differences across languages.
Solutions
Contextual understanding – use surrounding text to resolve ambiguity.
Data augmentation – generate more training samples.
Model optimisation – adopt more efficient architectures and algorithms.
Multilingual models – employ pretrained models that support many languages.
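The data augmentation idea can be sketched with two simple token-level perturbations: random deletion and random swap. This is a minimal illustration under stated assumptions, not a recommended recipe: the function names, the deletion probability and the example sentence are all made up for this sketch, and only the standard library is used.

```python
import random

def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one token."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

rng = random.Random(0)  # fixed seed so runs are repeatable
tokens = "the movie was surprisingly good".split()
augmented = [random_deletion(tokens, 0.2, rng), random_swap(tokens, 1, rng)]

for variant in augmented:
    print(" ".join(variant))
```

Perturbations like these do not always preserve the label (deleting a negation can flip the sentiment, for example), so augmented samples are worth spot-checking before they are added to a training set.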
Rust‑Centric Perspective
Performance optimisation: implement high‑speed text processing and model inference in Rust; Rust's ownership model prevents memory errors such as use‑after‑free and data races in safe code, without a garbage collector.
Cross‑language integration:
Use PyO3 to call Rust code from Python.
Compile Rust‑based NLP functions to WebAssembly for browser use.
Employ gRPC for communication between Rust services and Python pipelines.
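From the Python side, one common pattern for the PyO3 route is to import the compiled extension when it is available and fall back to pure Python otherwise. The sketch below shows only that pattern: the module name rust_tokenizer and its tokenize function are hypothetical, invented for illustration, and the fallback is a naive regex tokenizer.

```python
import re

try:
    # Hypothetical PyO3-built extension module; the name and its
    # tokenize() function are assumptions made for illustration only.
    from rust_tokenizer import tokenize as fast_tokenize
    BACKEND = "rust"
except ImportError:
    fast_tokenize = None
    BACKEND = "python"

def tokenize(text):
    """Use the Rust implementation when present, else a regex fallback."""
    if fast_tokenize is not None:
        return list(fast_tokenize(text))
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Rust and Python can share one pipeline.")
print(BACKEND, tokens)
```

Keeping a single tokenize function as the entry point means the rest of the pipeline does not need to know which backend is in use, which makes it possible to adopt the Rust implementation incrementally.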
Lisa Notes
Lisa's notes: musings on daily life, work, study, personal growth, and casual reflections.
