Basic Natural Language Processing: Text Preprocessing and TF‑IDF with Python
This tutorial introduces two basic natural language processing techniques: text preprocessing (tokenization and stop‑word removal) and TF‑IDF feature extraction. It includes a complete Python example that runs both steps on a small sample dataset of Chinese sentences.
Goal: Learn basic natural language processing techniques, including text preprocessing and TF‑IDF.
Learning Content: Text preprocessing (tokenization, stop‑word removal) and TF‑IDF calculation.
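To make the TF-IDF calculation concrete, here is a minimal sketch that computes the weights by hand on documents that are already tokenized. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document's row, which is the default behavior of scikit-learn's TfidfVectorizer. The function name tfidf_by_hand and the two toy documents are illustrative only and are not part of the tutorial's code.

```python
import math
from collections import Counter


def tfidf_by_hand(docs):
    """Return (vocabulary, rows); rows[i][j] is the weight of vocabulary[j] in docs[i].

    docs is a list of token lists. Assumes smoothed IDF and L2 row normalization.
    """
    n_docs = len(docs)
    vocabulary = sorted({token for doc in docs for token in doc})
    # Document frequency: in how many documents each term appears at least once
    doc_freq = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for doc in docs:
        term_counts = Counter(doc)  # raw term frequency within this document
        raw = [term_counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw))
        rows.append([w / norm if norm else 0.0 for w in raw])
    return vocabulary, rows


# Toy, hand-tokenized documents (an assumption for illustration)
docs = [
    ["文本", "特征", "提取"],
    ["文本", "预处理"],
]
vocabulary, rows = tfidf_by_hand(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the first document, the term that appears in both documents ("文本") receives a lower weight than a term unique to that document ("特征"), which is the effect the IDF factor is meant to produce.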
Code Example:
import pandas as pd
import jieba  # Chinese word segmentation
from sklearn.feature_extraction.text import TfidfVectorizer
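
# Create an example dataset (Chinese sentences, since jieba segments Chinese text)
data = {
    'text': [
        # "This is an example sentence used to demonstrate text preprocessing."
        "这是一个示例句子,用于展示文本预处理。",
        # "Natural language processing is a very interesting discipline."
        "自然语言处理是一门非常有趣的学科。",
        # "Through tokenization and stop-word removal, we can extract the important features of a text."
        "通过分词和去停用词,我们可以提取文本的重要特征。",
        # "TF-IDF is a commonly used text feature extraction method."
        "TF-IDF 是一种常用的文本特征提取方法。"
    ],
    'label': [1, 0, 1, 0]
}
df = pd.DataFrame(data)
print(f"Example dataset:\n{df}")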

# Tokenization: jieba.cut yields the segmented words; join them with spaces
def tokenize(text):
    return " ".join(jieba.cut(text))

df['tokenized_text'] = df['text'].apply(tokenize)
print(f"Dataset after tokenization:\n{df}")
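
# Stop-word removal: drop every token that exactly matches an entry in the set.
# Matching is per whole token, so '这' removes the token '这' but not a longer
# token such as '这是'.
stopwords = {'的', '是', '一', '这', '用', '可以', '我们', '通过'}

def remove_stopwords(text):
    return " ".join(word for word in text.split() if word not in stopwords)

df['cleaned_text'] = df['tokenized_text'].apply(remove_stopwords)
print(f"Dataset after stop-word removal:\n{df}")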
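
# TF-IDF: fit the vectorizer on the cleaned, space-separated text.
# Note: the default token_pattern keeps only tokens of two or more characters,
# so single-character words are dropped from the vocabulary.
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['cleaned_text'])
print(f"Shape of the TF-IDF matrix: {tfidf_matrix.shape}")
feature_names = tfidf_vectorizer.get_feature_names_out()
print(f"Feature names: {feature_names}")
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
print(f"TF-IDF matrix:\n{tfidf_df}")

Practice: Apply the same preprocessing and TF‑IDF computation to a sample text dataset of your own, using the code above as a starting point.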
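Two details of the code above are easy to miss when you adapt it to your own data, and the sketch below illustrates both using only the standard library. First, stop-word filtering compares whole tokens, so a one-character stop word does not remove a longer token that contains it. Second, TfidfVectorizer's default token_pattern is r"(?u)\b\w\w+\b", which keeps only tokens of two or more word characters. The token list here is written by hand for illustration and is not claimed to be jieba's actual output.

```python
import re

# Hand-written tokens for illustration; jieba's real segmentation may differ.
tokens = ["这是", "一个", "示例", "句子", "的", "和", "文本"]
stopwords = {"的", "是", "一", "这"}

# Whole-token matching: "的" is removed, but "这是" and "一个" survive
kept = [t for t in tokens if t not in stopwords]
print(kept)

text = " ".join(kept)
default_pattern = re.compile(r"(?u)\b\w\w+\b")  # TfidfVectorizer's default token_pattern
relaxed_pattern = re.compile(r"(?u)\b\w+\b")    # also keeps one-character tokens

print(default_pattern.findall(text))  # the one-character token "和" is dropped
print(relaxed_pattern.findall(text))  # "和" is kept
```

If single-character words matter for your data, one common option is to pass the relaxed pattern to the vectorizer, as in TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"), and to extend the stop-word set so that it lists the tokens the segmenter actually produces.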
Summary: After completing the exercise, you should be able to perform tokenization, stop‑word removal, and TF‑IDF calculation, which are fundamental techniques in natural language processing and can be applied to real projects.
Test Development Learning Exchange