
Basic Natural Language Processing: Text Preprocessing and TF‑IDF with Python

This tutorial introduces two basic natural language processing techniques: text preprocessing (tokenization and stop‑word removal) and TF‑IDF feature extraction. It includes a complete Python example, built on jieba and scikit-learn, that applies both to a small sample dataset of Chinese sentences.

Test Development Learning Exchange

Nov 27, 2024


Goal: Learn basic natural language processing techniques, including text preprocessing and TF‑IDF.

Learning Content: Text preprocessing (tokenization, stop‑word removal) and TF‑IDF calculation.
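
TF-IDF weights a term by how often it occurs in a document and discounts terms that occur in many documents. As a rough sketch of the arithmetic, the function below follows the formula that scikit-learn documents for TfidfVectorizer's default settings: raw term counts, idf(t) = ln((1 + n) / (1 + df(t))) + 1, then L2 normalization of each row. The name `tfidf_by_hand` and the two tiny pre-tokenized documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_by_hand(docs):
    """docs is a list of token lists. Returns (vocab, rows), where each
    row holds the L2-normalized TF-IDF weights of one document."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: in how many documents each token appears
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

# Two hand-tokenized documents (illustrative assumption)
docs = [
    ["文本", "特征", "提取"],
    ["文本", "预处理"],
]
vocab, rows = tfidf_by_hand(docs)
for tok, weight in zip(vocab, rows[0]):
    print(tok, round(weight, 3))
```

In the first document, "文本" appears in both documents and therefore gets a lower weight than "特征" or "提取", which appear in only one.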

Code Example:

import jieba  # Chinese word segmentation
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Create example dataset
data = {
    'text': [
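        # "This is an example sentence for demonstrating text preprocessing."
        "这是一个示例句子，用于展示文本预处理。",
        # "Natural language processing is a very interesting discipline."
        "自然语言处理是一门非常有趣的学科。",
        # "Through tokenization and stop-word removal, we can extract the important features of a text."
        "通过分词和去停用词，我们可以提取文本的重要特征。",
        # "TF-IDF is a commonly used text feature extraction method."
        "TF-IDF 是一种常用的文本特征提取方法。"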
    ],
    'label': [1, 0, 1, 0]
}
df = pd.DataFrame(data)
print(f"Sample dataset:\n{df}")

# Tokenization
def tokenize(text):
    return " ".join(jieba.cut(text))

df['tokenized_text'] = df['text'].apply(tokenize)
print(f"Dataset after tokenization:\n{df}")

# Stop‑word removal
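# Entries must match whole tokens as produced by jieba; inspect the
# tokenized output above and extend this set as needed.
stopwords = {'的', '是', '一', '这', '用', '可以', '我们', '通过'}

def remove_stopwords(text):
    # Keep every space-separated token that is not a stop word
    return " ".join(word for word in text.split() if word not in stopwords)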

df['cleaned_text'] = df['tokenized_text'].apply(remove_stopwords)
print(f"Dataset after stop-word removal:\n{df}")

# TF‑IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['cleaned_text'])
print(f"Shape of the TF-IDF matrix: {tfidf_matrix.shape}")
feature_names = tfidf_vectorizer.get_feature_names_out()
print(f"Feature names: {feature_names}")
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
print(f"TF-IDF matrix:\n{tfidf_df}")
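
One detail in the TF-IDF step is easy to miss. TfidfVectorizer's documented default token_pattern is (?u)\b\w\w+\b, which keeps only tokens of two or more word characters, so single-character words and punctuation never reach the vocabulary. The sketch below uses only Python's re module to show the effect on an already segmented sentence. The sample string and variable names are illustrative assumptions, not jieba output.

```python
import re

# Space-separated tokens, written by hand for illustration
# (an assumption, not actual jieba output).
segmented = "自然 语言 处理 是 一 门 学科 。"

# Default pattern of TfidfVectorizer: two or more word characters.
default_pattern = re.compile(r"(?u)\b\w\w+\b")
# Alternative pattern that also keeps single-character tokens.
relaxed_pattern = re.compile(r"(?u)\b\w+\b")

kept_by_default = default_pattern.findall(segmented)
kept_when_relaxed = relaxed_pattern.findall(segmented)

print(kept_by_default)    # single-character tokens and punctuation are gone
print(kept_when_relaxed)  # single-character tokens kept, punctuation still gone
```

If single-character words matter for your data, one option is to pass token_pattern=r"(?u)\b\w+\b" to TfidfVectorizer and rely on the stop-word list to remove the uninformative ones.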

Practice: Run the code above, then apply the same preprocessing and TF‑IDF steps to a small text dataset of your own.
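
For the exercise, it can help to wrap the two preprocessing steps in one function. The following is a minimal sketch, not the article's code: `preprocess` and its `tokenizer` parameter are hypothetical names, and `str.split` stands in for `jieba.cut` so that the example runs with the standard library alone. It also drops tokens that contain no word characters, which removes punctuation such as "，" and "。" that a stop-word list may not cover.

```python
import re
from typing import Callable, Iterable

# Matches tokens with at least one word character
# (letters, digits, underscore; CJK characters count).
HAS_WORD_CHAR = re.compile(r"\w")

def preprocess(
    text: str,
    stopwords: set[str],
    tokenizer: Callable[[str], Iterable[str]] = str.split,
) -> str:
    """Tokenize text, drop stop words and punctuation-only tokens,
    and return the remaining tokens joined by single spaces."""
    tokens = tokenizer(text)
    kept = [
        tok for tok in tokens
        if tok not in stopwords and HAS_WORD_CHAR.search(tok)
    ]
    return " ".join(kept)

# Hand-segmented input (an assumption); with jieba installed you could pass
# tokenizer=jieba.cut and give the function raw, unsegmented text instead.
stopwords = {"的", "是", "一种"}
sample = "TF-IDF 是 一种 常用 的 文本 特征 提取 方法 。"
print(preprocess(sample, stopwords))
```

The returned string has the same space-separated shape as the cleaned_text column, so it can be passed to TfidfVectorizer in the same way.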

Summary: After completing the exercise, you should be able to perform tokenization, stop‑word removal, and TF‑IDF calculation, which are fundamental techniques in natural language processing and can be applied to real projects.

