Python Techniques for Comprehensive Text Data Analysis
This guide demonstrates how to use Python for end‑to‑end text data analysis: preprocessing, word‑frequency visualization, classification, sentiment analysis, similarity measurement, entity recognition, keyword extraction, summarization, translation, and generation, each with a short code example.
1. Text Preprocessing – Clean and tokenize text, remove stopwords.
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
def preprocess_text(text):
    # Remove special characters and punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize (requires nltk.download('punkt'))
    tokens = word_tokenize(text)
    # Remove stopwords (requires nltk.download('stopwords'))
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
    return filtered_tokens

2. Word Frequency & Word Cloud – Count token frequencies and visualize them.
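If downloading NLTK data is not an option, the same cleanup can be sketched with the standard library alone. The helper name `preprocess_text_simple` and the inline stopword set below are placeholders for illustration, not NLTK's full English list:

```python
import re

def preprocess_text_simple(text, stop_words):
    # Strip punctuation, split on whitespace, drop stopwords
    text = re.sub(r'[^\w\s]', '', text)
    tokens = text.split()
    return [t for t in tokens if t.lower() not in stop_words]

stop_words = {'the', 'is', 'a', 'of', 'and'}  # tiny inline list for illustration
tokens = preprocess_text_simple("The cat is on the mat, and it purrs!", stop_words)
print(tokens)  # ['cat', 'on', 'mat', 'it', 'purrs']
```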
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
def word_frequency(text):
    tokens = preprocess_text(text)
    word_counts = Counter(tokens)
    wordcloud = WordCloud(width=800, height=400).generate_from_frequencies(word_counts)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

3. Text Classification – Use a Naïve Bayes classifier with TF‑IDF features.
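The counting step can be checked on its own before adding the plot; the `Counter` built here is exactly what `generate_from_frequencies` consumes:

```python
from collections import Counter

# Count token frequencies without the visualization step
tokens = ['data', 'text', 'data', 'analysis', 'text', 'data']
word_counts = Counter(tokens)
print(word_counts.most_common(2))  # [('data', 3), ('text', 2)]
```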
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
def text_classification(texts, labels):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
    classifier = MultinomialNB()
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy

4. Sentiment Analysis – Apply NLTK’s SentimentIntensityAnalyzer.
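A quick way to sanity-check this pipeline is a tiny hand-made corpus. The texts and labels below are invented for illustration, and with so little data the resulting accuracy says nothing about real performance:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Toy corpus: enough rows that the test split still contains both classes
texts = ["great product", "loved it", "excellent quality", "happy purchase",
         "terrible service", "awful experience", "broken on arrival", "very bad"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels)
clf = MultinomialNB().fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
```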
from nltk.sentiment import SentimentIntensityAnalyzer
def sentiment_analysis(text):
    # Requires nltk.download('vader_lexicon')
    analyzer = SentimentIntensityAnalyzer()
    sentiment = analyzer.polarity_scores(text)
    return sentiment

5. Text Similarity – Compute cosine similarity between two texts.
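VADER's scores come from a curated lexicon with many refinements. As rough intuition for how a polarity score in [-1, 1] arises, here is a deliberately naive lexicon sketch; the word sets are made up for illustration, and this is not VADER:

```python
# Minimal lexicon-based polarity sketch (illustrative only, NOT VADER)
POSITIVE = {'good', 'great', 'love', 'excellent'}
NEGATIVE = {'bad', 'terrible', 'hate', 'awful'}

def naive_polarity(text):
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    # Mirror the [-1, 1] range of VADER's compound score
    return 0.0 if total == 0 else (pos - neg) / total

print(naive_polarity("great movie I love it"))         # 1.0
print(naive_polarity("terrible plot and bad acting"))  # -1.0
```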
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
def text_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform([text1, text2])
    similarity = cosine_similarity(X[0], X[1])
    return similarity[0][0]

6. Entity Recognition – Use spaCy to extract named entities.
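This function can be exercised directly: identical texts score 1.0, and texts sharing no vocabulary score 0.0:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_similarity(text1, text2):
    # Fit TF-IDF on just the two texts, then compare their vectors
    X = TfidfVectorizer().fit_transform([text1, text2])
    return cosine_similarity(X[0], X[1])[0][0]

same = text_similarity("the quick brown fox", "the quick brown fox")
diff = text_similarity("the quick brown fox", "completely unrelated words here")
print(round(same, 2), round(diff, 2))  # 1.0 0.0
```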
import spacy
def entity_recognition(text):
    # Requires: python -m spacy download en_core_web_sm
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

7. Keyword Extraction – Retrieve top TF‑IDF keywords.
from sklearn.feature_extraction.text import TfidfVectorizer
def keyword_extraction(text):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform([text])
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    top_keywords = [feature_names[idx] for idx in X.toarray().argsort()[0][-5:][::-1]]
    return top_keywords

8. Text Summarization – Generate a summary with the TextRank algorithm via Gensim.
from gensim.summarization import summarize
def text_summarization(text):
    # gensim.summarization was removed in Gensim 4.0; this requires gensim < 4.0
    summary = summarize(text)
    return summary

9. Text Translation – Translate text using the googletrans library (an unofficial Google Translate client).
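Because `gensim.summarization` is unavailable in Gensim 4.x, a dependency-free fallback is a frequency-based extractive summarizer. This sketch is not TextRank, and scoring by summed word frequency inherently favors longer sentences, so treat it as illustrative only:

```python
import re
from collections import Counter

def frequency_summarize(text, n_sentences=1):
    # Score each sentence by summed word frequency; keep top scorers in order
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: -sum(freq[w] for w in re.findall(r'\w+', sentences[i].lower())))
    keep = sorted(ranked[:n_sentences])
    return ' '.join(sentences[i] for i in keep)

text = "Python is popular. Python powers data analysis and Python scripting. Cats sleep."
summary = frequency_summarize(text)
print(summary)  # Python powers data analysis and Python scripting.
```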
from googletrans import Translator
def text_translation(text, target_language):
    # googletrans calls Google Translate's web endpoint, so a network connection is required
    translator = Translator()
    translation = translator.translate(text, dest=target_language)
    return translation.text

10. Text Generation – Produce text with OpenAI’s GPT models.
import openai
def text_generation(prompt):
    openai.api_key = 'your_api_key'
    # Legacy Completions API (openai < 1.0); text-davinci-003 has since been deprecated
    response = openai.Completion.create(
        engine='text-davinci-003',
        prompt=prompt,
        max_tokens=100
    )
    generated_text = response.choices[0].text.strip()
    return generated_text

These examples illustrate practical Python workflows for a wide range of NLP tasks; adapt and combine them according to the specific characteristics of your dataset and analysis goals.
Test Development Learning Exchange