Common Data Preprocessing Techniques with Python Code Examples
This article presents ten essential data preprocessing methods—including handling missing values, type conversion, standardization, encoding, smoothing, outlier treatment, text cleaning, word frequency counting, sentiment analysis, and topic modeling—each explained with clear Python code snippets.
Data preprocessing is a crucial step in data analysis, helping to clean, transform, and prepare data for subsequent analysis.
1. Missing Value Handling – Handling missing values by filling or dropping them prevents errors or bias during analysis.
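Besides filling with a constant or dropping rows, a common variant fills each column with its own mean; a minimal sketch (the toy frame below mirrors this section's data and is only illustrative):

```python
import pandas as pd

# Toy frame with gaps (assumed for illustration)
df = pd.DataFrame({'A': [1.0, 2.0, None, 4.0], 'B': [5.0, None, 7.0, 8.0]})

# fillna accepts a Series, so each column is filled with its own mean
df_mean_filled = df.fillna(df.mean())
print(df_mean_filled)
```

Mean-filling preserves each column's average, which a constant fill like 0 can distort.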
import pandas as pd
# Sample dataset with missing values
data = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})
# Fill missing values with a specific value (e.g., 0)
data_filled = data.fillna(0)
# Drop rows containing missing values
data_dropped = data.dropna()
print("Filled data:")
print(data_filled)
print("Dropped data:")
print(data_dropped)

2. Data Type Conversion – Converting data types to the correct format (e.g., strings to numeric) ensures accuracy and consistency.
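astype raises on values it cannot parse; when the strings may be dirty, pd.to_numeric with errors='coerce' is a more defensive option (the 'n/a' entry below is an assumed example):

```python
import pandas as pd

# A column of string numbers with one unparseable entry (assumed example)
s = pd.Series(['1', '2', 'n/a', '4'])

# errors='coerce' turns unparseable values into NaN instead of raising
converted = pd.to_numeric(s, errors='coerce')
print(converted)
```

The NaN markers can then be handled with the missing-value techniques from the previous section.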
import pandas as pd
# Sample dataset with string numbers
data = pd.DataFrame({'A': ['1', '2', '3', '4'], 'B': ['5', '6', '7', '8']})
# Convert columns to integers
data['A'] = data['A'].astype(int)
data['B'] = data['B'].astype(int)
print("Data types after conversion:")
print(data.dtypes)

3. Data Standardization – Standardizing data gives all features the same scale and distribution, eliminating unit differences.
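Standardization (zero mean, unit variance) is one choice of scaling; min-max scaling to the [0, 1] range is another common one, sketched here on the same toy data:

```python
from sklearn.preprocessing import MinMaxScaler

# Same toy data as the standardization example
data = [[1, 2], [2, 4], [3, 6], [4, 8]]

# Rescale each feature column to the [0, 1] range
scaler = MinMaxScaler()
data_minmax = scaler.fit_transform(data)
print(data_minmax)
```

Min-max scaling preserves the shape of each feature's distribution but is more sensitive to extreme values than standardization.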
from sklearn.preprocessing import StandardScaler
# Sample data to be standardized
data = [[1, 2], [2, 4], [3, 6], [4, 8]]
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
print("Standardized data:")
print(data_scaled)

4. Feature Encoding – Encoding categorical variables into numeric form enables their use in modeling and analysis.
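LabelEncoder assigns arbitrary integer codes, which some models misread as a ranking; one-hot encoding via pd.get_dummies avoids that by creating one indicator column per category (a sketch on the same toy column):

```python
import pandas as pd

data = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})

# One indicator column per category, with no implied ordering
one_hot = pd.get_dummies(data, columns=['Color'])
print(one_hot)
```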
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Sample dataset with a categorical column
data = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})
encoder = LabelEncoder()
data['Color_encoded'] = encoder.fit_transform(data['Color'])
print("Data after encoding:")
print(data)

5. Data Smoothing – Applying smoothing (e.g., moving average) reduces noise and outliers, making trends easier to detect.
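A moving average weights every point in its window equally; exponential smoothing weights recent points more heavily, which tracks trends with less lag. A sketch on the same toy series (span=3 is an assumed setting):

```python
import pandas as pd

data = pd.DataFrame({'Value': [10, 20, 30, 200, 40]})

# Exponentially weighted mean: recent values dominate older ones
data['EWM'] = data['Value'].ewm(span=3, adjust=False).mean()
print(data)
```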
import pandas as pd
# Sample data with a sudden spike
data = pd.DataFrame({'Value': [10, 20, 30, 200, 40]})
# Apply moving average smoothing with a window of 3
data['Smoothed'] = data['Value'].rolling(window=3, min_periods=1).mean()
print("Smoothed data:")
print(data)

6. Outlier Handling – Detecting and correcting outliers prevents them from disproportionately influencing analysis results.
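The 3-sigma rule assumes roughly normal data, and the outlier itself inflates the mean and standard deviation used to detect it; the interquartile-range (IQR) rule is a robust alternative. A sketch on the same toy series (the 1.5 multiplier is the conventional choice):

```python
import pandas as pd

data = pd.DataFrame({'Value': [10, 20, 30, 200, 40]})

# Quartiles are barely affected by a single extreme value
q1 = data['Value'].quantile(0.25)
q3 = data['Value'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the whiskers
mask = (data['Value'] < lower) | (data['Value'] > upper)
print(data[mask])
```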
import pandas as pd
import numpy as np
# Sample data containing an outlier
data = pd.DataFrame({'Value': [10, 20, 30, 200, 40]})
mean = np.mean(data['Value'])
std = np.std(data['Value'])
threshold = 3 * std
# Identify outliers
outliers = data[data['Value'] > mean + threshold]
# Replace outliers with the mean value
data.loc[data['Value'] > mean + threshold, 'Value'] = mean
print("Data after outlier handling:")
print(data)

7. Text Cleaning – Removing special characters, stop words, and other noise improves the quality of textual data for analysis.
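Beyond stripping special characters, cleaning usually lowercases the text and drops stop words; a self-contained sketch (the tiny stop-word set here is an assumption — real pipelines use a library list such as NLTK's):

```python
import re

text = "This is an example sentence with special characters!@#$"

# Tiny illustrative stop-word set (assumed; not a real library list)
stop_words = {'this', 'is', 'an', 'with'}

# Strip non-alphanumerics, lowercase, then drop stop words
cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', text).lower()
tokens = [w for w in cleaned.split() if w not in stop_words]
print(tokens)
```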
import re
text = "This is an example sentence with special characters!@#$"
# Remove special characters
cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
print("Cleaned text:")
print(cleaned_text)

8. Word Frequency Counting – Counting word frequencies helps identify key terms and popular topics in a text corpus.
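Splitting on whitespace alone treats "example" and "example." as different words; lowercasing and stripping punctuation first gives cleaner counts. A sketch on the same two sentences:

```python
import re
from collections import Counter

text_data = ["This is an example sentence.", "Another sentence for example."]

# Normalize each sentence before tokenizing
words = []
for sentence in text_data:
    words.extend(re.sub(r'[^a-z0-9\s]', '', sentence.lower()).split())

word_counts = Counter(words)
print(word_counts.most_common(3))
```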
from collections import Counter
# Sample list of sentences
text_data = ["This is an example sentence.", "Another sentence for example."]
words = [word for sentence in text_data for word in sentence.split()]
word_counts = Counter(words)
print("Word frequency results:")
print(word_counts)

9. Sentiment Analysis – Analyzing sentiment (positive, negative, neutral) reveals emotional attitudes toward products, services, or events.
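TextBlob, used in this section, needs a separate install; the underlying idea can be sketched with a toy lexicon-based scorer. The word lists here are assumptions, purely for illustration — not a substitute for a real sentiment library:

```python
# Toy positive/negative word lists (assumed, for illustration only)
positive = {'love', 'great', 'excellent'}
negative = {'terrible', 'awful', 'bad'}

def lexicon_score(text):
    """Count positive hits minus negative hits in a lowercased token list."""
    tokens = text.lower().replace('!', ' ').replace('.', ' ').split()
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)

for text in ["I love this product!", "This movie is terrible.",
             "The customer service was average."]:
    print(text, '->', lexicon_score(text))
```

A positive score suggests positive sentiment, a negative score negative sentiment, and zero is neutral — the same threshold logic as the polarity check below.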
from textblob import TextBlob
text_data = ["I love this product!", "This movie is terrible.", "The customer service was average."]
sentiments = [TextBlob(text).sentiment.polarity for text in text_data]
print("Sentiment analysis results:")
for i, sentiment in enumerate(sentiments):
    if sentiment > 0:
        print(f"Text {i+1}: Positive")
    elif sentiment < 0:
        print(f"Text {i+1}: Negative")
    else:
        print(f"Text {i+1}: Neutral")

10. Topic Modeling – Extracting topics from text data uncovers hidden themes and keywords.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
# Sample list of sentences
text_data = ["The weather is sunny and warm.", "I love to go swimming in the ocean.", "Hiking in the mountains is my favorite activity."]
vectorizer = CountVectorizer()
word_counts = vectorizer.fit_transform(text_data)
lda = LatentDirichletAllocation(n_components=2)
topics = lda.fit_transform(word_counts)
print("Topic modeling results:")
for i, topic_dist in enumerate(topics):
    topic_index = topic_dist.argmax()
    print(f"Text {i+1} topic: {topic_index}")

These code examples help you master data preprocessing techniques and optimize data analysis workflows, covering missing value handling, type conversion, standardization, encoding, smoothing, outlier treatment, text cleaning, word frequency counting, sentiment analysis, and topic modeling.
Test Development Learning Exchange