Common Data Cleaning Techniques with Python Code Examples
This article presents a hands-on collection of Python code snippets for essential data-cleaning tasks: handling missing values, outlier detection, type conversion, formatting, duplicate removal, normalization, one-hot encoding, text preprocessing, and dataset merging. The examples rely on the pandas, NumPy, and scikit-learn libraries and are organized into nine common techniques, offering practical guidance for preparing data for analysis or machine-learning work.
1. Missing Value Handling
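The snippets below operate on an existing DataFrame `df`. As a minimal, self-contained sketch of the fill strategies (the DataFrame and column names are invented for illustration):

```python
import pandas as pd
import numpy as np

# Toy DataFrame with missing values (invented for illustration)
df = pd.DataFrame({'age': [25.0, np.nan, 30.0, 35.0],
                   'city': ['NY', 'LA', None, 'NY']})

# Fill the numeric column with its mean, the categorical one with its mode
df['age'] = df['age'].fillna(df['age'].mean())
df['city'] = df['city'].fillna(df['city'].mode()[0])

print(df.isna().sum().sum())  # 0 missing values remain
```

Note that assigning the result back (`df['col'] = df['col'].fillna(...)`) is preferred over calling `fillna` with `inplace=True` on a single column, which newer pandas versions deprecate.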
import pandas as pd
# Delete rows with missing values
df.dropna(axis=0, inplace=True)
# Delete columns with missing values
df.dropna(axis=1, inplace=True)
# Fill missing values with mean
df['column'] = df['column'].fillna(df['column'].mean())
# Fill missing values with median
df['column'] = df['column'].fillna(df['column'].median())
# Fill missing values with mode
df['column'] = df['column'].fillna(df['column'].mode()[0])
2. Outlier Handling
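The snippets below assume an existing column in `df`. A minimal runnable sketch of the box-plot (IQR) approach on toy data, using `Series.clip`, which is equivalent to the two `.loc` assignments shown below:

```python
import pandas as pd

# Toy column with one obvious outlier (invented for illustration)
df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 100]})

q1 = df['value'].quantile(0.25)
q3 = df['value'].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Clip values outside the whiskers to the limits
df['value'] = df['value'].clip(lower=lower, upper=upper)
print(df['value'].max())  # the outlier 100 is pulled down to the upper limit
```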
import pandas as pd
import numpy as np
# Standard deviation method
std = df['column'].std()
mean = df['column'].mean()
threshold = 3 * std
df.loc[np.abs(df['column'] - mean) > threshold, 'column'] = mean
# Box‑plot method
q1 = df['column'].quantile(0.25)
q3 = df['column'].quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
df.loc[df['column'] < lower_limit, 'column'] = lower_limit
df.loc[df['column'] > upper_limit, 'column'] = upper_limit
3. Data Type Conversion
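One caveat worth knowing: `astype(int)` raises on missing or unparseable values. A minimal sketch (toy data invented for illustration) using `pd.to_numeric` with `errors='coerce'`, which turns bad entries into `NaN` instead of failing:

```python
import pandas as pd

# Toy DataFrame: numbers read in as strings, with one unparseable entry
df = pd.DataFrame({'count': ['1', '2', 'oops'],
                   'date': ['2021-01-01', '2021-02-01', '2021-03-01']})

# astype(int) would raise on 'oops'; errors='coerce' yields NaN instead
df['count'] = pd.to_numeric(df['count'], errors='coerce')
df['date'] = pd.to_datetime(df['date'])

print(df.dtypes)
```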
import pandas as pd
# Convert to integer
df['column'] = df['column'].astype(int)
# Convert to datetime
df['date_column'] = pd.to_datetime(df['date_column'])
4. Data Formatting
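The `.str` accessor methods below chain naturally, so several formatting steps can run in one pass. A small self-contained sketch (column name and values invented for illustration):

```python
import pandas as pd

# Toy column with inconsistent whitespace and casing
df = pd.DataFrame({'name': ['  Alice ', 'BOB', 'carol  ']})

# Strip surrounding spaces, then normalize case, in one chained call
df['name'] = df['name'].str.strip().str.lower()

print(df['name'].tolist())  # ['alice', 'bob', 'carol']
```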
import pandas as pd
# Strip whitespace
df['column'] = df['column'].str.strip()
# Convert to lower case
df['column'] = df['column'].str.lower()
# Convert to upper case
df['column'] = df['column'].str.upper()
5. Duplicate Handling
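The difference between full-row and key-based deduplication is easiest to see on toy data (DataFrame invented for illustration):

```python
import pandas as pd

# Toy log with one exact duplicate and one key-only duplicate
df = pd.DataFrame({'user': ['a', 'a', 'b', 'b'],
                   'score': [1, 1, 2, 3]})

# Full-row: only the exact repeat of ('a', 1) is dropped
deduped = df.drop_duplicates()

# Key-based: keep the first row per user, regardless of score
per_user = df.drop_duplicates(subset=['user'], keep='first')

print(len(deduped), len(per_user))  # 3 2
```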
import pandas as pd
# Drop completely duplicate rows
df.drop_duplicates(inplace=True)
# Drop duplicates based on specific columns
df.drop_duplicates(subset=['column1', 'column2'], inplace=True)
6. Data Normalization
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# Example DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
df_scaled = pd.DataFrame(scaled_data, columns=df.columns)
print(df_scaled)
7. One‑Hot Encoding
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# Example categorical DataFrame
df = pd.DataFrame({'A': ['red', 'green', 'blue'], 'B': ['small', 'medium', 'large']})
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(df).toarray()
df_encoded = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(df.columns))
print(df_encoded)
8. Text Data Cleaning
import re
# Remove specific text
text = 'Hello, World!'
cleaned_text = text.replace('Hello', '')
print(cleaned_text)
# Extract date pattern
text = 'Today is 2021-01-01'
pattern = r'\d{4}-\d{2}-\d{2}'
matches = re.findall(pattern, text)
print(matches)
9. Data Merging
# Merge on a key
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
df2 = pd.DataFrame({'A': [4, 5, 6], 'B': ['d', 'e', 'f']})
merged_df = pd.merge(df1, df2, on='A')
print(merged_df)
# Concatenate vertically
concatenated_df = pd.concat([df1, df2], axis=0)
print(concatenated_df)
These snippets illustrate typical Python approaches to the nine techniques above, from missing-value handling through dataset merging, and can be adapted to the specifics of any dataset using libraries such as pandas, NumPy, scikit-learn, and the standard re module.
Test Development Learning Exchange