Common Data Cleaning Techniques with Python Code Examples
This article presents a hands-on collection of Python code snippets for essential data-cleaning tasks: handling missing values, outlier detection, type conversion, formatting, duplicate removal, normalization, one-hot encoding, text preprocessing, and dataset merging. The examples rely on the pandas, NumPy, and scikit-learn libraries and are organized into nine common techniques, offering practical guidance for preparing data for analysis or machine-learning work.
1. Missing Value Handling
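The snippets below operate on an existing DataFrame `df`. As a minimal, self-contained sketch of the fill strategies (the DataFrame and column names are invented for illustration):

```python
import pandas as pd
import numpy as np

# Toy DataFrame with missing values (invented for illustration)
df = pd.DataFrame({'age': [25.0, np.nan, 30.0, 35.0],
                   'city': ['NY', 'LA', None, 'NY']})

# Fill the numeric column with its mean, the categorical one with its mode
df['age'] = df['age'].fillna(df['age'].mean())
df['city'] = df['city'].fillna(df['city'].mode()[0])

print(df.isna().sum().sum())  # 0 missing values remain
```

Note that assigning the result back (`df['col'] = df['col'].fillna(...)`) is preferred over calling `fillna` with `inplace=True` on a single column, which newer pandas versions deprecate.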
import pandas as pd
# Delete rows with missing values
df.dropna(axis=0, inplace=True)
# Delete columns with missing values
df.dropna(axis=1, inplace=True)
# Fill missing values with mean
df['column'] = df['column'].fillna(df['column'].mean())
# Fill missing values with median
df['column'] = df['column'].fillna(df['column'].median())
# Fill missing values with mode
df['column'] = df['column'].fillna(df['column'].mode()[0])
2. Outlier Handling
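The snippets below assume an existing column in `df`. A minimal runnable sketch of the box-plot (IQR) approach on toy data, using `Series.clip`, which is equivalent to the two `.loc` assignments shown below:

```python
import pandas as pd

# Toy column with one obvious outlier (invented for illustration)
df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 100]})

q1 = df['value'].quantile(0.25)
q3 = df['value'].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Clip values outside the whiskers to the limits
df['value'] = df['value'].clip(lower=lower, upper=upper)
print(df['value'].max())  # the outlier 100 is pulled down to the upper limit
```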
import pandas as pd
import numpy as np
# Standard deviation method
std = df['column'].std()
mean = df['column'].mean()
threshold = 3 * std
df.loc[np.abs(df['column'] - mean) > threshold, 'column'] = mean
# Box‑plot method
q1 = df['column'].quantile(0.25)
q3 = df['column'].quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
df.loc[df['column'] < lower_limit, 'column'] = lower_limit
df.loc[df['column'] > upper_limit, 'column'] = upper_limit
3. Data Type Conversion
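One caveat worth knowing: `astype(int)` raises on missing or unparseable values. A minimal sketch (toy data invented for illustration) using `pd.to_numeric` with `errors='coerce'`, which turns bad entries into `NaN` instead of failing:

```python
import pandas as pd

# Toy DataFrame: numbers read in as strings, with one unparseable entry
df = pd.DataFrame({'count': ['1', '2', 'oops'],
                   'date': ['2021-01-01', '2021-02-01', '2021-03-01']})

# astype(int) would raise on 'oops'; errors='coerce' yields NaN instead
df['count'] = pd.to_numeric(df['count'], errors='coerce')
df['date'] = pd.to_datetime(df['date'])

print(df.dtypes)
```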
import pandas as pd
# Convert to integer
df['column'] = df['column'].astype(int)
# Convert to datetime
df['date_column'] = pd.to_datetime(df['date_column'])
4. Data Formatting
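The `.str` accessor methods below chain naturally, so several formatting steps can run in one pass. A small self-contained sketch (column name and values invented for illustration):

```python
import pandas as pd

# Toy column with inconsistent whitespace and casing
df = pd.DataFrame({'name': ['  Alice ', 'BOB', 'carol  ']})

# Strip surrounding spaces, then normalize case, in one chained call
df['name'] = df['name'].str.strip().str.lower()

print(df['name'].tolist())  # ['alice', 'bob', 'carol']
```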
import pandas as pd
# Strip whitespace
df['column'] = df['column'].str.strip()
# Convert to lower case
df['column'] = df['column'].str.lower()
# Convert to upper case
df['column'] = df['column'].str.upper()
5. Duplicate Handling
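The difference between full-row and key-based deduplication is easiest to see on toy data (DataFrame invented for illustration):

```python
import pandas as pd

# Toy log with one exact duplicate and one key-only duplicate
df = pd.DataFrame({'user': ['a', 'a', 'b', 'b'],
                   'score': [1, 1, 2, 3]})

# Full-row: only the exact repeat of ('a', 1) is dropped
deduped = df.drop_duplicates()

# Key-based: keep the first row per user, regardless of score
per_user = df.drop_duplicates(subset=['user'], keep='first')

print(len(deduped), len(per_user))  # 3 2
```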
import pandas as pd
# Drop completely duplicate rows
df.drop_duplicates(inplace=True)
# Drop duplicates based on specific columns
df.drop_duplicates(subset=['column1', 'column2'], inplace=True)
6. Data Normalization
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# Example DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
df_scaled = pd.DataFrame(scaled_data, columns=df.columns)
print(df_scaled)
7. One‑Hot Encoding
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# Example categorical DataFrame
df = pd.DataFrame({'A': ['red', 'green', 'blue'], 'B': ['small', 'medium', 'large']})
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(df).toarray()
df_encoded = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(df.columns))
print(df_encoded)
8. Text Data Cleaning
import re
# Remove specific text
text = 'Hello, World!'
cleaned_text = text.replace('Hello', '')
print(cleaned_text)
# Extract date pattern
text = 'Today is 2021-01-01'
pattern = r'\d{4}-\d{2}-\d{2}'
matches = re.findall(pattern, text)
print(matches)
9. Data Merging
# Merge on a key
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
df2 = pd.DataFrame({'A': [4, 5, 6], 'B': ['d', 'e', 'f']})
merged_df = pd.merge(df1, df2, on='A')
print(merged_df)
# Concatenate vertically
concatenated_df = pd.concat([df1, df2], axis=0)
print(concatenated_df)
These snippets illustrate typical Python approaches to the nine techniques above, from missing-value handling through dataset merging, and can be adapted to the specifics of any dataset using libraries such as pandas, NumPy, scikit-learn, and the standard re module.
Test Development Learning Exchange