Data Cleaning Techniques in Python: 21 Practical Examples and Code

This article provides a comprehensive guide to data cleaning in Python, covering common data issues, methods for handling missing values, duplicates, categorical inconsistencies, and text normalization, illustrated with 21 detailed code examples using pandas and matplotlib.

Python Programming Learning Circle

Apr 27, 2024

Data cleaning is the process of identifying and correcting errors and inconsistencies in datasets to enable reliable analysis and reporting.

Key characteristics of clean data include accuracy, completeness, consistency, integrity, timeliness, uniformity, and validity.

The article explores four broad topics with practical Python examples:

Common data problems such as type mismatches, range violations, and duplicate records.

Text and categorical data issues, including inconsistent categories and relationship constraints.

Advanced data problems like cross‑domain validation and overall integrity.

Record linking techniques for merging and reconciling datasets.

Example 1 – Type conversion

# Import pandas and load the CSV file
import pandas as pd

sales = pd.read_csv('sales.csv')
# Preview the first two rows
sales.head(2)
# Get data types of columns
sales.dtypes

When a numeric column is stored as a string (e.g., '$1000'), the article shows how to strip the dollar sign and convert the column to int before aggregation.

sales['Revenue'] = sales['Revenue'].str.strip('$')
sales['Revenue'] = sales['Revenue'].astype('int')
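The same conversion can be sketched on a small in-memory frame so it runs without sales.csv. The sample values are made up for illustration, and the removal of thousands separators is an added assumption about how revenue might be formatted, not something the source covers.

```python
import pandas as pd

# Hypothetical sample data standing in for sales.csv
sales = pd.DataFrame({"Revenue": ["$1000", "$2,500", "$300"]})

# Remove the currency symbol and thousands separators, then convert to int
cleaned = (
    sales["Revenue"]
    .str.strip("$")
    .str.replace(",", "", regex=False)
)
sales["Revenue"] = cleaned.astype("int")

# Verify the conversion before aggregating
assert pd.api.types.is_integer_dtype(sales["Revenue"])
total_revenue = sales["Revenue"].sum()
print(total_revenue)
```

Checking the dtype before summing guards against the silent failure mode where summing a string column concatenates the values instead of adding them.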

Example 2 – Categorical encoding errors

# 0 = Never married 1 = Married 2 = Separated 3 = Divorced

The example converts the numeric codes for marital status to a categorical type and re‑runs descriptive statistics, so the summary reports counts and the most frequent category instead of a meaningless mean.

df["marriage_status"] = df["marriage_status"].astype('category')
df["marriage_status"].describe()
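To make the difference visible, the following sketch compares the summary before and after the conversion. The data frame is hypothetical and only reuses the 0 to 3 coding listed above.

```python
import pandas as pd

# Hypothetical survey data using the numeric codes listed above
df = pd.DataFrame({"marriage_status": [0, 1, 1, 3, 2, 1]})

# As integers, describe() reports a mean and quartiles, which mean nothing for codes
numeric_summary = df["marriage_status"].describe()

# As a category, describe() reports count, unique, top, and freq instead
df["marriage_status"] = df["marriage_status"].astype("category")
category_summary = df["marriage_status"].describe()

print(numeric_summary.index.tolist())
print(category_summary.index.tolist())
```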

Example 3 – Out‑of‑range values

# Output Movies with rating > 5
movies[movies['avg_rating'] > 5]

Out‑of‑range values are either filtered out or capped at the allowed maximum, using boolean indexing and .loc assignment.

# Convert avg_rating > 5 to 5
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5
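The two options, dropping versus capping, can be compared side by side. This is a minimal sketch on invented ratings, assuming a scale whose maximum is 5.

```python
import pandas as pd

# Hypothetical ratings on a scale that tops out at 5, with two invalid entries
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "avg_rating": [4.5, 6.0, 3.2, 7.1],
})

# Option 1: drop rows whose rating exceeds the maximum
dropped = movies[movies["avg_rating"] <= 5]

# Option 2: cap ratings at the maximum, keeping every row
capped = movies.copy()
capped.loc[capped["avg_rating"] > 5, "avg_rating"] = 5

# Check that no out-of-range value remains
assert capped["avg_rating"].max() <= 5
```

Dropping loses rows but never invents values; capping keeps every row at the cost of assuming the true rating was the maximum.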

Example 4 – Future dates

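import datetime as dt
# Parse the string dates and keep only the date part
user_signups['subscription_date'] = pd.to_datetime(user_signups['subscription_date']).dt.date
today_date = dt.date.today()
# Keep only signups dated before today
user_signups = user_signups[user_signups['subscription_date'] < today_date]

Dates stored as strings must be converted to date objects before filtering; otherwise the comparison with today_date fails.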
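An alternative sketch keeps the column as datetime64 and compares it against a pandas Timestamp, so both sides of the comparison share a type. The sample rows are hypothetical, and the use of <= (which keeps signups dated today) is a choice made here, not the source's.

```python
import pandas as pd

# Hypothetical signup data with dates stored as strings
user_signups = pd.DataFrame({
    "user": ["a", "b", "c"],
    "subscription_date": ["2020-01-15", "2021-06-30", "2100-01-01"],
})

# Parse the strings into datetime64 values
user_signups["subscription_date"] = pd.to_datetime(user_signups["subscription_date"])

# Today's date as a Timestamp at midnight
today = pd.Timestamp.today().normalize()

# Keep only signups that are not in the future
valid_signups = user_signups[user_signups["subscription_date"] <= today]
```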

Duplicate detection

duplicates = height_weight.duplicated()
print(duplicates)
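By default duplicated() only flags rows that match on every column. The sketch below, on invented data, shows how the subset and keep parameters change what counts as a duplicate.

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0,
# row 3 repeats row 1's name and height with a different weight
height_weight = pd.DataFrame({
    "first_name": ["Ann", "Bob", "Ann", "Bob"],
    "height": [170, 180, 170, 180],
    "weight": [60, 80, 60, 82],
})

# Default: flag rows identical across all columns, except the first occurrence
exact = height_weight.duplicated()

# subset limits the comparison to chosen columns; keep=False flags every copy
partial = height_weight.duplicated(subset=["first_name", "height"], keep=False)

# Remove exact duplicates, keeping the first occurrence
deduped = height_weight.drop_duplicates()
```

Rows that match only on a subset, like the two Bob rows here, usually need a decision about which value to keep before they can be dropped.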

The article explains how to use duplicated() with the subset and keep parameters, and how to remove duplicates via drop_duplicates().

height_weight.drop_duplicates(inplace=True)

Category consistency

# Identify inconsistent categories
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)

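Inconsistent rows are isolated with a boolean mask built by isin(), then either corrected or removed using boolean indexing.

inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
consistent_data = study_data[~inconsistent_rows]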
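The whole check, from set difference to filtered frame, can be sketched end to end. Both tables below are hypothetical stand-ins for study_data and categories.

```python
import pandas as pd

# Hypothetical reference table of valid categories
categories = pd.DataFrame({
    "blood_type": ["A+", "A-", "B+", "B-", "O+", "O-", "AB+", "AB-"],
})

# Hypothetical study data containing one invalid value
study_data = pd.DataFrame({
    "name": ["Beth", "Ivan", "Jen"],
    "blood_type": ["B-", "Z+", "O+"],
})

# Values present in the data but absent from the reference table
inconsistent_categories = set(study_data["blood_type"]).difference(categories["blood_type"])

# Boolean mask marking the rows that hold those values
inconsistent_rows = study_data["blood_type"].isin(inconsistent_categories)

# Split the data into rows to review and rows that are safe to keep
inconsistent_data = study_data[inconsistent_rows]
consistent_data = study_data[~inconsistent_rows]
```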

Text data cleaning

# Standardize case
demographics['marriage_status'] = demographics['marriage_status'].str.upper()
# Strip leading and trailing whitespace
demographics['marriage_status'] = demographics['marriage_status'].str.strip()
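A quick way to confirm that the normalization worked is to count distinct values before and after. The responses below are invented for illustration.

```python
import pandas as pd

# Hypothetical survey responses with mixed case and stray spaces
demographics = pd.DataFrame({
    "marriage_status": ["married", " Married", "UNMARRIED ", "unmarried"],
})

before = demographics["marriage_status"].nunique()

# Strip whitespace, then standardize case
demographics["marriage_status"] = (
    demographics["marriage_status"].str.strip().str.upper()
)

after = demographics["marriage_status"].nunique()
print(before, after)
```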

Phone numbers are normalized by replacing '+' with '00', removing dashes, and setting short entries to NaN.

import numpy as np
# Replace '+' with '00' (regex=False treats '+' as a literal character)
phones["Phone number"] = phones["Phone number"].str.replace("+", "00", regex=False)
# Remove dashes
phones["Phone number"] = phones["Phone number"].str.replace("-", "", regex=False)
# Set numbers with fewer than 10 digits to NaN
phones.loc[phones['Phone number'].str.len() < 10, "Phone number"] = np.nan
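The three steps can be run together on a few made-up numbers, with a final check that every remaining entry meets the length rule. The sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical phone numbers in inconsistent formats
phones = pd.DataFrame({
    "Phone number": ["+1-202-555-0143", "0012025550178", "555-01"],
})

# regex=False treats "+" as a literal character rather than a regex quantifier
phones["Phone number"] = phones["Phone number"].str.replace("+", "00", regex=False)
phones["Phone number"] = phones["Phone number"].str.replace("-", "", regex=False)

# Replace entries that are too short to be valid with NaN
too_short = phones["Phone number"].str.len() < 10
phones.loc[too_short, "Phone number"] = np.nan

# Sanity check: every remaining number has at least 10 characters
assert phones["Phone number"].dropna().str.len().min() >= 10
```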

Overall, the guide walks through a step‑by‑step workflow for cleaning raw data before analysis: fix column types, enforce valid ranges and dates, remove duplicates, and standardize categories and text.

