Step-by-Step Guide to Data Cleaning with Pandas

This tutorial shows how to use Python's Pandas library to import a dataset, explore its structure, and handle missing values, outliers, duplicates, inconsistent formats, special characters, and time-series features. It ends by saving the cleaned dataset and listing practices that help keep data-cleaning quality high.

Test Development Learning Exchange

Oct 28, 2024

Data cleaning is a crucial step in data analysis and machine learning. It covers handling missing values, outliers, duplicate records, and inconsistent data, and Pandas provides a comprehensive set of tools for each of these tasks.

1. Import required libraries

import pandas as pd
import numpy as np

2. Load data

# Load data
df = pd.read_csv('data.csv')
print(df.head())
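
Real files often mark missing entries with placeholder strings, which makes numeric columns load as text. The sketch below shows one way to handle that at load time with the `na_values` parameter of `read_csv`. The in-memory CSV, its column names, and the `?` and `unknown` placeholders are made up for illustration and stand in for `data.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'
raw_csv = io.StringIO(
    "Name,Age,Income\n"
    "Alice,34,52000\n"
    "Bob,?,48000\n"
    "Carol,29,unknown\n"
)

# na_values lists extra strings that pandas should read as missing
sample = pd.read_csv(raw_csv, na_values=["?", "unknown"])

print(sample.dtypes)          # Age and Income load as numeric columns
print(sample.isnull().sum())  # one missing value in each of them
```

Without `na_values`, both columns would load as text and would need converting later.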

3. Explore data

# View basic info (df.info() prints its report itself and returns None)
df.info()
# View descriptive statistics
print(df.describe())
# Check missing values per column
print(df.isnull().sum())
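
Raw missing-value counts are easier to act on when shown next to the share of rows they represent. The following is a minimal sketch of such a report. The `demo` frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns
demo = pd.DataFrame({
    "Age": [25, np.nan, 40, 31],
    "Gender": ["F", "M", None, None],
})

# isnull().mean() gives the fraction of missing rows per column
missing_report = pd.DataFrame({
    "missing_count": demo.isnull().sum(),
    "missing_pct": demo.isnull().mean() * 100,
}).sort_values("missing_pct", ascending=False)

print(missing_report)
```

Sorting by percentage puts the columns that need the most attention first.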

4. Handle missing values

# The options below are alternatives: choose one strategy per column
# rather than running them all in sequence.
# Drop rows with missing values
df = df.dropna()
# Drop columns with missing values
df = df.dropna(axis=1)
# Fill numeric column with mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fill categorical column with mode
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
# Fill with a specific value
df['Income'] = df['Income'].fillna(0)
# Forward fill (fillna(method='ffill') is deprecated in recent pandas versions)
df['Salary'] = df['Salary'].ffill()
# Backward fill
df['Salary'] = df['Salary'].bfill()
# Interpolate
df['Temperature'] = df['Temperature'].interpolate()
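
The fill strategies can give quite different results on the same gap, so it may help to compare them side by side before committing to one. This is a small sketch on an invented `Temperature` series with two consecutive missing readings.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a two-row gap
readings = pd.DataFrame({"Temperature": [20.0, np.nan, np.nan, 26.0]})
temp = readings["Temperature"]

# Apply three strategies to the same column without modifying it
comparison = pd.DataFrame({
    "mean": temp.fillna(temp.mean()),   # constant value for every gap
    "ffill": temp.ffill(),              # repeats the last known value
    "interpolate": temp.interpolate(),  # linear ramp between neighbors
})

print(comparison)
```

For ordered measurements such as temperatures, interpolation usually follows the trend more closely than a mean fill, while for unordered records a mean or mode fill may be the more defensible choice.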

5. Handle outliers

from scipy import stats

# Remove records where Age > 100
df = df[df['Age'] <= 100]
# Z-score method (computed on numeric columns only)
z_scores = np.abs(stats.zscore(df.select_dtypes(include=[np.number])))
# Keep records whose Z-score is below 3 in every numeric column
df = df[(z_scores < 3).all(axis=1)]
# IQR method, restricted to numeric columns so quantile() does not fail on text columns
numeric = df.select_dtypes(include=[np.number])
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1
# Remove records outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] in any numeric column
df = df[~((numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))).any(axis=1)]
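
Before dropping rows, it can be worth flagging the outliers and looking at them first. The sketch below applies the IQR rule to a single column. The `ages` frame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical ages with one implausible entry
ages = pd.DataFrame({"Age": [22, 25, 27, 30, 31, 33, 35, 120]})

q1 = ages["Age"].quantile(0.25)
q3 = ages["Age"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag rather than delete, so the suspicious rows can be reviewed
is_outlier = ~ages["Age"].between(lower, upper)
print(ages[is_outlier])

filtered = ages[~is_outlier]
```

Keeping the boolean mask separate from the filtering step makes it easy to log how many rows were removed and why.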

6. Remove duplicate records

# Drop duplicates
df = df.drop_duplicates()
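
By default `drop_duplicates()` only removes rows that are identical in every column. When duplicates are defined by a key, the `subset` and `keep` parameters can express that. The `orders` frame and its column names below are hypothetical.

```python
import pandas as pd

# Hypothetical records: customer 1 appears twice with different update dates
orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2],
    "Email": ["a@x.com", "a@x.com", "b@x.com", "b@x.com"],
    "Updated": ["2024-01-01", "2024-03-01", "2024-02-01", "2024-02-01"],
})

# Count rows that are exact copies of an earlier row
print(orders.duplicated().sum())

# Keep only the most recently updated row per customer
latest = (
    orders.sort_values("Updated")
    .drop_duplicates(subset=["CustomerID"], keep="last")
)
print(latest)
```

Sorting first and then keeping the last row per key is one common way to choose which duplicate survives.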

7. Convert data types

# Convert string to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Convert object to numeric (unparseable values become NaN)
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
# Convert a column to the categorical dtype
df['Category'] = df['Category'].astype('category')
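
With `errors='coerce'`, values that cannot be parsed turn into `NaN` or `NaT` silently, so it may be useful to check which rows were affected. This is a minimal sketch on an invented `raw` frame.

```python
import pandas as pd

# Hypothetical raw input with one bad value in each column
raw = pd.DataFrame({
    "Age": ["34", "abc", "29"],
    "Date": ["2024-10-28", "2024-10-29", "not a date"],
})

age_num = pd.to_numeric(raw["Age"], errors="coerce")
date_parsed = pd.to_datetime(raw["Date"], errors="coerce")

# Rows where a conversion failed and produced a missing value
bad_rows = raw[age_num.isna() | date_parsed.isna()]
print(bad_rows)
```

Reviewing `bad_rows` before overwriting the original columns avoids losing information that could have been repaired by hand.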

8. Standardize inconsistent data

# Lowercase all text
df['Name'] = df['Name'].str.lower()
# Strip whitespace
df['Name'] = df['Name'].str.strip()
# Replace specific values
df['City'] = df['City'].replace({'New York City': 'New York', 'LA': 'Los Angeles'})
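
The order of these operations matters: a dictionary passed to `replace` matches whole values exactly, so stray spaces or different casing would prevent a match. The sketch below normalizes first and maps afterwards. The `cities` values and the lowercase mapping are made up for illustration.

```python
import pandas as pd

# Hypothetical city names with inconsistent spacing, casing, and aliases
cities = pd.Series(["  New York City", "new york", "LA ", "Los Angeles"])

# Normalize whitespace and case first so the mapping keys can match
normalized = cities.str.strip().str.lower()
canonical = normalized.replace({"new york city": "new york", "la": "los angeles"})

print(canonical.value_counts())
```

Checking `value_counts()` afterwards is a quick way to spot variants that the mapping still misses.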

9. Remove special characters

# Remove special characters (raw string avoids invalid-escape warnings for \w and \s)
df['Comment'] = df['Comment'].str.replace(r'[^\w\s]', '', regex=True)
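
It helps to see what the pattern `[^\w\s]` keeps and what it removes. In Python, `\w` covers letters (including accented ones), digits, and the underscore, so those survive. The `comments` values below are invented for illustration.

```python
import pandas as pd

# Hypothetical free-text comments
comments = pd.Series([
    "Great product!!!",
    "Price: $20 (too high?)",
    "café_au_lait #1",
])

# Remove everything that is neither a word character nor whitespace
cleaned = comments.str.replace(r"[^\w\s]", "", regex=True)
print(cleaned.tolist())
```

If underscores should also be removed, the pattern would need to be extended, for example to `[^\w\s]|_`.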

10. Process time‑series data

# Extract year, month, day
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
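
The same `.dt` accessor exposes other calendar features that are often useful for time-series work. Here is a short sketch on an invented `events` frame. The new column names are illustrative choices.

```python
import pandas as pd

# Hypothetical event dates, already parsed to datetime
events = pd.DataFrame({
    "Date": pd.to_datetime(["2024-10-26", "2024-10-28", "2024-11-02"]),
})

events["DayOfWeek"] = events["Date"].dt.dayofweek       # Monday=0 ... Sunday=6
events["IsWeekend"] = events["Date"].dt.dayofweek >= 5  # Saturday or Sunday
events["Quarter"] = events["Date"].dt.quarter

print(events)
```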

11. Save the cleaned dataset

# Save cleaned data
df.to_csv('cleaned_data.csv', index=False)
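
CSV stores plain text, so dtypes such as datetime and category are not preserved and have to be restored when the file is read back. The sketch below writes to a temporary directory to show the round trip. The `cleaned` frame is invented for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data with a datetime and a categorical column
cleaned = pd.DataFrame({
    "Date": pd.to_datetime(["2024-10-28", "2024-10-29"]),
    "Category": pd.Categorical(["a", "b"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_data.csv")
    cleaned.to_csv(path, index=False)
    # parse_dates restores the datetime column; Category comes back as text
    reloaded = pd.read_csv(path, parse_dates=["Date"])

print(reloaded.dtypes)
```

If preserving dtypes matters, a binary format such as Parquet may be a better fit than CSV, at the cost of an extra dependency.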

Ensuring data‑cleaning effectiveness

Define clear data-quality standards (completeness, consistency, accuracy, validity) and explore the data thoroughly before changing it. Log each cleaning step, clean in staged phases, and guard the results with assertions and unit tests. Review the data regularly, keep the cleaning code under version control (Git), and communicate with team members. Finally, automate pipelines with tools like Airflow or Prefect and set up continuous data-quality monitoring.
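
As one possible way to put the assertion idea into practice, the sketch below collects data-quality problems into a list that a unit test can assert to be empty. The function name `check_quality`, the `Age` column, and the 0 to 100 range are assumptions chosen for illustration, not rules from the original article.

```python
import pandas as pd

def check_quality(df):
    """Return a list of data-quality problems found in df (hypothetical rules)."""
    problems = []
    if df["Age"].isna().any():
        problems.append("Age has missing values")
    if not df["Age"].dropna().between(0, 100).all():
        problems.append("Age outside 0-100")
    if df.duplicated().any():
        problems.append("duplicate rows present")
    return problems

good = pd.DataFrame({"Age": [25, 40]})
bad = pd.DataFrame({"Age": [25, 25, 130]})

# A cleaned dataset should produce no problems
assert check_quality(good) == []
print(check_quality(bad))
```

Returning every problem at once, instead of stopping at the first failed assertion, tends to make the cleaning log more informative.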
