Step-by-Step Guide to Data Cleaning with Pandas
This tutorial shows how to use Python's Pandas library to import data, explore its structure, and handle missing values, outliers, duplicates, inconsistent formats, special characters, and time‑series features. It ends by saving the cleaned dataset and listing practices that help keep data cleaning reliable.
Data cleaning is a crucial step in data analysis and machine learning, involving the handling of missing values, outliers, duplicate records, and inconsistent data; Pandas provides a comprehensive set of tools to perform these tasks efficiently.
1. Import required libraries
<code>import pandas as pd</code>
<code>import numpy as np</code>
2. Load data
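Real files often mark missing values with placeholders, and identifier columns can lose leading zeros when parsed as numbers. The following is a minimal sketch, assuming an in-memory CSV in place of `data.csv` and hypothetical column names (`id`, `Age`, `City`), of how `na_values` and `dtype` can handle both at load time.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv
raw = io.StringIO(
    "id,Age,City\n"
    "001,34,New York\n"
    "002,N/A,LA\n"
    "003,?,Boston\n"
)

# na_values: extra strings to treat as missing
# dtype: keep id as text so leading zeros survive
df = pd.read_csv(raw, na_values=["N/A", "?"], dtype={"id": str})

print(df.dtypes)
print(df.isnull().sum())
```

Declaring placeholders at load time means the later missing-value steps see real `NaN` values instead of strings such as `"?"`.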
<code># Load data</code>
<code>df = pd.read_csv('data.csv')</code>
<code>print(df.head())</code>
3. Explore data
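Raw missing-value counts are easier to judge as a share of the rows. As a hedged sketch on a toy frame (the data and the `missing_pct` name are made up for illustration), the percentage of missing values per column can be computed like this:

```python
import numpy as np
import pandas as pd

# Toy data with some gaps
df = pd.DataFrame({
    "Age": [25, np.nan, 40, np.nan],
    "City": ["NY", "LA", None, "NY"],
})

# isnull() gives booleans; their mean is the fraction of missing values
missing_pct = df.isnull().mean().mul(100).round(1)
print(missing_pct)

# Frequency table for a text column, including missing values
print(df["City"].value_counts(dropna=False))
```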
<code># View basic info (df.info() prints directly and returns None, so no print() is needed)</code>
<code>df.info()</code>
<code># View descriptive statistics</code>
<code>print(df.describe())</code>
<code># Count missing values per column</code>
<code>print(df.isnull().sum())</code>
4. Handle missing values
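The snippets in this section are alternatives rather than a sequence: if every row with a missing value is dropped first, the later fill steps have nothing left to fill. A minimal sketch, assuming a toy frame that reuses this tutorial's column names, picks one strategy per column:

```python
import numpy as np
import pandas as pd

# Toy data; column names mirror the ones used in this tutorial
df = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 31.0],
    "Gender": ["F", "M", None, "M"],
    "Salary": [50000.0, np.nan, np.nan, 65000.0],
})

# Numeric column: fill with the column mean
df["Age"] = df["Age"].fillna(df["Age"].mean())
# Categorical column: fill with the most frequent value
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])
# Ordered data: carry the last known value forward
df["Salary"] = df["Salary"].ffill()

print(df)
```

Which strategy fits depends on the column. Mean filling shrinks the variance, and forward filling only makes sense when row order is meaningful.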
<code># The options below are alternatives; choose one per column rather than running them all</code>
<code># Drop rows with missing values</code>
<code>df = df.dropna()</code>
<code># Drop columns with missing values</code>
<code>df = df.dropna(axis=1)</code>
<code># Fill numeric column with mean</code>
<code>df['Age'] = df['Age'].fillna(df['Age'].mean())</code>
<code># Fill categorical column with mode</code>
<code>df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])</code>
<code># Fill with a specific value</code>
<code>df['Income'] = df['Income'].fillna(0)</code>
<code># Forward fill (fillna(method='ffill') is deprecated in recent pandas)</code>
<code>df['Salary'] = df['Salary'].ffill()</code>
<code># Backward fill</code>
<code>df['Salary'] = df['Salary'].bfill()</code>
<code># Interpolate (linear by default)</code>
<code>df['Temperature'] = df['Temperature'].interpolate()</code>
5. Handle outliers
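Quantile-based filters only make sense for numeric columns, and on a frame that also holds text `quantile()` can raise an error. The following sketch, assuming toy data and the illustrative names `num`, `outside`, and `cleaned`, computes the IQR bounds on the numeric columns and uses the result to filter the full frame:

```python
import pandas as pd

# Toy data with one obviously implausible age
df = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "e", "f"],
    "Age": [22, 25, 27, 30, 31, 240],
    "Income": [40, 42, 45, 47, 50, 52],
})

# Work on numeric columns only
num = df.select_dtypes(include="number")
q1 = num.quantile(0.25)
q3 = num.quantile(0.75)
iqr = q3 - q1

# True wherever a value lies more than 1.5 * IQR outside the quartiles
outside = (num < (q1 - 1.5 * iqr)) | (num > (q3 + 1.5 * iqr))

# Keep rows with no flagged value in any numeric column
cleaned = df[~outside.any(axis=1)]
print(cleaned)
```

Because the boolean mask shares the frame's index, it can filter `df` directly, so non-numeric columns such as `Name` are kept.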
<code># Remove records where Age > 100</code>
<code>df = df[df['Age'] <= 100]</code>
<code># Z-score method (numeric columns only; assumes they contain no missing values)</code>
<code>from scipy import stats</code>
<code>z_scores = np.abs(stats.zscore(df.select_dtypes(include=[np.number])))</code>
<code># Keep records whose Z-score is below 3 in every numeric column</code>
<code>df = df[(z_scores < 3).all(axis=1)]</code>
<code># IQR method (restricted to numeric columns so quantile() does not fail on text)</code>
<code>num = df.select_dtypes(include=[np.number])</code>
<code>Q1 = num.quantile(0.25)</code>
<code>Q3 = num.quantile(0.75)</code>
<code>IQR = Q3 - Q1</code>
<code># Remove records more than 1.5 * IQR outside the quartiles</code>
<code>df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]</code>
6. Remove duplicate records
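Exact-row deduplication misses records that share a key but differ in another field. As a hedged sketch, assuming a hypothetical `Email` column acts as the key, `duplicated()` can count duplicates first and `subset`/`keep` can control which rows survive:

```python
import pandas as pd

# Toy data: one exact duplicate and one near-duplicate on Email
df = pd.DataFrame({
    "Email": ["a@x.com", "a@x.com", "b@x.com", "b@x.com"],
    "Name": ["Ann", "Ann", "Bob", "Robert"],
})

# Count fully identical rows before dropping anything
n_exact = df.duplicated().sum()
print(n_exact)

# Drop only fully identical rows
exact = df.drop_duplicates()

# Treat rows with the same Email as duplicates and keep the first one
by_email = df.drop_duplicates(subset=["Email"], keep="first")
print(by_email)
```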
<code># Drop duplicates</code>
<code>df = df.drop_duplicates()</code>
7. Convert data types
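Conversions with `errors='coerce'` turn unparseable values into missing values instead of raising, so it is worth counting what was lost. A minimal sketch on toy data (the invalid entries are made up for illustration):

```python
import pandas as pd

# Toy data with one bad number and one impossible date
df = pd.DataFrame({
    "Age": ["34", "41", "unknown"],
    "Date": ["2024-01-15", "2024-02-30", "2024-03-01"],
})

# Unparseable entries become NaN / NaT rather than raising an error
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Count how many values failed to convert
print(df.isnull().sum())
```

Comparing the missing-value counts before and after conversion shows how many entries were silently coerced.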
<code># Convert string to datetime</code>
<code>df['Date'] = pd.to_datetime(df['Date'])</code>
<code># Convert object to numeric (unparseable values become NaN)</code>
<code>df['Age'] = pd.to_numeric(df['Age'], errors='coerce')</code>
<code># Convert a low-cardinality column to the categorical dtype</code>
<code>df['Category'] = df['Category'].astype('category')</code>
8. Standardize inconsistent data
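`replace()` with a dictionary matches whole values exactly, so stray whitespace or different casing will prevent a match. The sketch below, assuming a hypothetical `canonical` mapping, normalizes the text first and maps variants afterwards:

```python
import pandas as pd

# Toy data with whitespace, casing, and naming variants
df = pd.DataFrame({"City": [" New York City", "LA ", "new york", "Los Angeles"]})

# Hypothetical mapping from normalized variants to one canonical label
canonical = {"new york city": "new york", "la": "los angeles"}

# Strip and lowercase first so the exact-match replace can find the variants
df["City"] = df["City"].str.strip().str.lower().replace(canonical)

print(df["City"].value_counts())
```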
<code># Lowercase all text</code>
<code>df['Name'] = df['Name'].str.lower()</code>
<code># Strip whitespace</code>
<code>df['Name'] = df['Name'].str.strip()</code>
<code># Replace specific values</code>
<code>df['City'] = df['City'].replace({'New York City': 'New York', 'LA': 'Los Angeles'})</code>
9. Remove special characters
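The pattern `[^\w\s]` matches anything that is not a word character or whitespace. Note that `\w` includes the underscore, and removing punctuation can leave stray spaces behind. A small sketch on made-up comments, using a raw string for the pattern:

```python
import pandas as pd

# Toy comments with punctuation and symbols
df = pd.DataFrame({"Comment": ["Great!!! :)", "50% off... really?", "user_name #1"]})

# Remove everything except word characters and whitespace,
# then trim spaces left at the ends
df["Comment"] = (
    df["Comment"]
    .str.replace(r"[^\w\s]", "", regex=True)
    .str.strip()
)

print(df["Comment"].tolist())
```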
<code># Remove special characters (raw string so \w and \s reach the regex engine unchanged)</code>
<code>df['Comment'] = df['Comment'].str.replace(r'[^\w\s]', '', regex=True)</code>
10. Process time‑series data
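The `.dt` accessor only works on datetime columns, so the column has to be converted before any parts are extracted. A minimal sketch on toy dates (the extra `Weekday` column is an illustration, not part of the original steps):

```python
import pandas as pd

# Toy data: dates stored as strings
df = pd.DataFrame({"Date": ["2024-01-15", "2024-02-29", "2024-12-31"]})

# Convert first; .dt raises AttributeError on plain string columns
df["Date"] = pd.to_datetime(df["Date"])

df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
# Illustrative extra feature: name of the weekday
df["Weekday"] = df["Date"].dt.day_name()

print(df)
```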
<code># Extract year, month, day</code>
<code>df['Year'] = df['Date'].dt.year</code>
<code>df['Month'] = df['Date'].dt.month</code>
<code>df['Day'] = df['Date'].dt.day</code>
11. Save the cleaned dataset
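CSV stores plain text, so column types such as datetime are not preserved and have to be restored on reload. The following sketch, assuming an in-memory buffer in place of `cleaned_data.csv`, writes a toy frame and reads it back with `parse_dates`:

```python
import io

import pandas as pd

# Toy cleaned frame with a datetime column
df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-15", "2024-02-29"]),
    "Age": [25, 40],
})

# In-memory buffer standing in for cleaned_data.csv
buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False avoids writing the row index as a column

# Reload and ask pandas to parse the Date column again
buf.seek(0)
restored = pd.read_csv(buf, parse_dates=["Date"])

print(restored.dtypes)
```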
<code># Save cleaned data</code>
<code>df.to_csv('cleaned_data.csv', index=False)</code>
Ensuring data‑cleaning effectiveness
Define clear data‑quality standards (completeness, consistency, accuracy, validity), conduct thorough data exploration, log each cleaning step, perform cleaning in staged phases, use assertions and unit tests, regularly review data, employ version control (Git), communicate with team members, automate pipelines with tools like Airflow or Prefect, and set up continuous data‑quality monitoring.
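The advice about assertions can be sketched as a small validation function that runs at the end of the pipeline. This is an illustrative example only: the `validate` name and the specific rules are assumptions, with the 0 to 100 age range borrowed from the outlier step above.

```python
import pandas as pd


def validate(df: pd.DataFrame) -> None:
    """Raise AssertionError if the cleaned frame violates a basic rule."""
    # Completeness: no missing ages
    assert df["Age"].notna().all(), "Age has missing values"
    # Validity: ages within the range assumed in the outlier step
    assert df["Age"].between(0, 100).all(), "Age outside 0-100"
    # Consistency: no fully duplicated rows
    assert not df.duplicated().any(), "duplicate rows remain"


# Toy cleaned frame that satisfies every rule
cleaned = pd.DataFrame({"Age": [25, 40, 31], "Name": ["ann", "bob", "cy"]})
validate(cleaned)
print("all checks passed")
```

The same function can be called from a unit test or as the last step before saving, so a rule violation stops the pipeline instead of producing a bad file.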