Fundamentals 9 min read

Master Data Cleaning in Pandas: 20 Essential Scripts for E‑Commerce Sales Analysis

This guide walks you through creating a sample e‑commerce sales Excel file with Pandas and then demonstrates twenty practical data‑cleaning and transformation scripts—including handling missing values, renaming columns, filtering, grouping, and exporting—so you can efficiently prepare sales data for analysis.

Test Development Learning Exchange

Feb 11, 2025

Master Data Cleaning in Pandas: 20 Essential Scripts for E‑Commerce Sales Analysis

This article provides a complete, step‑by‑step tutorial for generating a sample e‑commerce sales dataset and performing a wide range of data‑cleaning operations using Python's pandas library.

1. Create sales.xlsx from scratch

import pandas as pd
from datetime import datetime
# Define sample data
data = {
    'OrderID': [1, 2, 3, 4, 5],
    'ProductName': ['Phone', 'Laptop', 'T‑shirt', 'Headphones', 'Jeans'],
    'Category': ['Electronics', 'Electronics', 'Apparel', 'Electronics', 'Apparel'],
    'Sales': [2999, 5999, 99, 199, 299],
    'Quantity': [1, 1, 2, 1, None],  # intentional missing value
    'Date': [datetime(2025, 1, 1), datetime(2025, 1, 2), None,
             datetime(2025, 1, 3), datetime(2025, 1, 4)]  # intentional missing value
}
# Create DataFrame
df = pd.DataFrame(data)
# Save to Excel (no index, custom sheet name)
df.to_excel('sales.xlsx', index=False, sheet_name='Sales Data')
print("Excel file 'sales.xlsx' created successfully.")

Before running the script, ensure pandas and openpyxl are installed:

pip install pandas openpyxl

2. Load the Excel file

import pandas as pd
# Read the Excel file
df = pd.read_excel('sales.xlsx')
print("Original data:")
print(df)

3. Inspect basic information

print("Dataframe info:")
print(df.info())

4. View the first few rows

print("First 5 rows:")
print(df.head())

5. Drop columns with more than half missing values

half_count = len(df) / 2
df_cleaned = df.dropna(thresh=half_count, axis=1)
print("After dropping columns with >50% missing values:")
print(df_cleaned)

6. Remove rows containing any missing values

df_cleaned = df.dropna()
print("After dropping rows with any missing values:")
print(df_cleaned)

7. Fill missing values with specific constants

df_filled = df.fillna({'Quantity': 0, 'Date': 'Unknown'})
print("After filling missing values with constants:")
print(df_filled)

8. Fill numeric missing values with column mean

df['Quantity'].fillna(df['Quantity'].mean(), inplace=True)
print("After filling numeric missing values with mean:")
print(df)

9. Remove duplicate rows

df_unique = df.drop_duplicates()
print("After removing duplicate rows:")
print(df_unique)

10. Rename columns

df.rename(columns={'Sales': 'Sales', 'Quantity': 'Quantity', 'Date': 'Date'}, inplace=True)
# Example of renaming Chinese column names to English
# df.rename(columns={'销售额': 'Sales', '数量': 'Quantity', '日期': 'Date'}, inplace=True)
print("After renaming columns:")
print(df)

11. Convert the date column to proper datetime type

df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
print("After converting 'Date' to datetime:")
print(df)

12. Filter rows by a condition (Sales ≥ 2000)

filtered_df = df[df['Sales'] >= 2000]
print("Products with Sales ≥ 2000:")
print(filtered_df)

13. Group by category and compute average Sales and Quantity

grouped = df.groupby('Category').mean()
print("Average Sales and Quantity by Category:")
print(grouped)

14. Create a new column (Total Amount = Sales × Quantity)

df['Total Amount'] = df['Sales'] * df['Quantity']
print("After adding 'Total Amount' column:")
print(df)

15. Delete the newly created column

df.drop(columns=['Total Amount'], inplace=True)
print("After dropping 'Total Amount' column:")
print(df)

16. Reset the index

df_reset = df.reset_index(drop=True)
print("After resetting index:")
print(df_reset)

17. Replace specific categorical values

# Replace Chinese category name with English equivalent
df['Category'] = df['Category'].replace('电子产品', 'Electronics')
print("After replacing '电子产品' with 'Electronics':")
print(df)

18. Apply a custom function to grade Sales

def sales_grade(sales):
    if sales >= 5000:
        return 'High'
    elif sales >= 2000:
        return 'Medium'
    else:
        return 'Low'

df['Sales Grade'] = df['Sales'].apply(sales_grade)
print("After applying sales grading function:")
print(df)

19. Check unique values in a column

unique_categories = df['Category'].unique()
print("Unique values in 'Category':")
print(unique_categories)

20. Filter by multiple conditions (Sales > 2000 and Quantity > 1)

filtered_df = df[(df['Sales'] > 2000) & (df['Quantity'] > 1)]
print("Products with Sales > 2000 and Quantity > 1:")
print(filtered_df)

21. Sort data by Sales descending

sorted_df = df.sort_values(by='Sales', ascending=False)
print("Data sorted by Sales (high to low):")
print(sorted_df)

These twenty independent scripts cover the full workflow from generating a sample e‑commerce sales dataset to performing common data‑cleaning, transformation, and analysis tasks. Ensure your Python environment has pandas and openpyxl installed before executing the examples.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

e-commerce data analysis data cleaning Excel Pandas

Written by

Test Development Learning Exchange

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.