Master Data Cleaning in Pandas: 20 Essential Scripts for E‑Commerce Sales Analysis
This guide walks you through creating a sample e‑commerce sales Excel file with Pandas and then demonstrates twenty practical data‑cleaning and transformation scripts—including handling missing values, renaming columns, filtering, grouping, and exporting—so you can efficiently prepare sales data for analysis.
This article provides a complete, step‑by‑step tutorial for generating a sample e‑commerce sales dataset and performing a wide range of data‑cleaning operations using Python's pandas library.
1. Create sales.xlsx from scratch
import pandas as pd
from datetime import datetime
# Define sample data
data = {
'OrderID': [1, 2, 3, 4, 5],
'ProductName': ['Phone', 'Laptop', 'T‑shirt', 'Headphones', 'Jeans'],
'Category': ['Electronics', 'Electronics', 'Apparel', 'Electronics', 'Apparel'],
'Sales': [2999, 5999, 99, 199, 299],
'Quantity': [1, 1, 2, 1, None], # intentional missing value
'Date': [datetime(2025, 1, 1), datetime(2025, 1, 2), None,
datetime(2025, 1, 3), datetime(2025, 1, 4)] # intentional missing value
}
# Create DataFrame
df = pd.DataFrame(data)
# Save to Excel (no index, custom sheet name)
df.to_excel('sales.xlsx', index=False, sheet_name='Sales Data')
print("Excel file 'sales.xlsx' created successfully.")Before running the script, ensure pandas and openpyxl are installed:
pip install pandas openpyxl2. Load the Excel file
import pandas as pd
# Read the Excel file
df = pd.read_excel('sales.xlsx')
print("Original data:")
print(df)3. Inspect basic information
print("Dataframe info:")
print(df.info())4. View the first few rows
print("First 5 rows:")
print(df.head())5. Drop columns with more than half missing values
half_count = len(df) / 2
df_cleaned = df.dropna(thresh=half_count, axis=1)
print("After dropping columns with >50% missing values:")
print(df_cleaned)6. Remove rows containing any missing values
df_cleaned = df.dropna()
print("After dropping rows with any missing values:")
print(df_cleaned)7. Fill missing values with specific constants
df_filled = df.fillna({'Quantity': 0, 'Date': 'Unknown'})
print("After filling missing values with constants:")
print(df_filled)8. Fill numeric missing values with column mean
df['Quantity'].fillna(df['Quantity'].mean(), inplace=True)
print("After filling numeric missing values with mean:")
print(df)9. Remove duplicate rows
df_unique = df.drop_duplicates()
print("After removing duplicate rows:")
print(df_unique)10. Rename columns
df.rename(columns={'Sales': 'Sales', 'Quantity': 'Quantity', 'Date': 'Date'}, inplace=True)
# Example of renaming Chinese column names to English
# df.rename(columns={'销售额': 'Sales', '数量': 'Quantity', '日期': 'Date'}, inplace=True)
print("After renaming columns:")
print(df)11. Convert the date column to proper datetime type
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
print("After converting 'Date' to datetime:")
print(df)12. Filter rows by a condition (Sales ≥ 2000)
filtered_df = df[df['Sales'] >= 2000]
print("Products with Sales ≥ 2000:")
print(filtered_df)13. Group by category and compute average Sales and Quantity
grouped = df.groupby('Category').mean()
print("Average Sales and Quantity by Category:")
print(grouped)14. Create a new column (Total Amount = Sales × Quantity)
df['Total Amount'] = df['Sales'] * df['Quantity']
print("After adding 'Total Amount' column:")
print(df)15. Delete the newly created column
df.drop(columns=['Total Amount'], inplace=True)
print("After dropping 'Total Amount' column:")
print(df)16. Reset the index
df_reset = df.reset_index(drop=True)
print("After resetting index:")
print(df_reset)17. Replace specific categorical values
# Replace Chinese category name with English equivalent
df['Category'] = df['Category'].replace('电子产品', 'Electronics')
print("After replacing '电子产品' with 'Electronics':")
print(df)18. Apply a custom function to grade Sales
def sales_grade(sales):
if sales >= 5000:
return 'High'
elif sales >= 2000:
return 'Medium'
else:
return 'Low'
df['Sales Grade'] = df['Sales'].apply(sales_grade)
print("After applying sales grading function:")
print(df)19. Check unique values in a column
unique_categories = df['Category'].unique()
print("Unique values in 'Category':")
print(unique_categories)20. Filter by multiple conditions (Sales > 2000 and Quantity > 1)
filtered_df = df[(df['Sales'] > 2000) & (df['Quantity'] > 1)]
print("Products with Sales > 2000 and Quantity > 1:")
print(filtered_df)21. Sort data by Sales descending
sorted_df = df.sort_values(by='Sales', ascending=False)
print("Data sorted by Sales (high to low):")
print(sorted_df)These twenty independent scripts cover the full workflow from generating a sample e‑commerce sales dataset to performing common data‑cleaning, transformation, and analysis tasks. Ensure your Python environment has pandas and openpyxl installed before executing the examples.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
