Comprehensive Guide to Data Processing, Cleaning, and Visualization with Pandas
This tutorial walks through using pandas to import, review, preprocess (integration, cleaning, transformation, missing- and duplicate-value handling, outlier detection, and sampling), analyze (descriptive statistics and correlation), and visualize e‑commerce data in Python, with practical code examples for each step.
1. Import Required Libraries
First, import the essential Python libraries for data handling and visualization:
<code>import numpy as np
import pandas as pd
import matplotlib.pyplot as plt</code>
2. Load Data
Data can be generated manually or loaded from external sources such as Excel files. The example uses pd.read_excel('F:/tianmao data/computer data.xlsx') to read e‑commerce data scraped from Tmall.
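Since the Excel file from the article is not available to readers, a minimal sketch can either point `pd.read_excel` at your own copy or build a small frame by hand with the same kinds of columns (the sample values below are illustrative, not from the real dataset):

```python
import pandas as pd

# Load the scraped Tmall data from an Excel file (path as given in the
# article; adjust it to wherever your copy lives):
# data = pd.read_excel('F:/tianmao data/computer data.xlsx')

# Self-contained stand-in with the same kinds of columns; the values are
# made up for illustration
data = pd.DataFrame({
    '现价': [4999.0, 5299.0, 3899.0],
    '月销量': [120, 85, 300],
    '收藏': ['1200人收藏', '860人收藏', '5000人收藏'],
})
print(data.shape)
```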
3. Data Review
After loading, inspect the dataset to understand its structure, types, and size using commands like data.head(), data.tail(), data.shape, data.dtypes, data.info(), and column‑specific queries.
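A quick sketch of those inspection calls on a toy frame (the column names mirror the article's examples):

```python
import pandas as pd

data = pd.DataFrame({'现价': [4999.0, 5299.0], '月销量': [120, 85]})

print(data.head())    # first few rows
print(data.tail())    # last few rows
print(data.shape)     # (rows, columns)
print(data.dtypes)    # per-column types
data.info()           # summary: index, columns, non-null counts, memory
```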
4. Data Preprocessing
4.1 Data Integration
Combine multiple DataFrames with pandas.merge, pandas.concat, DataFrame.join, DataFrame.combine_first, or DataFrame.update, depending on the required join type and key columns.
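A small sketch of the two most common cases, a key-based merge and a vertical concat (the `id` key and sample frames are assumptions for illustration):

```python
import pandas as pd

prices = pd.DataFrame({'id': [1, 2, 3], '现价': [4999, 5299, 3899]})
sales = pd.DataFrame({'id': [1, 2, 4], '月销量': [120, 85, 40]})

# Key-based join: keep only ids present in both frames
merged = pd.merge(prices, sales, on='id', how='inner')

# Vertical stacking of frames that share the same columns
more_prices = pd.DataFrame({'id': [4], '现价': [2999]})
stacked = pd.concat([prices, more_prices], ignore_index=True)
```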
4.2 Data Cleaning
String manipulation extracts numeric values from columns such as "收藏" (favourites) and "库存" (stock) and splits date strings. Convert data types with astype(np.float64) and parse dates with pd.to_datetime.
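A sketch of that cleaning step, assuming the raw columns look roughly like the strings below (the exact formats in the scraped data may differ):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    '收藏': ['1200人收藏', '860人收藏'],
    '库存': ['库存 350 件', '库存 90 件'],
    'date': ['2023-01-05', '2023-02-10'],
})

# Pull the digits out of the text columns, then convert to float
data['收藏'] = data['收藏'].str.extract(r'(\d+)', expand=False).astype(np.float64)
data['库存'] = data['库存'].str.extract(r'(\d+)', expand=False).astype(np.float64)

# Parse the date strings into proper datetimes
data['date'] = pd.to_datetime(data['date'])
```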
4.3 Missing Value Handling
Identify missing values with data.isnull().any(), locate rows containing NaNs, drop them with data.dropna(), or fill them with statistical values such as the mean or median via data['月销量'].fillna(data['月销量'].mean()).
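Those three options side by side on a toy sales column:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'月销量': [120.0, np.nan, 300.0, np.nan]})

print(data.isnull().any())    # which columns contain any NaN

dropped = data.dropna()       # option 1: discard incomplete rows

# option 2: fill NaNs with the column mean (here (120 + 300) / 2 = 210)
data['月销量'] = data['月销量'].fillna(data['月销量'].mean())
```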
4.4 Duplicate Handling
Detect duplicates with data.duplicated() and remove them with data.drop_duplicates().
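For example, with one fully repeated row:

```python
import pandas as pd

data = pd.DataFrame({'现价': [4999, 4999, 3899], '月销量': [120, 120, 300]})

print(data.duplicated())        # True only for the repeated second row
deduped = data.drop_duplicates()
```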
4.5 Outlier Detection
Apply the 3‑sigma rule or box‑plot method to flag outliers, creating a boolean column (e.g., data['three_sigma']) and filtering normal data accordingly.
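A minimal sketch of the 3‑sigma rule; the sample of 19 typical values plus one extreme is contrived so the rule actually fires (with only a handful of rows, one outlier inflates the standard deviation enough to hide itself):

```python
import pandas as pd

# 19 typical observations plus one extreme value
data = pd.DataFrame({'月销量': [100.0] * 19 + [1000.0]})

mean, std = data['月销量'].mean(), data['月销量'].std()

# Flag rows whose value lies more than 3 standard deviations from the mean
data['three_sigma'] = (data['月销量'] - mean).abs() > 3 * std

# Keep only the rows flagged as normal
normal = data[~data['three_sigma']]
```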
4.6 Sampling and Feature Reduction
Randomly sample rows with data.sample(n=100) to reduce dataset size. Feature reduction techniques such as discretization (pd.cut, pd.qcut) and further dimensionality reduction are mentioned.
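A sketch of sampling plus both discretization variants (the price range is made up; `random_state` is added only so the draw is repeatable):

```python
import pandas as pd

data = pd.DataFrame({'现价': range(1000, 2000)})

# Draw 100 random rows; random_state makes the draw repeatable
sampled = data.sample(n=100, random_state=0)

# Discretize the continuous price column
data['price_band'] = pd.cut(data['现价'], bins=4)   # equal-width bins
data['price_rank'] = pd.qcut(data['现价'], q=4)     # equal-frequency bins
```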
5. Data Analysis
5.1 Descriptive Statistics
Define functions to compute central tendency (mean, median, mode, quartiles) and dispersion (variance, standard deviation, range, IQR). Also calculate skewness and kurtosis to assess distribution shape.
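The statistics above map directly onto built-in pandas methods; a sketch on a small series with one extreme value (the numbers are illustrative):

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 12.0, 14.0, 50.0])

central = {'mean': s.mean(), 'median': s.median(), 'mode': s.mode()[0]}
dispersion = {'variance': s.var(), 'std': s.std(),
              'range': s.max() - s.min(),
              'IQR': s.quantile(0.75) - s.quantile(0.25)}
shape = {'skewness': s.skew(), 'kurtosis': s.kurt()}
```

Note how the extreme value 50 pulls the mean well above the median, which the positive skewness confirms.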
5.2 Correlation Analysis
Use data.corr() for the Pearson correlation matrix, data.cov() for the covariance matrix, and data.corrwith() to examine relationships between the target variable and all other features.
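A sketch on a toy frame where price and monthly sales are perfectly anti-correlated (the values are fabricated to make the result obvious):

```python
import pandas as pd

data = pd.DataFrame({
    '现价': [100.0, 200.0, 300.0, 400.0],
    '月销量': [400.0, 300.0, 200.0, 100.0],
})

corr = data.corr()                           # Pearson correlation matrix
cov = data.cov()                             # covariance matrix
target_corr = data.corrwith(data['月销量'])  # each column vs the target
```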
6. Data Visualization
6.1 Exploratory Plots
Generate histograms for each numeric column, box‑plots to reveal outliers, and scatter plots to visualize relationships (e.g., price vs. monthly sales, discount ratio vs. sales).
<code>from pylab import *
plt.hist(data['月销量'])
plt.boxplot(data['月销量'])
plt.scatter(data1['现价'], data1['月销量'])
plt.show()</code>6.2 Result Presentation
Bar charts illustrate categorical comparisons, while scatter plots highlight trends between variables such as price, discount, and sales. The visualizations confirm that the data are non‑normal, contain significant variance, and benefit from thorough preprocessing.
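For the categorical comparisons, a bar-chart sketch along these lines would fit; the brand labels and counts are invented for illustration, and the `Agg` backend plus `savefig` are used only so the script runs without a display (use `plt.show()` interactively):

```python
import matplotlib
matplotlib.use('Agg')            # file-based backend; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-brand monthly sales, not from the real dataset
counts = pd.Series({'brand A': 120, 'brand B': 85, 'brand C': 60})

bars = plt.bar(counts.index, counts.values)
plt.xlabel('brand')
plt.ylabel('monthly sales')
plt.savefig('brand_sales.png')   # or plt.show() in an interactive session
```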
Overall, the guide demonstrates a complete workflow from raw e‑commerce data to cleaned, transformed, analyzed, and visualized insights using pandas and matplotlib.