Comprehensive Guide to Data Processing, Cleaning, and Visualization with Pandas
This tutorial walks through using pandas to import, review, preprocess (integration, cleaning, transformation, missing- and duplicate-value handling, outlier detection, and sampling), analyze (descriptive statistics and correlation), and visualize e‑commerce data in Python, with practical code examples for each step.
1. Import Required Libraries
First, import the essential Python libraries for data handling and visualization:
<code>import numpy as np
import pandas as pd
import matplotlib.pyplot as plt</code>
2. Load Data
Data can be generated manually or loaded from external sources such as Excel files. The example uses pd.read_excel('F:/tianmao data/computer data.xlsx') to read e‑commerce data scraped from Tmall.
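Since the Excel file from the article is not available to readers, a minimal sketch can either point `pd.read_excel` at your own copy or build a small frame by hand with the same kinds of columns (the sample values below are illustrative, not from the real dataset):

```python
import pandas as pd

# Load the scraped Tmall data from an Excel file (path as given in the
# article; adjust it to wherever your copy lives):
# data = pd.read_excel('F:/tianmao data/computer data.xlsx')

# Self-contained stand-in with the same kinds of columns; the values are
# made up for illustration
data = pd.DataFrame({
    '现价': [4999.0, 5299.0, 3899.0],
    '月销量': [120, 85, 300],
    '收藏': ['1200人收藏', '860人收藏', '5000人收藏'],
})
print(data.shape)
```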
3. Data Review
After loading, inspect the dataset to understand its structure, types, and size using commands like data.head(), data.tail(), data.shape, data.dtypes, data.info(), and column‑specific queries.
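A quick sketch of those inspection calls on a toy frame (the column names mirror the article's examples):

```python
import pandas as pd

data = pd.DataFrame({'现价': [4999.0, 5299.0], '月销量': [120, 85]})

print(data.head())    # first few rows
print(data.tail())    # last few rows
print(data.shape)     # (rows, columns)
print(data.dtypes)    # per-column types
data.info()           # summary: index, columns, non-null counts, memory
```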
4. Data Preprocessing
4.1 Data Integration
Combine multiple DataFrames with pandas.merge, pandas.concat, DataFrame.join, DataFrame.combine_first, or DataFrame.update, depending on the required join type and key columns.
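A small sketch of the two most common cases, a key-based merge and a vertical concat (the `id` key and sample frames are assumptions for illustration):

```python
import pandas as pd

prices = pd.DataFrame({'id': [1, 2, 3], '现价': [4999, 5299, 3899]})
sales = pd.DataFrame({'id': [1, 2, 4], '月销量': [120, 85, 40]})

# Key-based join: keep only ids present in both frames
merged = pd.merge(prices, sales, on='id', how='inner')

# Vertical stacking of frames that share the same columns
more_prices = pd.DataFrame({'id': [4], '现价': [2999]})
stacked = pd.concat([prices, more_prices], ignore_index=True)
```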
4.2 Data Cleaning
String manipulation extracts numeric values from columns such as "收藏" (favourites) and "库存" (stock) and splits date strings. Convert data types with astype(np.float64) and parse dates with pd.to_datetime.
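A sketch of that cleaning step, assuming the raw columns look roughly like the strings below (the exact formats in the scraped data may differ):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    '收藏': ['1200人收藏', '860人收藏'],
    '库存': ['库存 350 件', '库存 90 件'],
    'date': ['2023-01-05', '2023-02-10'],
})

# Pull the digits out of the text columns, then convert to float
data['收藏'] = data['收藏'].str.extract(r'(\d+)', expand=False).astype(np.float64)
data['库存'] = data['库存'].str.extract(r'(\d+)', expand=False).astype(np.float64)

# Parse the date strings into proper datetimes
data['date'] = pd.to_datetime(data['date'])
```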
4.3 Missing Value Handling
Identify missing values with data.isnull().any(), locate rows containing NaNs, drop them with data.dropna(), or fill them with statistical values such as the mean or median via data['月销量'].fillna(data['月销量'].mean()).
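Those three options side by side on a toy sales column:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'月销量': [120.0, np.nan, 300.0, np.nan]})

print(data.isnull().any())    # which columns contain any NaN

dropped = data.dropna()       # option 1: discard incomplete rows

# option 2: fill NaNs with the column mean (here (120 + 300) / 2 = 210)
data['月销量'] = data['月销量'].fillna(data['月销量'].mean())
```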
4.4 Duplicate Handling
Detect duplicates with data.duplicated() and remove them with data.drop_duplicates().
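For example, with one fully repeated row:

```python
import pandas as pd

data = pd.DataFrame({'现价': [4999, 4999, 3899], '月销量': [120, 120, 300]})

print(data.duplicated())        # True only for the repeated second row
deduped = data.drop_duplicates()
```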
4.5 Outlier Detection
Apply the 3‑sigma rule or box‑plot method to flag outliers, creating a boolean column (e.g., data['three_sigma']) and filtering normal data accordingly.
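A minimal sketch of the 3‑sigma rule; the sample of 19 typical values plus one extreme is contrived so the rule actually fires (with only a handful of rows, one outlier inflates the standard deviation enough to hide itself):

```python
import pandas as pd

# 19 typical observations plus one extreme value
data = pd.DataFrame({'月销量': [100.0] * 19 + [1000.0]})

mean, std = data['月销量'].mean(), data['月销量'].std()

# Flag rows whose value lies more than 3 standard deviations from the mean
data['three_sigma'] = (data['月销量'] - mean).abs() > 3 * std

# Keep only the rows flagged as normal
normal = data[~data['three_sigma']]
```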
4.6 Sampling and Feature Reduction
Randomly sample rows with data.sample(n=100) to reduce dataset size. Feature reduction techniques such as discretization (pd.cut, pd.qcut) and further dimensionality reduction are mentioned.
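A sketch of sampling plus both discretization variants (the price range is made up; `random_state` is added only so the draw is repeatable):

```python
import pandas as pd

data = pd.DataFrame({'现价': range(1000, 2000)})

# Draw 100 random rows; random_state makes the draw repeatable
sampled = data.sample(n=100, random_state=0)

# Discretize the continuous price column
data['price_band'] = pd.cut(data['现价'], bins=4)   # equal-width bins
data['price_rank'] = pd.qcut(data['现价'], q=4)     # equal-frequency bins
```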
5. Data Analysis
5.1 Descriptive Statistics
Define functions to compute central tendency (mean, median, mode, quartiles) and dispersion (variance, standard deviation, range, IQR). Also calculate skewness and kurtosis to assess distribution shape.
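The statistics above map directly onto built-in pandas methods; a sketch on a small series with one extreme value (the numbers are illustrative):

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 12.0, 14.0, 50.0])

central = {'mean': s.mean(), 'median': s.median(), 'mode': s.mode()[0]}
dispersion = {'variance': s.var(), 'std': s.std(),
              'range': s.max() - s.min(),
              'IQR': s.quantile(0.75) - s.quantile(0.25)}
shape = {'skewness': s.skew(), 'kurtosis': s.kurt()}
```

Note how the extreme value 50 pulls the mean well above the median, which the positive skewness confirms.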
5.2 Correlation Analysis
Use data.corr() for the Pearson correlation matrix, data.cov() for the covariance matrix, and data.corrwith() to examine relationships between the target variable and all other features.
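A sketch on a toy frame where price and monthly sales are perfectly anti-correlated (the values are fabricated to make the result obvious):

```python
import pandas as pd

data = pd.DataFrame({
    '现价': [100.0, 200.0, 300.0, 400.0],
    '月销量': [400.0, 300.0, 200.0, 100.0],
})

corr = data.corr()                           # Pearson correlation matrix
cov = data.cov()                             # covariance matrix
target_corr = data.corrwith(data['月销量'])  # each column vs the target
```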
6. Data Visualization
6.1 Exploratory Plots
Generate histograms for each numeric column, box‑plots to reveal outliers, and scatter plots to visualize relationships (e.g., price vs. monthly sales, discount ratio vs. sales).
<code>from pylab import *
plt.hist(data['月销量'])
plt.boxplot(data['月销量'])
plt.scatter(data1['现价'], data1['月销量'])
plt.show()</code>6.2 Result Presentation
Bar charts illustrate categorical comparisons, while scatter plots highlight trends between variables such as price, discount, and sales. The visualizations confirm that the data are non‑normal, contain significant variance, and benefit from thorough preprocessing.
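For the categorical comparisons, a bar-chart sketch along these lines would fit; the brand labels and counts are invented for illustration, and the `Agg` backend plus `savefig` are used only so the script runs without a display (use `plt.show()` interactively):

```python
import matplotlib
matplotlib.use('Agg')            # file-based backend; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-brand monthly sales, not from the real dataset
counts = pd.Series({'brand A': 120, 'brand B': 85, 'brand C': 60})

bars = plt.bar(counts.index, counts.values)
plt.xlabel('brand')
plt.ylabel('monthly sales')
plt.savefig('brand_sales.png')   # or plt.show() in an interactive session
```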
Overall, the guide demonstrates a complete workflow from raw e‑commerce data to cleaned, transformed, analyzed, and visualized insights using pandas and matplotlib.