Master Pandas: From Data Import to Visualization in Python
This tutorial walks through using pandas for data import, inspection, preprocessing, analysis, and visualization, providing practical code snippets and explanations to help readers understand essential data manipulation techniques in Python.
Preface: This is an initial draft of personal insights on using pandas for data processing and analysis, which may contain errors; feedback is welcome.
First, import the required libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
The article covers five aspects of data manipulation with pandas, numbered below.
1. Import Data
Generate Data Manually
Pandas provides two main data structures: Series, a one-dimensional labeled array, and DataFrame, a two-dimensional labeled table. Both can be built directly from Python lists, dicts, or NumPy arrays.
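As a minimal sketch of the two structures, the following builds a Series and a DataFrame from plain Python objects. The values and the column names (store, price, monthly_sales) are made up for illustration and are not taken from the article's dataset.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
prices = pd.Series([4999.0, 5299.0, 3899.0], index=["a", "b", "c"], name="price")

# A DataFrame is a two-dimensional table whose columns share one row index.
# Column names and values here are hypothetical.
laptops = pd.DataFrame(
    {
        "store": ["Store A", "Store B", "Store C"],
        "price": [4999.0, 5299.0, 3899.0],
        "monthly_sales": [120, 85, 310],
    }
)

print(prices["b"])             # label-based access on a Series
print(laptops.shape)           # (rows, columns)
print(laptops["price"].dtype)  # dtype of a single column
```

Passing a dict of equal-length lists is usually the most readable way to build a small DataFrame by hand, because each key becomes a column name.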
pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Load External Data
In practice, loading external data is more common. Pandas offers functions to read tabular data into a DataFrame, such as pd.read_excel() and pd.read_csv(). Example:
data = pd.read_excel('F:/tianmao data/computer data.xlsx')
These functions have many parameters; you can view the help with help(pd.read_excel).
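To illustrate a few of those parameters, here is a small sketch using pd.read_csv. It assumes a hypothetical CSV held in memory (so the example runs without any file on disk); the column names and the "missing" marker are invented for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """store,price,monthly_sales,note
Store A,4999,120,ok
Store B,5299,missing,ok
Store C,3899,310,ok
"""

df = pd.read_csv(
    io.StringIO(csv_text),                        # a file path would normally go here
    usecols=["store", "price", "monthly_sales"],  # load only the needed columns
    dtype={"price": "float64"},                   # force a column type up front
    na_values=["missing"],                        # treat this marker as NaN
)

print(df.dtypes)
print(df["monthly_sales"].isna().sum())  # number of missing sales values
```

Declaring missing-value markers and dtypes at load time tends to save cleaning work later, since the columns arrive numeric instead of as strings.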
2. Review Data
After loading, inspect the data to understand its structure, types, and size.
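A quick sketch of such an inspection on a small made-up frame (the column names are illustrative only, not the ones in the article's dataset):

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame(
    {
        "store": ["A", "B", "C", "D", "E", "F"],
        "price": [4999.0, 5299.0, 3899.0, 6999.0, 4599.0, 5899.0],
        "monthly_sales": [120, 85, 310, 40, 150, 95],
    }
)

print(df.head(3))          # first 3 rows
print(df.shape)            # (number of rows, number of columns)
print(df.dtypes)           # one dtype per column
print(df.loc[0, "price"])  # label-based selection
print(df.iloc[0, 1])       # position-based selection of the same cell

# sort_values returns a new DataFrame, so the result is assigned.
top = df.sort_values(by="monthly_sales", ascending=False)
print(top.iloc[0]["store"])  # store with the highest monthly sales
```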
First 5 rows: data.head()
Last 5 rows: data.tail()
Shape: data.shape
Data types: data.dtypes
Info: data.info()
Column selection: data['column_name'], or data.loc[] (by label) and data.iloc[] (by position)
Sorting: data = data.sort_values(by=['月销量'], ascending=False), where 月销量 is the monthly sales column
Descriptive statistics: data.describe()
3. Data Preprocessing
(1) Data Integration
Combine datasets using pandas.merge(), pandas.concat(), DataFrame.join(), DataFrame.combine_first(), or DataFrame.update().
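A minimal sketch of the two most common of these, pandas.merge() and pandas.concat(), on hypothetical tables that share a "store" key:

```python
import pandas as pd

# Two hypothetical tables that share a "store" key.
sales = pd.DataFrame({"store": ["A", "B", "C"], "monthly_sales": [120, 85, 310]})
ratings = pd.DataFrame({"store": ["A", "B", "D"], "rating": [4.8, 4.6, 4.9]})

# Inner join keeps only the keys present in both tables (A and B).
inner = pd.merge(sales, ratings, how="inner", on="store")

# Left join keeps every row of `sales`; unmatched ratings become NaN.
left = pd.merge(sales, ratings, how="left", on="store")

# concat stacks frames along an axis; ignore_index renumbers the rows.
more_sales = pd.DataFrame({"store": ["E"], "monthly_sales": [60]})
stacked = pd.concat([sales, more_sales], axis=0, ignore_index=True)

print(inner)
print(left)
print(stacked)
```

As a rule of thumb, merge() combines tables side by side on a key, while concat() appends rows (or columns) of frames that already have the same layout.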
pandas.merge(left, right, how='inner', on=None, ...)
pandas.concat(objs, axis=0, join='outer', ...)
DataFrame.join(other, on=None, how='left', ...)
DataFrame.update(other, join='left', ...)
(2) Data Cleaning
String processing, type conversion, missing‑value handling, duplicate removal, and outlier detection.
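Outlier detection is listed above without a specific method. One common approach, assumed here for illustration and not necessarily the author's, is the 1.5 × IQR rule. The sales figures are made up.

```python
import pandas as pd

# Made-up sales figures with one obviously extreme value.
sales = pd.Series([120, 85, 310, 40, 150, 95, 5000], name="monthly_sales")

q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (sales < lower) | (sales > upper)

print(sales[is_outlier])      # values flagged as outliers
cleaned = sales[~is_outlier]  # alternatively, cap them with sales.clip(lower, upper)
```

Whether to drop, cap, or keep flagged values depends on the analysis; the rule only marks candidates.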
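# Column names in the dataset: 当前时间 = current time, 收藏 = favorites, 库存 = stock,
# 天猫积分 = Tmall points, 现价 = current price, 原价 = original price,
# 月销量 = monthly sales, 店铺名称 = store name
# Split and extract numeric parts (raw strings keep \d from being read as an escape)
data['当前时间'] = data['当前时间'].str.split(' ', expand=True)[0]
data['收藏'] = data['收藏'].str.extract(r'(\d+)', expand=False)
data['库存'] = data['库存'].str.extract(r'(\d+)', expand=False)
data['天猫积分'] = data['天猫积分'].str.extract(r'(\d+)', expand=False)
# For price ranges written as 'low-high', keep the lower bound
data['现价'] = data['现价'].str.split('-', expand=True)[0]
data['原价'] = data['原价'].str.split('-', expand=True)[0]
# Convert types
data['现价'] = data['现价'].astype(np.float64)
data['原价'] = data['原价'].astype(np.float64)
data['收藏'] = data['收藏'].astype(np.float64)
data['库存'] = data['库存'].astype(np.float64)
data['天猫积分'] = data['天猫积分'].astype(np.float64)
# Missing values: which columns contain any?
data.isnull().any()
# Option 1: drop rows where '月销量' is missing (dropna returns a new DataFrame, so assign it)
data = data.dropna(subset=['月销量'])
# Option 2 (instead of dropping): fill missing '月销量' with the column mean
data['月销量'] = data['月销量'].fillna(data['月销量'].mean())
# Remove duplicates based on store name, keeping the first occurrence
data = data.drop_duplicates(subset=['店铺名称'], keep='first')
(3) Data Transformation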
Apply functions, standardize, and discretize.
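A small worked sketch of these three steps on a toy Series (values chosen for illustration), showing how min-max normalization, z-score standardization, and the two binning functions behave:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 100.0])

# Min-max normalization rescales values into [0, 1].
min_max = (s - s.min()) / (s.max() - s.min())

# Z-score standardization centers on 0 with unit standard deviation.
# Series.std() uses the sample standard deviation (ddof=1) by default.
z = (s - s.mean()) / s.std()

# cut: equal-width bins over the value range.
# qcut: quantile bins holding roughly equal numbers of rows.
width_bins = pd.cut(s, bins=2, labels=["low", "high"])
count_bins = pd.qcut(s, q=2, labels=["low", "high"])

print(min_max.tolist())
print(z.round(3).tolist())
print(width_bins.tolist())
print(count_bins.tolist())
```

With a skewed column, equal-width bins put most rows in the lowest bin, while quantile bins spread the rows evenly; here only the value 100 lands in the "high" equal-width bin.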
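# Function transform: a square root is used here to compress large values
# (the column name 对数_累计评价 means "log of cumulative reviews"; np.log1p would be the logarithmic equivalent)
data['对数_累计评价'] = np.sqrt(data['累计评价'])
# Min-max normalization to [0, 1]
sales = data['月销量']
data['标准化月销量'] = (sales - sales.min()) / (sales.max() - sales.min())
# Z-score standardization
data['Z_月销量'] = (sales - sales.mean()) / sales.std()
# Binning: pd.cut uses equal-width bins, pd.qcut uses quantile (equal-frequency) bins
data['data_cut'] = pd.cut(data['月销量'], bins=10, labels=range(1, 11))
data['data_qcut'] = pd.qcut(data['月销量'], q=10, labels=range(1, 11))
(4) Data Reduction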
Sample rows to reduce size:
data.sample(n=100)
4. Data Analysis
(1) Descriptive Analysis
Calculate central tendency, dispersion, and distribution shape.
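As a sketch of what these three groups of statistics look like on concrete numbers, the following computes them for a small made-up Series with one large value:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 4, 5, 20], name="value")

summary = pd.Series(
    {
        # Central tendency
        "mean": s.mean(),
        "median": s.median(),
        "mode": s.mode().iloc[0],  # mode() returns a Series; take the first mode
        # Dispersion
        "std": s.std(),
        "range": s.max() - s.min(),
        "IQR": s.quantile(0.75) - s.quantile(0.25),
        # Distribution shape
        "skew": s.skew(),  # > 0 indicates a longer right tail
        "kurt": s.kurt(),  # excess kurtosis (about 0 for a normal distribution)
    }
)
print(summary)
```

The mean is pulled well above the median by the single large value, which is also what the positive skew reports.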
def f(x):
    # Central tendency; mode() returns a Series, so take the first mode
    return pd.DataFrame([x.mean(), x.median(), x.mode().iloc[0], x.quantile(0.25), x.quantile(0.75)],
                        index=['mean', 'median', 'mode', 'Q1', 'Q3'])
def k(x):
    # Dispersion
    return pd.DataFrame([x.var(), x.std(), x.max() - x.min(), x.quantile(0.75) - x.quantile(0.25)],
                        index=['var', 'std', 'range', 'IQR'])
def g(x):
    # Distribution shape
    return pd.DataFrame([x.skew(), x.kurt()], index=['skew', 'kurt'])
# Each function takes a numeric Series, e.g. f(data['月销量'])
(2) Correlation Analysis
Use Pearson correlation, covariance, and correlation with a target variable.
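A minimal sketch on made-up columns where the expected result is easy to check by eye: reviews is an exact linear function of sales, and price moves in the opposite direction.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "store": ["A", "B", "C", "D"],
        "sales": [10, 20, 30, 40],
        "reviews": [25, 45, 65, 85],  # 2 * sales + 5
        "price": [900, 700, 650, 400],
    }
)

# Correlation is only defined for numeric columns, so select them first.
num = df.select_dtypes(include="number")

corr_matrix = num.corr()                     # Pearson by default
pair = df["sales"].corr(df["reviews"])       # correlation between two columns
against_target = num.corrwith(df["sales"])   # every numeric column vs. the target

print(corr_matrix.round(3))
print(pair)
print(against_target.round(3))
```

Pearson correlation captures linear association only; for monotonic but non-linear relationships, corr(method='spearman') is a common alternative.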
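num = data.select_dtypes(include='number')  # correlation is only defined for numeric columns
num.corr()  # Pearson correlation matrix
num.cov()  # covariance matrix
data['月销量'].corr(data['累计评价'])  # correlation between two columns
num.corrwith(data['月销量'])  # correlation of every numeric column with the target
5. Data Visualization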
Exploratory Plots
Histograms, boxplots, bar charts, and scatter plots illustrate distributions and relationships.
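A self-contained sketch of three of these plots side by side. It uses synthetic, right-skewed data as a stand-in for real sales figures; the column names and distribution parameters are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic data: uniform prices and log-normally distributed sales.
df = pd.DataFrame(
    {
        "price": rng.uniform(3000, 9000, size=200),
        "monthly_sales": rng.lognormal(mean=4.0, sigma=1.0, size=200),
    }
)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

axes[0].hist(df["monthly_sales"], bins=30)
axes[0].set_title("Histogram of monthly sales")

axes[1].boxplot(df["monthly_sales"])
axes[1].set_title("Boxplot of monthly sales")

axes[2].scatter(df["price"], df["monthly_sales"], s=10)
axes[2].set_xlabel("price")
axes[2].set_ylabel("monthly sales")
axes[2].set_title("Price vs. monthly sales")

fig.tight_layout()
# Call plt.show() in an interactive session, or fig.savefig(...) to write a file.
```

Drawing on explicit Axes objects keeps each plot in its own panel instead of layering them onto one figure.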
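# Histogram example
plt.figure()
plt.hist(data['月销量'].dropna())
# Boxplot example
plt.figure()
plt.boxplot(data['月销量'].dropna())
# Scatter plot example (the original snippet reads from data1, which is never defined here; data is assumed)
plt.figure()
plt.scatter(data['现价'], data['月销量'])
plt.show()
These visualizations reveal that the data does not follow a normal distribution and contains significant outliers, highlighting the importance of preprocessing.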
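The visual impression can be backed with numbers. The sketch below compares skewness, excess kurtosis, and the share of points above the upper 1.5 × IQR fence for a roughly normal sample and a right-skewed one. Both samples are synthetic, and the distribution parameters are assumptions chosen only to show the contrast.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
normal_like = pd.Series(rng.normal(loc=100, scale=15, size=10_000))
skewed = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=10_000))

for name, s in [("normal-like", normal_like), ("right-skewed", skewed)]:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    upper = q3 + 1.5 * (q3 - q1)      # upper boxplot fence
    share_above = (s > upper).mean()  # fraction of points flagged as high outliers
    print(f"{name}: skew={s.skew():.2f}, kurt={s.kurt():.2f}, "
          f"share above upper fence={share_above:.1%}")
```

Skewness and excess kurtosis close to zero are consistent with a normal distribution; large positive values, together with many points beyond the fence, point to the kind of long right tail described above.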
MaGe Linux Operations
