Master Pandas: From Data Import to Visualization in Python

This tutorial walks through using pandas for data import, inspection, preprocessing, analysis, and visualization, providing practical code snippets and explanations to help readers understand essential data manipulation techniques in Python.

MaGe Linux Operations

Aug 23, 2020

Preface: This is an initial draft of personal insights on using pandas for data processing and analysis, which may contain errors; feedback is welcome.

First, import the required libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

The article covers five aspects of data manipulation with pandas.

1. Import Data

Generate Data Manually

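Pandas provides two main data structures: Series, a one-dimensional labeled array, and DataFrame, a two-dimensional labeled table. Both can be built directly from Python lists, dicts, or NumPy arrays with the constructors below.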

pd.Series(data=None, index=None, dtype=None, name=None, copy=False)
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
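
As a minimal sketch of building both structures by hand (the values, index labels, and column names here are made up for illustration):

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="sales")

# A DataFrame is a two-dimensional table; here it is built from a dict of columns.
df = pd.DataFrame(
    {"price": [9.9, 19.9, 29.9], "stock": [100, 50, 0]},
    index=["a", "b", "c"],
)

print(s["b"])     # label-based access on a Series
print(df.shape)   # (number of rows, number of columns)
```

Passing a dict keyed by column name is usually the most readable way to create a small DataFrame for experiments.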

Load External Data

In practice, loading external data is more common. Pandas offers functions to read tabular data into a DataFrame, such as pd.read_excel() and pd.read_csv(). Example:

data = pd.read_excel('F:/tianmao data/computer data.xlsx')

These functions have many parameters; you can view help with help(pd.read_excel).
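
A few of those parameters come up often. The sketch below reads a small in-memory CSV so it runs without any file on disk; the CSV content and the column names are hypothetical, and a path string could be passed in place of the StringIO object.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = "shop,price,monthly_sales\nA,4999,120\nB,5299,\nC,3899,340\n"

df = pd.read_csv(
    io.StringIO(csv_text),                      # file path or file-like object
    usecols=["shop", "price", "monthly_sales"],  # load only these columns
    dtype={"price": "float64"},                 # force a column type on load
)

print(df.dtypes)
print(df)
```

The empty field in the second row is read as NaN, which is why monthly_sales ends up as a float column.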

2. Review Data

After loading, inspect the data to understand its structure, types, and size.

First 5 rows: data.head()
Last 5 rows: data.tail()
Shape: data.shape
Data types: data.dtypes
Info: data.info()
Column selection: data['column_name'], or data.loc[] (by label) and data.iloc[] (by position)
Sorting by monthly sales ('月销量'), descending: data = data.sort_values(by=['月销量'], ascending=False)
Descriptive statistics: data.describe()
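
To see these inspection calls together, here is a small sketch on a made-up DataFrame (the English column names are placeholders, not the columns of the article's dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "shop": ["A", "B", "C", "D"],
    "price": [4999.0, 5299.0, 3899.0, 6199.0],
    "monthly_sales": [120, 85, 340, 15],
})

print(df.head(2))   # first two rows
print(df.dtypes)    # dtype of each column

# Sort by sales, highest first, then read the top row by position.
top = df.sort_values(by="monthly_sales", ascending=False)
print(top.iloc[0]["shop"])

# Boolean row filter combined with label-based column selection.
expensive = df.loc[df["price"] > 5000, ["shop", "price"]]
print(expensive)

stats = df.describe()   # count, mean, std, min, quartiles, max for numeric columns
print(stats)
```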

3. Data Preprocessing

(1) Data Integration

Combine datasets using pandas.merge(), pandas.concat(), DataFrame.join(), DataFrame.combine_first(), or DataFrame.update().

pandas.merge(left, right, how='inner', on=None, ...)

pandas.concat(objs, axis=0, join='outer', ...)

DataFrame.join(other, on=None, how='left', ...)

DataFrame.update(other, join='left', ...)
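
The differences between these functions are easier to see side by side. The following sketch uses two tiny hypothetical tables; all names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

shops = pd.DataFrame({"shop": ["A", "B", "C"],
                      "city": ["Beijing", "Hangzhou", "Shenzhen"]})
sales = pd.DataFrame({"shop": ["A", "B", "D"],
                      "monthly_sales": [120, 85, 40]})

# merge: SQL-style join on a key column; 'inner' keeps keys present in both tables.
merged = pd.merge(shops, sales, how="inner", on="shop")

# concat: stack tables along an axis; ignore_index renumbers the rows.
stacked = pd.concat([sales, sales], axis=0, ignore_index=True)

# join: align on the index; how='left' keeps every row of the calling frame.
joined = shops.set_index("shop").join(sales.set_index("shop"), how="left")

# combine_first: fill missing values in one frame from another with the same labels.
a = pd.DataFrame({"price": [1.0, np.nan]}, index=["x", "y"])
b = pd.DataFrame({"price": [9.0, 2.0]}, index=["x", "y"])
filled = a.combine_first(b)

# update: overwrite values in place with the non-NA values of the other frame.
a.update(b)

print(merged, stacked, joined, filled, a, sep="\n\n")
```

Note that combine_first only fills gaps and returns a new frame, while update overwrites existing values and modifies the caller in place.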

(2) Data Cleaning

Cleaning covers string processing, type conversion, missing-value handling, duplicate removal, and outlier detection.

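# Split strings and extract the numeric parts.
# Columns: 当前时间 = current time, 收藏 = favorites, 库存 = stock,
# 天猫积分 = Tmall points, 现价 = current price, 原价 = original price
data['当前时间'] = data['当前时间'].str.split(' ', expand=True)[0]  # keep the part before the first space
# Raw strings (r'...') avoid invalid-escape warnings for \d; expand=False returns a Series
data['收藏'] = data['收藏'].str.extract(r'(\d+)', expand=False)
data['库存'] = data['库存'].str.extract(r'(\d+)', expand=False)
data['天猫积分'] = data['天猫积分'].str.extract(r'(\d+)', expand=False)
# For a price range such as 'low-high', keep the lower bound
data['现价'] = data['现价'].str.split('-', expand=True)[0]
data['原价'] = data['原价'].str.split('-', expand=True)[0]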

# Convert types
data['现价'] = data['现价'].astype(np.float64)
data['原价'] = data['原价'].astype(np.float64)
data['收藏'] = data['收藏'].astype(np.float64)
data['库存'] = data['库存'].astype(np.float64)
data['天猫积分'] = data['天猫积分'].astype(np.float64)

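# Missing values: check which columns contain any
data.isnull().any()
# Option 1: drop rows where '月销量' (monthly sales) is missing.
# dropna returns a new DataFrame, so assign the result back.
data = data.dropna(subset=['月销量'])
# Option 2: fill missing '月销量' with the column mean instead
data['月销量'] = data['月销量'].fillna(data['月销量'].mean())

# Remove duplicates based on store name ('店铺名称'), keeping the first occurrence
data = data.drop_duplicates(subset=['店铺名称'], keep='first')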
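
The cleaning steps can be tried end to end on a small example. This sketch assumes hypothetical raw values that resemble scraped product listings; the English column names and the sample strings are invented.

```python
import pandas as pd

# Hypothetical raw data: numbers embedded in text, a price range, a gap, a duplicate.
raw = pd.DataFrame({
    "shop": ["A", "A", "B", "C"],
    "favorites": ["(1200 favorites)", "(1200 favorites)", "(85 favorites)", None],
    "price": ["4999.00-5299.00", "4999.00-5299.00", "3899.00", "6199.00"],
    "monthly_sales": [120.0, 120.0, None, 15.0],
})

clean = raw.copy()

# Keep the first run of digits, then convert the text to numbers.
clean["favorites"] = (
    clean["favorites"].str.extract(r"(\d+)", expand=False).astype("float64")
)

# For a price range, keep the lower bound.
clean["price"] = clean["price"].str.split("-", expand=True)[0].astype("float64")

# Drop duplicate shops first so repeated rows do not skew the mean used below.
clean = clean.drop_duplicates(subset=["shop"], keep="first").reset_index(drop=True)

# Fill the remaining gap in monthly_sales with the column mean.
clean["monthly_sales"] = clean["monthly_sales"].fillna(clean["monthly_sales"].mean())

print(clean)
```

Whether to drop or fill missing values depends on the analysis; mean imputation is only one simple choice.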

(3) Data Transformation

Transformation applies functions to columns, rescales them (normalization and standardization), and discretizes continuous values into bins.

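# Square-root transform to compress the right tail of '累计评价' (cumulative reviews).
# The column name says log ('对数'), but the code uses sqrt; np.log1p would give a log transform.
data['对数_累计评价'] = np.sqrt(data['累计评价'])
# Min-max normalization of '月销量' (monthly sales) to the range [0, 1]
data['标准化月销量'] = data['月销量'].transform(lambda x: (x - x.min())/(x.max() - x.min()))
# Z-score standardization (std() is the sample standard deviation, ddof=1)
data['Z_月销量'] = data['月销量'].transform(lambda x: (x - x.mean())/x.std())
# Binning: cut builds 10 equal-width bins, qcut builds 10 equal-frequency bins.
# qcut raises an error if tied values produce duplicate bin edges.
data['data_cut'] = pd.cut(data['月销量'], bins=10, labels=range(1,11))
data['data_qcut'] = pd.qcut(data['月销量'], q=10, labels=range(1,11))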
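
The contrast between equal-width and equal-frequency binning shows up clearly on skewed data. The sketch below uses ten invented values with one large outlier, and np.log1p as an example of an actual log transform.

```python
import numpy as np
import pandas as pd

# Hypothetical sales figures with one extreme value.
sales = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 1000], name="monthly_sales")

log_sales = np.log1p(sales)                                     # log(1 + x), safe at zero
minmax = (sales - sales.min()) / (sales.max() - sales.min())    # scaled to [0, 1]
zscore = (sales - sales.mean()) / sales.std()                   # mean 0, sample std 1

# Equal-width bins: the outlier stretches the range, so most values share one bin.
equal_width = pd.cut(sales, bins=5, labels=list(range(1, 6)))
# Equal-frequency bins: each bin receives the same number of values.
equal_freq = pd.qcut(sales, q=5, labels=list(range(1, 6)))

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

With an outlier like this, pd.cut places nine of the ten values in the first bin, while pd.qcut puts two values in each bin.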

(4) Data Reduction

Sample rows to reduce size:

data.sample(n=100)
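
A few variations on sampling are worth knowing. This is a sketch on a synthetic table; the column names and the 'group' labels are placeholders.

```python
import pandas as pd

df = pd.DataFrame({"id": range(1000), "group": ["a", "b"] * 500})

# Fixed-size sample; random_state makes the draw reproducible.
fixed = df.sample(n=100, random_state=42)

# Proportional sample: 10% of the rows.
fraction = df.sample(frac=0.1, random_state=42)

# Stratified sample: 5% of the rows within each group.
stratified = df.groupby("group").sample(frac=0.05, random_state=42)

print(len(fixed), len(fraction), len(stratified))
```

Sampling is without replacement by default, so no row appears twice unless replace=True is passed.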

4. Data Analysis

(1) Descriptive Analysis

Calculate central tendency, dispersion, and distribution shape.

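# In each helper, x is a DataFrame of numeric columns; every statistic becomes one row.
def f(x):
    # Central tendency. mode() can return several rows, so keep the first mode per column.
    return pd.DataFrame([x.mean(), x.median(), x.mode().iloc[0], x.quantile(0.25), x.quantile(0.75)],
                        index=['mean','median','mode','Q1','Q3'])

def k(x):
    # Dispersion: variance, standard deviation, range, and interquartile range.
    return pd.DataFrame([x.var(), x.std(), x.max()-x.min(), x.quantile(0.75)-x.quantile(0.25)],
                        index=['var','std','range','IQR'])

def g(x):
    # Distribution shape: skewness and excess kurtosis.
    return pd.DataFrame([x.skew(), x.kurt()], index=['skew','kurt'])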
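
As an alternative to hand-written helpers, most of these statistics can be gathered with built-in calls. The sketch below runs on a small made-up numeric table; it is one possible arrangement, not the article's own method.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 20.0, 20.0, 40.0],
    "monthly_sales": [1.0, 2.0, 2.0, 7.0],
})

# Several statistics at once; each name maps to the DataFrame method of that name.
summary = df.agg(["mean", "median", "var", "std", "skew", "kurt"])

# Quartiles, interquartile range, and range per column.
quartiles = df.quantile([0.25, 0.75])
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]
value_range = df.max() - df.min()

# mode() returns one row per tied mode; take the first.
first_mode = df.mode().iloc[0]

print(summary)
print(iqr, value_range, first_mode, sep="\n\n")
```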

(2) Correlation Analysis

Use Pearson correlation, covariance, and correlation with a target variable.

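# numeric_only=True (available since pandas 1.5) skips text columns,
# which otherwise raise an error in pandas 2.x
data.corr(numeric_only=True)   # Pearson correlation matrix
data.cov(numeric_only=True)    # covariance matrix
data['月销量'].corr(data['累计评价'])   # monthly sales vs. cumulative reviews
data.corrwith(data['月销量'], numeric_only=True)   # every numeric column vs. monthly sales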
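
Here is a small sketch of the same calls on invented data, where price and monthly_sales are constructed to be perfectly negatively correlated so the result is easy to check. Selecting numeric columns first is one way to keep text columns out of the calculation.

```python
import pandas as pd

df = pd.DataFrame({
    "shop": ["A", "B", "C", "D", "E"],
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
    "monthly_sales": [50.0, 40.0, 30.0, 20.0, 10.0],
    "reviews": [5.0, 3.0, 4.0, 1.0, 2.0],
})

numeric = df.select_dtypes(include="number")   # drops the text column 'shop'

corr_matrix = numeric.corr()                   # Pearson by default
cov_matrix = numeric.cov()                     # sample covariance (ddof=1)
r = numeric["price"].corr(numeric["monthly_sales"])

# Correlation of every other column with the target column.
with_target = numeric.drop(columns="monthly_sales").corrwith(numeric["monthly_sales"])

print(corr_matrix)
print(with_target)
```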

5. Data Visualization

Exploratory Plots

Histograms, boxplots, bar charts, and scatter plots illustrate distributions and relationships.

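# Histogram example
plt.figure()
plt.hist(data['月销量'].dropna())
# Boxplot example (missing values are dropped first)
plt.figure()
plt.boxplot(data['月销量'].dropna())
# Scatter plot example: current price vs. monthly sales
plt.figure()
plt.scatter(data['现价'], data['月销量'])
plt.show()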

These visualizations reveal that the data does not follow a normal distribution and contains significant outliers, highlighting the importance of preprocessing.
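
The exploratory plots above can be reproduced on synthetic data. The sketch below generates right-skewed sales figures on purpose, so the numbers are illustrative and are not the article's dataset. The Agg backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view plots on screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(3000, 8000, size=200),
    "monthly_sales": rng.lognormal(mean=4, sigma=1, size=200),  # skewed by construction
})

# One figure with three panels, so the plots do not overwrite each other.
fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

axes[0].hist(df["monthly_sales"], bins=30)
axes[0].set_title("Histogram of monthly_sales")

axes[1].boxplot(df["monthly_sales"].dropna())
axes[1].set_title("Boxplot of monthly_sales")

axes[2].scatter(df["price"], df["monthly_sales"], s=10)
axes[2].set_xlabel("price")
axes[2].set_ylabel("monthly_sales")

fig.tight_layout()
# plt.show() would display the figure when an interactive backend is in use.
```

On skewed data like this, the histogram has a long right tail and the boxplot shows many points beyond the upper whisker.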
