Fundamentals 13 min read

Master Pandas: Essential Data Preprocessing, Merging, and Analysis Techniques

This article provides a comprehensive, step‑by‑step guide to using pandas for data preprocessing, merging, indexing, sorting, grouping, extraction, filtering, sampling, and statistical analysis, complete with clear code examples and visual results to help readers master essential data manipulation in Python.

Python Crawling & Data Mining

May 21, 2020

Master Pandas: Essential Data Preprocessing, Merging, and Analysis Techniques

5. Data Preprocessing

First create a data2 dataset:

data2=pd.DataFrame({
    "id":np.arange(102,105),
    "profit":[1,10,2]
})
data2

Output:

Then create a data3 dataset:

data3=pd.DataFrame({
    "id":np.arange(111,113),
    "money":[106,51]
})
data3

Output:

5.1 Data Merging

Using merge:

DataFrame.merge(self, right, how='inner', on=None)

right: the object to merge

on: column or index level name present in both DataFrames

how: merge type – left, right, inner (default), outer

Examples:

data_new=pd.merge(data,data2,on='id',how='inner')  # default inner join
data_new=pd.merge(data,data2,on='id',how='outer')  # outer join, NaN where missing
data.merge(data2,on='id',how='inner')

Output:

Using append: data.append(data2) # append rows Output:

Using join (requires suffixes to avoid column overlap):

data.join(data2, lsuffix='_data', rsuffix='_data2')

Output:

Using concat:

pandas.concat(objs, axis=0, ignore_index=False, keys=None)

Example concatenating three datasets horizontally:

data_new=pd.concat([data,data2,data3],axis=1,keys=['data','data2','data3'])
data_new

Output:

5.2 Setting Index Columns

data.set_index("id")  # set 'id' as index

Output:

data.reset_index(drop=True)  # reset index without adding old index as column

Output:

5.3 Sorting by Specific Columns

data.sort_index()

Sorting by money column:

data.sort_values(by="money", ascending=True)

Output:

5.4 Conditional Classification

data['level'] = np.where(data['money']>=10, 'high', 'low')
data

Output:

5.5 Group Tagging

data.loc[(data['level']=="high") & (data['origin']=="China"), "sign"] = "棒"
data

Output:

5.6 Splitting Date Columns

data_split = pd.DataFrame((x.split('-') for x in data['date']), index=data.index,
    columns=['year','month','day'])
data_split

Output:

Combine with original data:

pd.concat([data, data_split], axis=1)

Output:

6. Data Extraction

Using loc (label‑based) and iloc (position‑based) to retrieve rows, columns, and specific values.

# Single row by label
row = data.loc[6]
# Single row by position
row = data.iloc[6]
# Slice rows 2 to 4 (inclusive) by label
subset = data.loc[2:4]
# Slice rows 2 to 4 by position
subset = data.iloc[2:6]
# Retrieve a single value
value = data.loc[6, 'id']

7. Data Filtering

Logical AND, OR, NOT with loc:

# AND condition
filtered = data.loc[(data['origin']=='China') & (data['money']<35), ['id','date','money','product','department','origin']]
# OR condition
filtered = data.loc[(data['origin']=='China') | (data['money']<35), ['id','date','money','product','department','origin']]
# NOT condition (origin is China and money not less than 10)
filtered = data.loc[(data['origin']=='China') != (data['money']<10), ['id','date','money','product','department','origin']]

Using query for concise filtering:

data.query('department == "饮料"')
data.query('department == ["饮料","零食"]')

8. Data Aggregation

# Count rows per department
counts = data.groupby('department').count()
# Count ids per department
id_counts = data.groupby('department')['id'].count()
# Group by department and origin
grouped = data.groupby(['department','origin']).count()
# Aggregate money: count, sum, mean
agg = data.groupby('department')['money'].agg([len, np.sum, np.mean])

9. Statistical Summary

# Sample with replacement and weights
sample = data.sample(3, replace=True, weights=[0.1,0.1,0.2,0.2,0.1,0.1,0.1,0.1,0,0])
# Descriptive statistics (rounded, transposed)
stats = data.describe().round(2).T
# Standard deviation of 'money'
std_money = data['money'].std()
# Covariance matrix
cov_matrix = data.cov()
# Correlation matrix
corr_matrix = data.corr()

These examples demonstrate the core pandas operations for preparing, transforming, extracting, and summarizing data in Python.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Python data analysis Data preprocessing data merging

Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.