
Master Pandas: From Data Loading to Advanced Manipulation

This comprehensive Pandas tutorial walks you through loading CSV and Excel files, creating Series and DataFrames, performing basic operations, cleaning data, handling missing values, working with hierarchical indexes, grouping, merging, concatenating, and applying time‑series techniques, all illustrated with clear code examples and screenshots.

Python Crawling & Data Mining

Pandas Introduction

This article presents a step‑by‑step Pandas tutorial covering data loading, creation, basic manipulation, cleaning, time‑series handling, and summary operations, using the sample file zlJob.csv as the source dataset.

Generating DataFrames

1.1 Reading Data

Read CSV files with pd.read_csv() and Excel files with pd.read_excel().

import pandas as pd

pd.read_csv('path/to/file.csv')
pd.read_excel('path/to/file.xlsx')  # .xlsx files need an Excel engine such as openpyxl installed
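
As a minimal sketch, the snippet below reads CSV text from an in-memory buffer instead of a file on disk, so it runs without any data file. The column names and values are invented for illustration and are not taken from zlJob.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "name,salary,address\nfrontend,20000,Shanghai\nbackend,25000,Beijing\n"

# read_csv accepts a file path or any file-like object.
jobs = pd.read_csv(io.StringIO(csv_text))

print(jobs.shape)             # (rows, columns)
print(jobs.columns.tolist())  # column names taken from the header line
```

Passing a real path works the same way; the buffer only keeps the example self-contained.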

1.2 Creating Data

Create a Series (one‑dimensional) and a DataFrame (two‑dimensional).

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
# Scalar values ("A", "B", "F") are broadcast to match the four-row columns.
df2 = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo"
})

Basic DataFrame Operations

2.1 Viewing Data

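Here data is the DataFrame loaded from zlJob.csv. Show the first five rows with data.head(), get the shape with data.shape and the column data types with data.dtypes, and check a column for nulls with data['name'].isnull().

data.head()             # first five rows
data.shape              # (number of rows, number of columns)
data.dtypes             # data type of each column
data['name'].isnull()   # boolean Series, True where name is missing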
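
The inspection calls above can be tried on a small made-up frame. The column names and values here are assumptions for illustration; the tutorial's real data comes from zlJob.csv.

```python
import pandas as pd

# Small hypothetical frame with one missing name.
data = pd.DataFrame({
    "name": ["frontend", None, "backend"],
    "salary": [20000, 18000, 25000],
})

print(data.head(2))   # first two rows
print(data.shape)     # (rows, columns)
print(data.dtypes)    # dtype of each column

null_mask = data["name"].isnull()  # True where name is missing
print(null_mask.sum())             # count of missing names
```

Summing the boolean mask is a quick way to count missing values in a column.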

2.2 Row and Column Operations

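Add a new row by turning a Series into a one-row DataFrame and concatenating it with pd.concat() (DataFrame.append() was removed in pandas 2.0), delete a row with data.drop([990]), add a column with data['xx'] = range(len(data)), and delete a column with data.drop('序号', axis=1).

dic = {
    'name': '前端开发',             # front-end development
    'salary': '2万-2.5万',          # 20,000-25,000
    'company': '上海科技有限公司',  # a Shanghai technology company
    'address': '上海',              # Shanghai
    'eduBack': '本科',              # bachelor's degree
    'companyType': '民营',          # privately owned
    'scale': '1000-10000人',        # 1,000-10,000 employees
    'info': '小程序'                # mini programs
}
new_row = pd.Series(dic)
new_row.name = 38738  # the Series name becomes the new row's index label
data = pd.concat([data, new_row.to_frame().T])  # append the row
data = data.drop([990])                         # delete the row labelled 990
data['xx'] = range(len(data))                   # add a column
data = data.drop('序号', axis=1)                # delete the '序号' (serial number) column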
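
Because DataFrame.append() was removed in pandas 2.0, the following sketch adds the row with pd.concat() and then applies the same drop and assignment steps. The frame, labels, and column names are hypothetical stand-ins for the zlJob.csv data.

```python
import pandas as pd

# Hypothetical frame with explicit row labels.
data = pd.DataFrame(
    {"id": [1, 2, 3], "name": ["a", "b", "c"]},
    index=[10, 11, 12],
)

# Add a row: the Series name becomes the row label after transposing.
new_row = pd.Series({"id": 4, "name": "d"}, name=13)
data = pd.concat([data, new_row.to_frame().T])

data = data.drop([11])           # delete the row labelled 11
data["xx"] = range(len(data))    # add a column of consecutive integers
data = data.drop("id", axis=1)   # delete the "id" column

print(data)
```

Note that drop() returns a new frame, so the result has to be assigned back.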

2.3 Indexing

Use .loc for label-based indexing and .iloc for position-based indexing. The examples select a single cell, the first five values of one column, a single row, and a block of rows and columns.

data.loc[10, 'salary']   # cell at row label 10, column 'salary'
data.loc[:, 'name'][:5]  # first five values of the 'name' column
data.iloc[2]  # third row
data.iloc[:5, :4]  # first 5 rows, first 4 columns
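
The difference between the two indexers is easiest to see when row labels are not 0, 1, 2. A minimal sketch with a hypothetical frame:

```python
import pandas as pd

# Row labels deliberately differ from row positions.
data = pd.DataFrame(
    {"name": ["a", "b", "c"], "salary": [10, 20, 30]},
    index=[100, 200, 300],
)

by_label = data.loc[200, "salary"]       # row labelled 200
by_position = data.iloc[1, 1]            # second row, second column

label_slice = data.loc[100:200, "name"]  # label slices include the end label
pos_slice = data.iloc[0:2, 0]            # position slices exclude the end position

print(by_label, by_position)
print(label_slice.tolist(), pos_slice.tolist())
```

Both slices return the same two rows here, but only because the end label 200 is included while the end position 2 is not.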

2.4 Hierarchical Indexing

Create a multi‑level Series and DataFrame.

s = pd.Series(np.arange(1, 10), index=[list('aaabbccdd'), [1,2,3,1,2,3,1,2,3]])
df = pd.DataFrame(np.arange(12).reshape(4,3),
    index=[["a","a","b","b"],[1,2,1,2]],
    columns=[["X","X","Y"],["m","n","t"]])
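
Once a frame has hierarchical indexes, selections can target the outer level alone or a full key tuple. The sketch below rebuilds the DataFrame from this section and shows three common selections; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["X", "X", "Y"], ["m", "n", "t"]],
)

outer_a = df.loc["a"]                 # all rows under outer row label "a"
x_block = df["X"]                     # all columns under outer column label "X"
cell = df.loc[("b", 1), ("X", "n")]   # one cell, using full row and column keys

print(outer_a)
print(x_block)
print(cell)
```

Selecting by the outer level drops that level from the result, leaving the inner labels as the index or columns.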

Data Pre‑processing

3.1 Handling Missing Values

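Create a simple DataFrame, detect missing values with .isnull(), fill them with .fillna(0), and drop rows that are entirely null with .dropna(how='all') (pass axis=1 to drop all-null columns instead).

df = pd.DataFrame({
    'state': ['a','b','c','d'],
    'year': [1991,1992,1993,1994],
    'pop': [6.0,7.0,8.0,np.nan]  # np.NaN was removed in NumPy 2.0; use np.nan
})
df['pop'].isnull()                # True where a value is missing
df['pop'] = df['pop'].fillna(0)   # assign back rather than using inplace=True on a column selection
df = df.dropna(how='all')         # drop rows in which every value is missing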
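
To make the row-versus-column behaviour of dropna() concrete, here is a small sketch on a hypothetical frame that contains one all-null row and one all-null column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [np.nan, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan],
})

rows_kept = df.dropna(how="all")          # drops rows where every value is NaN
cols_kept = df.dropna(how="all", axis=1)  # drops columns where every value is NaN
filled = df.fillna(0)                     # returns a new frame; df is unchanged

print(rows_kept)
print(cols_kept)
print(filled)
```

All three calls return new objects, so the original frame still holds its missing values unless a result is assigned back.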

3.2 String Processing

Trim surrounding whitespace and convert text to lowercase with the .str accessor, shown here on a text column named A.

df['A'] = df['A'].str.strip()
df['A'] = df['A'].str.lower()
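
The two string steps can be chained in one expression. A short sketch on made-up values, including a missing entry to show that .str methods pass missing values through:

```python
import pandas as pd

# Hypothetical text column with stray spaces, mixed case, and a missing value.
df = pd.DataFrame({"A": ["  SH ", "Bj", None]})

# Strip whitespace, then lowercase; missing values stay missing.
df["A"] = df["A"].str.strip().str.lower()

print(df)
```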

3.3 Duplicate Handling

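Remove rows that duplicate a value in a column, keeping either the first or the last occurrence, and replace specific values. Drop duplicates at the DataFrame level with subset: assigning df['A'].drop_duplicates() back to the column realigns on the index and leaves NaN in place of the duplicates instead of removing the rows.

df = df.drop_duplicates(subset='A')               # keep the first occurrence (default)
df = df.drop_duplicates(subset='A', keep='last')  # or keep the last occurrence
df['A'] = df['A'].replace('sh', 'shanghai')       # replace() returns a new Series, so assign it back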
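
A small sketch of how the keep argument decides which rows survive when duplicates are dropped with subset. The frame is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": ["sh", "bj", "sh"], "B": [1, 2, 3]})

first = df.drop_duplicates(subset="A")              # keeps the first "sh" row
last = df.drop_duplicates(subset="A", keep="last")  # keeps the last "sh" row
renamed = df["A"].replace("sh", "shanghai")         # exact-value replacement

print(first)
print(last)
print(renamed.tolist())
```

The surviving rows keep their original index labels; call reset_index(drop=True) if a fresh 0-based index is wanted.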

DataFrame Operations

4.1 Grouping

Group by a column (e.g., job name), then apply an aggregate to each group.

group = data.groupby('name')
group.size()  # number of rows for each job name
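
As an illustration of the aggregation step, the sketch below groups a hypothetical frame by name and computes a mean and a row count per group. It assumes a numeric salary column, which the real zlJob.csv may not provide.

```python
import pandas as pd

# Hypothetical data; salary is assumed to be numeric here.
data = pd.DataFrame({
    "name": ["frontend", "backend", "frontend"],
    "salary": [20000, 25000, 22000],
})

group = data.groupby("name")
mean_salary = group["salary"].mean()  # average salary per job name
counts = group.size()                 # number of rows per job name

print(mean_salary)
print(counts)
```

Group keys are sorted by default, so the result index is in alphabetical order.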

4.2 Concatenation

Combine multiple DataFrames vertically.

frames = [df1, df2, df3]
result = pd.concat(frames)
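
A runnable sketch of vertical concatenation with two small hypothetical frames. It also shows ignore_index, which is useful when the original row labels should not be repeated.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3, 4]})

stacked = pd.concat([df1, df2])                         # original labels are kept
renumbered = pd.concat([df1, df2], ignore_index=True)   # labels are rebuilt from 0

print(stacked.index.tolist())
print(renumbered.index.tolist())
```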

4.3 Merging

Merge two DataFrames on a common key.

result = pd.merge(left, right, on='key')
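
The merge call above performs an inner join by default. The following sketch, using hypothetical left and right frames, contrasts it with an outer join, which keeps keys that appear on only one side.

```python
import pandas as pd

left = pd.DataFrame({"key": ["k0", "k1", "k2"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["k0", "k1", "k3"], "B": [10, 20, 30]})

inner = pd.merge(left, right, on="key")               # only keys present in both
outer = pd.merge(left, right, on="key", how="outer")  # all keys, NaN where unmatched

print(inner)
print(outer)
```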

Time Series

5.1 Generating a Period Range

date = pd.period_range(start='20210913', end='20210919')

5.2 Using Time Series in a DataFrame

index = pd.period_range(start='20210913', end='20210918')
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=index)
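
With a PeriodIndex in place, rows can be selected by date. This sketch rebuilds the frame above with an explicit daily frequency and selects one day and a date range; the variable names one_day and span are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.period_range(start="2021-09-13", end="2021-09-18", freq="D")
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=index)

one_day = df.loc[pd.Period("2021-09-15", freq="D")]  # single row as a Series
span = df.loc["2021-09-14":"2021-09-16"]             # label slice, end date included

print(one_day)
print(span)
```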

Summary

The tutorial demonstrates common Pandas operations (loading data, creating Series and DataFrames, basic inspection, cleaning, hierarchical indexing, grouping, merging, concatenating, and time-series handling) using the sample file zlJob.csv. For detailed explanations, refer to the official Pandas documentation.
