
Master Pandas: From Data Loading to Advanced Manipulation

This Pandas tutorial walks through loading CSV and Excel files, creating Series and DataFrames, performing basic operations, cleaning data, handling missing values, working with hierarchical indexes, grouping, merging, concatenating, and applying time-series techniques, each illustrated with a short code example.

Python Crawling & Data Mining

Jan 23, 2022


Pandas Introduction

This article presents a step‑by‑step Pandas tutorial covering data loading, creation, basic manipulation, cleaning, time‑series handling, and summary operations, using the sample file zlJob.csv as the source dataset.

Generating DataFrames

1.1 Reading Data

Read CSV files with pd.read_csv() and Excel files with pd.read_excel(). Both return a DataFrame.

import pandas as pd

pd.read_csv('path/to/file.csv')

pd.read_excel('path/to/file.xlsx')  # .xlsx files need the openpyxl package installed
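To try pd.read_csv() without a file on disk, the following sketch parses a small in-memory CSV. The column names and values are invented for illustration and are not taken from zlJob.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """name,salary,address
Data Analyst,15000,Shanghai
Web Developer,20000,Beijing
"""

# read_csv accepts a path or any file-like object, such as StringIO.
jobs = pd.read_csv(io.StringIO(csv_text))

print(jobs.shape)             # (rows, columns)
print(jobs.columns.tolist())  # header row becomes the column labels
```

The same call works unchanged when the first argument is a real file path.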

1.2 Creating Data

Create a Series (one‑dimensional) and a DataFrame (two‑dimensional).

import numpy as np  # NumPy supplies the array used for column D below

s = pd.Series([1, 2, 3, 4, 5])

df2 = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo"
})
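The DataFrame constructor aligns Series on their index and repeats scalar values for every row. A minimal sketch of that behaviour, using hypothetical labels and column names:

```python
import pandas as pd

# A Series with explicit string labels instead of the default 0..n-1 index.
s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# The frame takes its row index from the Series; the scalar True is
# broadcast to every row.
frame = pd.DataFrame({"x": s, "flag": True})

print(frame)
```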

Basic DataFrame Operations

2.1 Viewing Data

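Here data is assumed to be the DataFrame loaded from zlJob.csv. Show the first five rows with data.head(), get the (rows, columns) shape with data.shape and the per-column data types with data.dtypes, and flag missing values in a column with data['name'].isnull(), which returns a Boolean Series.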

data.head()

data.shape

data.dtypes

data['name'].isnull()
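The inspection calls can be tried on a small stand-in frame. The data below is hypothetical; only the column name 'name' mirrors the tutorial's dataset.

```python
import pandas as pd

# Hypothetical stand-in for the job dataset, with one missing name.
data = pd.DataFrame({
    "name": ["Data Analyst", None, "Web Developer"],
    "salary": [15000, 18000, 20000],
})

first_rows = data.head(2)            # head(n) limits the preview to n rows
null_mask = data["name"].isnull()    # Boolean Series, True where missing
null_count = int(null_mask.sum())    # True counts as 1 when summed

print(data.shape)
print(data.dtypes)
print(null_count)
```

Summing the Boolean mask is a quick way to count missing values per column.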

2.2 Row and Column Operations

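Add a new row by concatenating a one-row DataFrame built from a Series (DataFrame.append was removed in pandas 2.0), delete a row with data.drop([990]), add a column with data['xx'] = range(len(data)), and delete a column with data.drop('序号', axis=1). The column '序号' holds each record's serial number.

dic = {
    'name': '前端开发',               # front-end development
    'salary': '2万-2.5万',            # 20,000-25,000
    'company': '上海科技有限公司',     # Shanghai Technology Co., Ltd.
    'address': '上海',                # Shanghai
    'eduBack': '本科',                # bachelor's degree
    'companyType': '民营',            # privately owned
    'scale': '1000-10000人',          # 1,000-10,000 employees
    'info': '小程序'                  # mini program
}
new_row = pd.Series(dic)
new_row.name = 38738  # index label of the new row
data = pd.concat([data, new_row.to_frame().T])

data = data.drop([990])  # remove the row labelled 990

data['xx'] = range(len(data))

data = data.drop('序号', axis=1)  # axis=1 targets columns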
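The row and column operations can be checked end to end on a tiny frame. This is a sketch with made-up values; it uses pd.concat to add the row, which is one way to do it on current pandas versions.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the job dataset.
data = pd.DataFrame(
    {"name": ["Data Analyst", "Web Developer"], "salary": [15000, 20000]},
    index=[0, 1],
)

# Add a row: turn the Series into a one-row DataFrame, then concatenate.
# The Series name becomes the index label of the new row.
new_row = pd.Series({"name": "Tester", "salary": 12000}, name=2)
data = pd.concat([data, new_row.to_frame().T])

# Delete the row labelled 0.
data = data.drop([0])

# Add a column, then build a copy without it.
data["xx"] = range(len(data))
without_xx = data.drop("xx", axis=1)

print(data)
```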

2.3 Indexing

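Use .loc for label-based indexing and .iloc for position-based indexing. The examples select a single cell, the first five values of a column, a single row, and a block of rows and columns.

data.loc[10, 'salary']  # cell at row label 10, column 'salary'

data.loc[:, 'name'][:5]  # first five values of the 'name' column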

data.iloc[2]  # third row

data.iloc[:5, :4]  # first 5 rows, first 4 columns
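The difference between the two indexers is easiest to see when the row labels are not 0, 1, 2. A small sketch with hypothetical labels and values:

```python
import pandas as pd

# Row labels 10, 20, 30 deliberately differ from positions 0, 1, 2.
data = pd.DataFrame(
    {"name": ["a", "b", "c"], "salary": [100, 200, 300]},
    index=[10, 20, 30],
)

by_label = data.loc[20, "salary"]   # row labelled 20
by_position = data.iloc[2, 1]       # third row, second column

label_slice = data.loc[10:20]       # label slices include both endpoints
position_slice = data.iloc[0:1]     # position slices exclude the end

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Note that a .loc slice includes its end label, while an .iloc slice follows the usual Python half-open convention.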

2.4 Hierarchical Indexing

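Create a multi-level Series and DataFrame by passing a list of arrays as the index (and, for a DataFrame, optionally as the columns). Each array supplies one level of the resulting MultiIndex.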

s = pd.Series(np.arange(1, 10), index=[list('aaabbbccc'), [1,2,3,1,2,3,1,2,3]])

df = pd.DataFrame(np.arange(12).reshape(4,3),
    index=[["a","a","b","b"],[1,2,1,2]],
    columns=[["X","X","Y"],["m","n","t"]])
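Once a frame has hierarchical labels, selections can address the outer level alone or a full tuple of levels. The sketch below rebuilds the frame from this section and shows three such selections; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["X", "X", "Y"], ["m", "n", "t"]],
)

outer_a = df.loc["a"]                 # all rows under outer label "a"
x_block = df["X"]                     # all columns under outer label "X"
cell = df.loc[("b", 1), ("X", "n")]   # one cell addressed by full tuples

print(outer_a)
print(x_block)
print(cell)
```

Selecting by an outer label drops that level from the result, so outer_a is indexed by 1 and 2 only.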

Data Pre‑processing

3.1 Handling Missing Values

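Create a simple DataFrame, detect missing values with .isnull(), fill them with .fillna(0), and drop rows (or, with axis=1, columns) that are entirely null with .dropna(how='all'). Assign the results back, because these methods return new objects by default.

df = pd.DataFrame({
    'state': ['a','b','c','d'],
    'year': [1991,1992,1993,1994],
    'pop': [6.0,7.0,8.0,np.nan]  # np.NaN was removed in NumPy 2.0
})
df['pop'].isnull()               # Boolean mask of missing values
df['pop'] = df['pop'].fillna(0)  # inplace=True on a selected column is unreliable
df = df.dropna(how='all')        # drop rows in which every value is null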
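The example frame above has no row or column that is entirely null, so dropna(how='all') leaves it unchanged. A sketch with hypothetical data that does contain such a row and column:

```python
import numpy as np
import pandas as pd

# The last row and the "empty" column contain nothing but missing values.
df = pd.DataFrame({
    "state": ["a", "b", None],
    "pop": [6.0, np.nan, np.nan],
    "empty": [np.nan, np.nan, np.nan],
})

rows_kept = df.dropna(how="all")           # drops the all-null row
cols_kept = df.dropna(axis=1, how="all")   # drops the all-null column
filled = df.fillna({"pop": 0})             # fill only the "pop" column

print(rows_kept)
print(cols_kept)
print(filled)
```

Passing a dict to fillna limits the fill to the named columns, which avoids writing 0 into text columns by accident.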

3.2 String Processing

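Trim leading and trailing whitespace and convert text to lower case with the .str accessor. Column A stands for any string column.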

df['A'] = df['A'].str.strip()
df['A'] = df['A'].str.lower()
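The .str methods can be chained, and they pass missing values through untouched. A minimal sketch on hypothetical city names:

```python
import pandas as pd

# Untidy text: padding, mixed case, and one missing value.
df = pd.DataFrame({"A": ["  Shanghai ", "BEIJING", None]})

# strip() removes surrounding whitespace, lower() normalises the case.
df["A"] = df["A"].str.strip().str.lower()

print(df["A"].tolist())
```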

3.3 Duplicate Handling

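Remove rows with duplicate values in a column, keeping either the first (default) or the last occurrence, and replace specific values. Call drop_duplicates on the DataFrame with subset=: assigning a de-duplicated column back to the frame only realigns it on the index and turns the duplicates into NaN instead of removing the rows.

df = df.drop_duplicates(subset='A')               # keep the first occurrence
df = df.drop_duplicates(subset='A', keep='last')  # or: keep the last occurrence
df['A'] = df['A'].replace('sh', 'shanghai')       # replace returns a new Series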
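The effect of the keep argument shows up in which rows survive. A sketch on a hypothetical three-row frame, where column B records the original row order:

```python
import pandas as pd

df = pd.DataFrame({"A": ["sh", "bj", "sh"], "B": [1, 2, 3]})

# Rows are compared on column A only; B shows which row was kept.
first_kept = df.drop_duplicates(subset="A")
last_kept = df.drop_duplicates(subset="A", keep="last")

# replace returns a new Series and leaves df unchanged.
replaced = df["A"].replace("sh", "shanghai")

print(first_kept)
print(last_kept)
print(replaced.tolist())
```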

DataFrame Operations

4.1 Grouping

Group by a column (e.g., the job name) to get a GroupBy object, then call an aggregation such as .size() or .mean() on it to compute one result per group.

group = data.groupby('name')
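A GroupBy object does nothing until an aggregation is applied. The sketch below uses hypothetical job names and numeric salaries to show two common aggregations.

```python
import pandas as pd

# Hypothetical data: two job names, two postings each.
data = pd.DataFrame({
    "name": ["analyst", "developer", "analyst", "developer"],
    "salary": [10, 20, 30, 40],
})

group = data.groupby("name")

# One aggregate per group.
mean_salary = group["salary"].mean()

# Several aggregates at once; the result has one column per function.
summary = group["salary"].agg(["count", "min", "max"])

print(mean_salary)
print(summary)
```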

4.2 Concatenation

Combine multiple DataFrames vertically.

frames = [df1, df2, df3]
result = pd.concat(frames)
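By default pd.concat keeps each frame's original row labels, so the result can contain repeated index values. A sketch with three hypothetical single-column frames:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [3, 4]})
df3 = pd.DataFrame({"A": [5, 6]})
frames = [df1, df2, df3]

# Original labels are preserved: the index repeats 0, 1 for each frame.
stacked = pd.concat(frames)

# ignore_index=True discards them and numbers the rows 0..n-1.
renumbered = pd.concat(frames, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```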

4.3 Merge

Merge two DataFrames on a common key column. The default is an inner join, which keeps only the keys present in both frames.

result = pd.merge(left, right, on='key')
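The how argument controls which keys survive the merge. A sketch with hypothetical frames named left and right that share two of their three keys:

```python
import pandas as pd

left = pd.DataFrame({"key": ["k0", "k1", "k2"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["k0", "k1", "k3"], "B": [4, 5, 6]})

# Inner join (the default): only keys found in both frames.
inner = pd.merge(left, right, on="key")

# Outer join: every key from either frame, with NaN where a side is missing.
outer = pd.merge(left, right, on="key", how="outer")

print(inner)
print(outer)
```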

Time Series

5.1 Generating a Period Range

date = pd.period_range(start='20210913', end='20210919')  # seven daily periods; the default frequency is 'D'

5.2 Using Time Series in a DataFrame

index = pd.period_range(start='20210913', end='20210918')  # six daily periods, one per row
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=index)
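With a PeriodIndex in place, rows can be selected by period or by a date range. The sketch below rebuilds the frame from this section; the selection variable names are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.period_range(start="2021-09-13", end="2021-09-18", freq="D")
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=index)

# Select one row by its Period label.
one_day = df.loc[pd.Period("2021-09-15", freq="D")]

# Slice by date strings; both endpoints are included.
window = df.loc["2021-09-14":"2021-09-16"]

# Convert the periods to timestamps marking the start of each period.
stamps = index.to_timestamp()

print(one_day.tolist())
print(len(window))
print(stamps[0])
```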

Summary

The tutorial demonstrates common Pandas operations—loading data, creating Series/DataFrames, basic inspection, cleaning, hierarchical indexing, grouping, merging, concatenating, and time‑series handling—using the sample zlJob.csv. For detailed explanations, refer to the official Pandas documentation.
