
Master Pandas: From Data Import to Advanced Manipulation in Python

This tutorial walks you through pandas fundamentals—including reading CSV/Excel files, creating Series and DataFrames, performing basic operations, cleaning data, using loc/iloc indexing, grouping, concatenating, merging, and handling time series—providing code examples and visual outputs for each step.

Python Crawling & Data Mining

Jan 28, 2025


Pandas Introduction

This article provides a comprehensive tutorial on using pandas for data manipulation in Python, covering data import, creation of Series and DataFrames, basic operations, cleaning, indexing, grouping, aggregation, concatenation, merging, and time‑series handling.

Data Import

Read CSV and Excel files:

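import pandas as pd
import numpy as np  # used by later examples

pd.read_csv('file.csv')

pd.read_excel('file.xlsx')  # needs an Excel engine such as openpyxl installed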

Contents

Generate data tables

Basic table operations

Data cleaning

Time series

1. Generating Data Tables

1.1 Data Reading

Typical data sources are CSV or Excel files. Example for CSV:

data = pd.read_csv('zlJob.csv')  # later examples refer to this DataFrame as data
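To try the reading step without the original file, here is a minimal sketch that parses CSV text from memory. The column names and rows are hypothetical stand-ins, not the real contents of zlJob.csv.

```python
import io
import pandas as pd

# Hypothetical sample data; the real zlJob.csv may have different columns.
csv_text = """name,salary,company
Front-end developer,20k-25k,Company A
Data analyst,15k-20k,Company B
"""

# read_csv accepts a file path or any file-like object.
jobs = pd.read_csv(io.StringIO(csv_text))
print(jobs.shape)           # (2, 3)
print(list(jobs.columns))   # ['name', 'salary', 'company']
```

The same call works with a path on disk, as in the line above.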

1.2 Creating Data

Create a Series (1‑D) and a DataFrame (2‑D):

s = pd.Series([1, 2, 3, 4, 5])

df2 = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo"
})
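The dictionary above mixes scalars with length-4 columns. As a rough illustration, the sketch below uses a reduced version of that dictionary to show that scalars are repeated to match the longest column and that each column keeps its own dtype.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "A": 1.0,                                                  # scalar, repeated on every row
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),  # sets the row count to 4
    "D": np.array([3] * 4, dtype="int32"),
    "F": "foo",                                                # scalar string, repeated
})

print(df2.shape)    # (4, 4)
print(df2.dtypes)   # one dtype per column
```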

2. Basic Table Operations

2.1 Viewing Data

Show first five rows:

data.head()  # default is 5 rows

Basic information:

data.shape

(990, 9)

data.dtypes

Check for missing values in a column:

data['name'].isnull()
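isnull() returns one boolean per row, which is hard to read on a large table. A common follow-up, sketched here on a small made-up frame, is to count the missing values per column or to filter the affected rows.

```python
import numpy as np
import pandas as pd

# Made-up sample; column names only mirror the article's examples.
sample = pd.DataFrame({
    "name": ["a", None, "c"],
    "salary": [1.0, 2.0, np.nan],
})

# True counts as 1, so summing the boolean mask counts missing values per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Boolean indexing keeps only the rows where name is missing.
rows_with_missing_name = sample[sample["name"].isnull()]
print(rows_with_missing_name)
```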

2.2 Row and Column Operations

Add a new row:

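row = pd.Series({
    'name': '前端开发',              # front-end development
    'salary': '2万-2.5万',           # 20k-25k
    'company': '上海科技有限公司',    # Shanghai Technology Co., Ltd.
    'adress': '上海',                # Shanghai
    'eduBack': '本科',               # bachelor's degree
    'companyType': '民营',           # privately owned
    'scale': '1000-10000人',         # 1000-10000 employees
    'info': '小程序'                 # mini programs
})
row.name = 38738  # the Series name becomes the new row's index label
# DataFrame.append was removed in pandas 2.0; use pd.concat instead
data = pd.concat([data, row.to_frame().T])
data.tail()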

Delete a row by its index label (here, the row added above):

data = data.drop([38738])

Add a new column:

data['xx'] = range(len(data))

Delete a column (axis=1 means column):

data = data.drop('序号', axis=1)
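The row and column operations above can be tried end to end on a small frame. This is a sketch with made-up data; table, new_row, and the rank column are illustrative names only.

```python
import pandas as pd

table = pd.DataFrame({"name": ["a", "b"], "salary": [10, 20]})

# Add a row: build a one-row DataFrame and concatenate it.
new_row = pd.DataFrame([{"name": "c", "salary": 30}])
table = pd.concat([table, new_row], ignore_index=True)

# Add a column with one value per row.
table["rank"] = range(len(table))

# Delete a row by index label, then a column by name.
table = table.drop(index=[0])
table = table.drop(columns=["rank"])
print(table)
```

drop(columns=...) is equivalent to drop(..., axis=1) and states the intent more directly.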

2.3 Indexing

Label‑based indexing with loc:

data.loc[10, 'salary']  # returns salary at index 10

data.loc[:, 'name'][:5]

Position‑based indexing with iloc:

data.iloc[2]  # third row

data.iloc[:5, :4]  # first 5 rows, first 4 columns
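The difference between loc and iloc is easiest to see when the index labels are not 0, 1, 2, and so on. A small sketch with made-up labels:

```python
import pandas as pd

frame = pd.DataFrame(
    {"name": ["a", "b", "c", "d"], "salary": [10, 20, 30, 40]},
    index=[10, 20, 30, 40],   # labels deliberately differ from positions
)

# loc slices by label and includes the end label.
by_label = frame.loc[10:30, "salary"]

# iloc slices by position and excludes the end position.
by_position = frame.iloc[0:2, 1]

print(list(by_label))      # [10, 20, 30]
print(list(by_position))   # [10, 20]
```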

2.4 Hierarchical Indexing

Series with multi‑level index:

s = pd.Series(np.arange(1, 10), index=[list('aaabbccdd'), [1,2,3,1,2,3,1,2,3]])

DataFrame with multi‑level index:

df = pd.DataFrame(np.arange(12).reshape(4,3), index=[["a","a","b","b"],[1,2,1,2]], columns=[["X","X","Y"],["m","n","t"]])
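Once a multi-level index exists, selection works level by level. The sketch below reuses the DataFrame defined above and shows three common lookups.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["X", "X", "Y"], ["m", "n", "t"]],
)

# Select every row under the outer label "a"; the outer level is dropped.
outer_a = df.loc["a"]

# Select a single cell with a (row tuple, column tuple) pair.
cell = df.loc[("b", 1), ("X", "n")]

# Select all columns under the outer column label "X".
x_cols = df["X"]

print(outer_a.shape)          # (2, 3)
print(cell)                   # 7
print(list(x_cols.columns))   # ['m', 'n']
```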

3. Data Pre‑processing

3.1 Handling Missing Values

Create a simple table with a missing value:

df = pd.DataFrame({
    'state': ['a','b','c','d'],
    'year': [1991,1992,1993,1994],
    'pop': [6.0,7.0,8.0,np.nan]  # np.NaN was removed in NumPy 2.0
})
print(df)

Check missing values:

df['pop'].isnull()

Fill missing values with 0:

df['pop'] = df['pop'].fillna(0)  # assign back; inplace=True on a selected column is unreliable
print(df)

Drop rows where all values are missing (dropna returns a new DataFrame, so assign the result):

data = data.dropna(how='all')
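The how argument changes which rows are dropped, and fillna can use a computed value instead of a constant. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "state": ["a", "b", None],
    "pop": [6.0, np.nan, np.nan],
})

# how="all" drops a row only when every value in it is missing.
dropped_all = df.dropna(how="all")

# how="any" (the default) drops a row when at least one value is missing.
dropped_any = df.dropna(how="any")

# Fill missing values with the column mean; mean() skips NaN by default.
filled_mean = df["pop"].fillna(df["pop"].mean())

print(list(dropped_all.index))   # [0, 1]
print(list(dropped_any.index))   # [0]
print(list(filled_mean))         # [6.0, 6.0, 6.0]
```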

3.2 String Processing

Strip surrounding whitespace and convert a text column to lowercase (column 'A' stands for any string column):

df['A'] = df['A'].str.strip().str.lower()
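As a quick check of what the .str methods do, the following sketch cleans a made-up Series of city names. Missing values pass through unchanged rather than raising an error.

```python
import pandas as pd

cities = pd.Series(["  Shanghai ", "BEIJING", None])

# Each .str call returns a new Series, so the calls can be chained.
cleaned = cities.str.strip().str.lower()
print(cleaned)
```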

3.3 Duplicate Handling

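Remove rows whose value in column 'A' repeats an earlier row (the first occurrence is kept by default). Assigning a deduplicated column back to df['A'] would not remove rows; it would leave NaN in the duplicated positions.

df = df.drop_duplicates(subset='A')
# keep last occurrence
df = df.drop_duplicates(subset='A', keep='last')

Replace one value with another:

df['A'] = df['A'].replace('sh', 'shanghai')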
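To see which rows count as duplicates before removing them, duplicated() returns a boolean mask. A small sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"A": ["sh", "bj", "sh"], "B": [1, 2, 3]})

# True marks a row whose "A" value already appeared in an earlier row.
mask = df.duplicated(subset="A")

first = df.drop_duplicates(subset="A")               # keeps the first "sh"
last = df.drop_duplicates(subset="A", keep="last")   # keeps the last "sh"

print(list(mask))          # [False, False, True]
print(list(first.index))   # [0, 1]
print(list(last.index))    # [1, 2]
```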

4. Table Operations

4.1 Grouping

Group by a column (e.g., job name):

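group = data.groupby('name')
print(group)  # prints a DataFrameGroupBy object, not the grouped rows

A GroupBy object only describes the grouping. Apply an aggregation such as size, mean, or sum to get results, for example group.size().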
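A short sketch of aggregation on a grouped column follows. The numeric salary_k column is hypothetical; the salary column in the article's data is text and would need parsing first.

```python
import pandas as pd

jobs = pd.DataFrame({
    "name": ["dev", "dev", "analyst"],
    "salary_k": [20, 30, 15],   # hypothetical numeric salaries
})

# One aggregate per group.
mean_salary = jobs.groupby("name")["salary_k"].mean()

# Several aggregates at once; the result has one column per function.
summary = jobs.groupby("name")["salary_k"].agg(["count", "sum"])

print(mean_salary)
print(summary)
```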

4.2 Concatenation (concat)

pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)

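The most commonly used parameters: axis=0 stacks objects vertically and axis=1 places them side by side; join='outer' keeps the union of the labels on the other axis, while join='inner' keeps only the labels shared by all objects; ignore_index=True discards the original index and renumbers from 0; keys adds an outer index level that identifies which input each row came from. The pandas documentation covers the rest.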
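The effect of join is clearest when the inputs do not share all columns. A minimal sketch with two made-up frames:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [5], "z": [6]})

# Outer join on columns: keep x, y and z, filling the gaps with NaN.
outer = pd.concat([a, b], ignore_index=True)

# Inner join on columns: keep only x, which both inputs have.
inner = pd.concat([a, b], join="inner", ignore_index=True)

print(list(outer.columns))   # ['x', 'y', 'z']
print(list(inner.columns))   # ['x']
```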

4.3 Merge

pd.merge(left, right, how='inner', on='key')
# left and right are DataFrames with a common column 'key'
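The how argument decides which keys survive the merge. The sketch below compares an inner and a left join on two made-up frames; lval and rval are illustrative column names.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [4, 5, 6]})

# Inner join: only keys present in both frames.
inner = pd.merge(left, right, how="inner", on="key")

# Left join: every key from left; rval is NaN where right has no match.
left_join = pd.merge(left, right, how="left", on="key")

print(list(inner["key"]))       # ['b', 'c']
print(list(left_join["key"]))   # ['a', 'b', 'c']
```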

5. Time Series

5.1 Generating a Date Range

date = pd.period_range(start='20210913', end='20210919')
print(date)

PeriodIndex(['2021-09-13', '2021-09-14', '2021-09-15', '2021-09-16', '2021-09-17', '2021-09-18', '2021-09-19'], dtype='period[D]', freq='D')
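period_range produces spans of time (here, whole days). When points in time are wanted instead, pd.date_range is the usual counterpart. This sketch compares the two over the same dates.

```python
import pandas as pd

periods = pd.period_range(start="20210913", end="20210919")           # PeriodIndex of days
stamps = pd.date_range(start="20210913", end="20210919", freq="D")    # DatetimeIndex of midnights

print(len(periods), len(stamps))   # 7 7
print(type(periods[0]).__name__)   # Period
print(type(stamps[0]).__name__)    # Timestamp
```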

5.2 Using Time Series in pandas

index = pd.period_range(start='20210913', end='20210918')
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=index)
print(df)
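With a period index in place, rows can be selected by date. A minimal sketch using the same DataFrame as above:

```python
import numpy as np
import pandas as pd

index = pd.period_range(start="20210913", end="20210918")
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=index)

# Select one row with an explicit Period object.
one_day = df.loc[pd.Period("2021-09-14", freq="D")]

# Slice by date strings; both endpoints are included.
span = df.loc["2021-09-14":"2021-09-16"]

# Convert the periods to timestamps when a DatetimeIndex is needed.
stamps = df.index.to_timestamp()

print(list(one_day))   # [4, 5, 6, 7]
print(len(span))       # 3
```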

6. Conclusion

This article demonstrated common pandas operations on the sample file zlJob.csv, including data import, creation, inspection, cleaning, indexing, grouping, concatenation, merging, and time‑series handling. For more detailed explanations, refer to the official pandas documentation.

https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html
