
Comprehensive Guide to Pandas Data Processing in Python

This tutorial provides a detailed overview of Pandas, covering its core data structures, data import/export, selection, cleaning, aggregation, merging, and a practical sales analysis example, with complete code snippets for each operation.

php中文网 Courses

Pandas is one of the most widely used data processing libraries in Python, with tools for data cleaning, transformation, analysis, and visualization. This tutorial covers the core data processing techniques: reading and exporting data, selecting and filtering, handling missing values, aggregation, and merging.

1. Introduction to Pandas Data Structures

Pandas mainly provides two data structures:

Series: a one-dimensional array, similar to a labeled list.

DataFrame: a two-dimensional table structure, similar to an Excel sheet or an SQL table, and the most commonly used structure.
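
To make the relationship between the two structures concrete, here is a minimal sketch. The names and values are illustrative only; it shows a Series being read by label and by position, and a DataFrame column coming back as a Series.

```python
import pandas as pd

# A Series pairs each value with an index label.
ages = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"], name="Age")

by_label = ages["Bob"]       # look up by index label
by_position = ages.iloc[0]   # look up by integer position

# A DataFrame is a table whose columns are Series sharing one index.
df = pd.DataFrame(
    {"Age": [25, 30, 35], "City": ["New York", "London", "Tokyo"]},
    index=["Alice", "Bob", "Charlie"],
)
age_column = df["Age"]       # selecting one column returns a Series

print(by_label, by_position, type(age_column).__name__)
```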

1.1 Creating a DataFrame

import pandas as pd

# Create from a dictionary
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "London", "Tokyo"]
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Tokyo

2. Data Reading and Export

Pandas supports reading and storing multiple data formats:

2.1 Reading CSV/Excel/SQL Data

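# Read CSV
df = pd.read_csv("data.csv")

# Read Excel (.xlsx files need the openpyxl package installed)
df = pd.read_excel("data.xlsx")

# Read from an SQLite database
import sqlite3
conn = sqlite3.connect("database.db")
df = pd.read_sql("SELECT * FROM users", conn)
conn.close()  # release the database file once the data is loaded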
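
The SQL snippet above depends on a database.db file that may not exist on your machine. As a self-contained sketch, the version below builds a throwaway in-memory SQLite database; the users table and its name/age columns are assumptions made for illustration. It also passes a query parameter through params instead of formatting it into the SQL string.

```python
import sqlite3

import pandas as pd

# ":memory:" creates a temporary database that vanishes when closed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Alice", 25), ("Bob", 30), ("Charlie", 35)],
)

# The ? placeholder is filled from params by the sqlite3 driver.
df = pd.read_sql("SELECT * FROM users WHERE age > ?", conn, params=(26,))
conn.close()

print(df)
```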

2.2 Exporting Data

# Save as CSV
df.to_csv("output.csv", index=False)

# Save as Excel
df.to_excel("output.xlsx", index=False)
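
One quick way to check an export is to read it straight back. The sketch below does a CSV round trip through an in-memory buffer (io.StringIO) rather than a file on disk, so it runs anywhere; the sample rows are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Write CSV text into a buffer; index=False leaves out the row index column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and parse the same text back into a DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Without index=False, the row index is written as an extra unnamed first column, which then reappears as a column when the file is read back.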

3. Data Selection and Query

3.1 Selecting Columns and Rows

# Select a single column
ages = df["Age"]

# Select multiple columns
subset = df[["Name", "City"]]

# Filter rows by condition
young_people = df[df["Age"] < 30]
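
Filters often need more than one condition. The following sketch reuses the sample data from section 1.1 and shows three common patterns: combining conditions with &, testing membership with isin, and writing the filter as a string with query.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "London", "Tokyo"],
})

# Combine conditions with & (and) or | (or); wrap each one in parentheses.
over_25_not_tokyo = df[(df["Age"] > 25) & (df["City"] != "Tokyo")]

# isin() keeps rows whose value appears in the given list.
in_selected_cities = df[df["City"].isin(["London", "Tokyo"])]

# query() expresses a filter as a string over column names.
under_35 = df.query("Age < 35")

print(over_25_not_tokyo)
```

The parentheses matter because & binds more tightly than comparison operators in Python.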

3.2 Using loc and iloc

loc: label-based selection.

iloc: integer-position-based selection.

# Select the row whose index label is 0 (the first row under the default index)
row = df.loc[0]

# Select the first two rows (position based; the end position is excluded)
rows = df.iloc[0:2]
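
With the default 0, 1, 2 index, labels and positions coincide, which hides the difference between loc and iloc. The sketch below uses string labels (chosen for illustration) so the two diverge. Note that a loc slice includes its end label, while an iloc slice excludes its end position.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [25, 30, 35], "City": ["New York", "London", "Tokyo"]},
    index=["alice", "bob", "charlie"],
)

by_label = df.loc["bob", "City"]     # row labeled "bob", column "City"
by_position = df.iloc[1, 1]          # second row, second column

label_slice = df.loc["alice":"bob"]  # end label is included: 2 rows
position_slice = df.iloc[0:1]        # end position is excluded: 1 row

print(by_label, by_position, len(label_slice), len(position_slice))
```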

4. Data Cleaning and Processing

4.1 Handling Missing Values

# Check missing values
print(df.isnull().sum())

# Drop missing values
df_cleaned = df.dropna()

# Fill missing values with 0
df_filled = df.fillna(0)
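
Dropping every incomplete row or filling everything with 0 is often too blunt. As one possible refinement, the sketch below drops rows only when a chosen column is missing and fills a numeric column with its own mean. The sample data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Age": [25.0, float("nan"), 35.0, 40.0],
})

# Count missing values per column.
missing_per_column = df.isnull().sum()

# Drop a row only if "Name" is missing; a missing "Age" is tolerated.
named_only = df.dropna(subset=["Name"])

# Fill missing ages with the mean of the known ages (mean() skips NaN).
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())

print(missing_per_column)
print(filled)
```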

4.2 Removing Duplicates

df.drop_duplicates(inplace=True)
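
By default, drop_duplicates compares entire rows and keeps the first occurrence. The sketch below, using illustrative data, shows how duplicated flags repeats and how subset and keep change what counts as a duplicate and which copy survives.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Alice", "Bob", "Bob"],
    "City": ["New York", "New York", "London", "Paris"],
})

# True for each row that repeats an earlier row exactly.
duplicate_mask = df.duplicated()

# Default behavior: compare all columns, keep the first occurrence.
full_row_unique = df.drop_duplicates()

# Compare only "Name" and keep the last row seen for each name.
one_per_name = df.drop_duplicates(subset=["Name"], keep="last")

print(one_per_name)
```

Unlike the inplace=True form, these calls return a new DataFrame and leave df unchanged.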

4.3 Data Transformation

# Convert strings to lowercase
df["Name"] = df["Name"].str.lower()

# Standardize a numeric column (z-score: subtract the mean, divide by the standard deviation)
df["Age"] = (df["Age"] - df["Age"].mean()) / df["Age"].std()
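
The numeric transformation above is a z-score. For comparison, here is a small sketch that computes both a z-score and min-max scaling on the sample ages. Min-max scaling is an alternative not used in the original snippet; which one fits depends on the downstream task.

```python
import pandas as pd

ages = pd.Series([25, 30, 35], name="Age")

# Z-score: mean 0 and standard deviation 1.
# Series.std() uses the sample standard deviation (ddof=1) by default.
z_scores = (ages - ages.mean()) / ages.std()

# Min-max scaling: maps the smallest value to 0 and the largest to 1.
min_max = (ages - ages.min()) / (ages.max() - ages.min())

print(z_scores.tolist())
print(min_max.tolist())
```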

5. Data Aggregation and Grouping

5.1 groupby Aggregation

# Group by city and compute average age
grouped = df.groupby("City")["Age"].mean()
print(grouped)
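
A group can be summarized with several statistics at once. The sketch below uses made-up rows with repeated cities so that each group has more than one member; the output column names people and average_age are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["London", "London", "Tokyo", "Tokyo", "Tokyo"],
    "Age": [30, 40, 20, 25, 45],
})

# Several aggregations on one column: one output column per function.
summary = df.groupby("City")["Age"].agg(["count", "mean", "max"])

# Named aggregation: output_name=(source_column, function).
named = df.groupby("City").agg(
    people=("Age", "count"),
    average_age=("Age", "mean"),
)

print(summary)
print(named)
```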

5.2 Pivot Table

pivot_table = df.pivot_table(index="City", values="Age", aggfunc="mean")
print(pivot_table)
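
With only index and values, a pivot table gives the same numbers as the groupby in 5.1. It becomes more useful once a second dimension is spread across the columns. The sketch below assumes a hypothetical Gender column, which is not part of the original sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["London", "London", "Tokyo", "Tokyo"],
    "Gender": ["F", "M", "F", "F"],  # hypothetical column for illustration
    "Age": [30, 40, 20, 30],
})

# Rows are cities, columns are genders, cells hold the mean age.
pivot = df.pivot_table(index="City", columns="Gender", values="Age", aggfunc="mean")

# Combinations with no data become NaN unless fill_value is given.
pivot_filled = df.pivot_table(
    index="City", columns="Gender", values="Age", aggfunc="mean", fill_value=0
)

print(pivot)
print(pivot_filled)
```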

6. Data Merging and Joining

6.1 concat Merge

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})
combined = pd.concat([df1, df2], ignore_index=True)
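
To illustrate what ignore_index changes, the sketch below concatenates the same two frames with and without it, and also places them side by side with axis=1. The "_2" suffix is an arbitrary choice that keeps the column names distinct.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

# Without ignore_index, each frame keeps its own row labels (0, 1, 0, 1).
stacked = pd.concat([df1, df2])

# ignore_index=True renumbers the rows 0..3.
renumbered = pd.concat([df1, df2], ignore_index=True)

# axis=1 places the frames side by side, aligned on the row index.
side_by_side = pd.concat([df1, df2.add_suffix("_2")], axis=1)

print(stacked.index.tolist(), renumbered.index.tolist())
print(side_by_side)
```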

6.2 merge Join (SQL‑like)

left = pd.DataFrame({"key": ["A", "B"], "value": [1, 2]})
right = pd.DataFrame({"key": ["A", "B"], "value": [3, 4]})
merged = pd.merge(left, right, on="key", suffixes=("_left", "_right"))
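
In the example above, both tables contain the same keys, so the join type makes no difference. The sketch below uses partially overlapping keys (illustrative data) to contrast the default inner join with an outer join. The indicator=True option adds a _merge column showing where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": ["A", "B", "C"], "value": [1, 2, 3]})
right = pd.DataFrame({"key": ["B", "C", "D"], "value": [20, 30, 40]})

# Inner join (the default): only keys present in both tables.
inner = pd.merge(left, right, on="key", suffixes=("_left", "_right"))

# Outer join: all keys from either table; gaps are filled with NaN.
outer = pd.merge(
    left, right, on="key", how="outer",
    suffixes=("_left", "_right"), indicator=True,
)

print(inner)
print(outer)
```

The how parameter also accepts "left" and "right", which keep all keys from one side only, much like LEFT JOIN and RIGHT JOIN in SQL.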

7. Practical Example: Sales Data Analysis

Assuming a sales data file sales.csv with Date, Product, and Revenue columns, we can perform the following analysis:

sales = pd.read_csv("sales.csv")

# Total revenue per product, highest first
product_sales = sales.groupby("Product")["Revenue"].sum().sort_values(ascending=False)

# Monthly sales trend
sales["Date"] = pd.to_datetime(sales["Date"])
# "M" groups by month end; pandas 2.2 and later prefer the alias "ME"
monthly_sales = sales.resample("M", on="Date")["Revenue"].sum()

# Visualization
import matplotlib.pyplot as plt
monthly_sales.plot(kind="line", title="Monthly Sales Trend")
plt.show()
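
Since sales.csv is not provided, the sketch below runs the same analysis on a few hypothetical rows built in memory. The products, dates, and revenue figures are invented for illustration. It groups by calendar month with dt.to_period("M") instead of resample, which is a substitution on my part that sidesteps the "M"/"ME" alias difference between pandas versions.

```python
import pandas as pd

# Hypothetical rows standing in for sales.csv.
sales = pd.DataFrame({
    "Date": ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-17"],
    "Product": ["Widget", "Gadget", "Widget", "Widget"],
    "Revenue": [100, 250, 150, 50],
})
sales["Date"] = pd.to_datetime(sales["Date"])

# Total revenue per product, highest first.
product_sales = (
    sales.groupby("Product")["Revenue"].sum().sort_values(ascending=False)
)

# Total revenue per calendar month.
monthly_sales = sales.groupby(sales["Date"].dt.to_period("M"))["Revenue"].sum()

print(product_sales)
print(monthly_sales)
```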

Conclusion

Pandas core operations: data reading, selection, cleaning, aggregation, merging.

Key functions: groupby, pivot_table, merge, dropna.

Applicable scenarios: data analysis, data cleaning, business intelligence (BI), machine‑learning preprocessing.

Mastering Pandas data processing techniques can greatly improve data analysis efficiency, providing a solid foundation for subsequent visualization (Matplotlib/Seaborn) and machine learning (Scikit‑learn).
