Master Pandas: A Step‑by‑Step Guide to Data Analysis with Python

This tutorial introduces Pandas, the Python library for data manipulation and analysis. It covers installation, data import, inspection, cleaning, indexing, selection, sorting, grouping, transformation, statistical functions, visualization, and exporting, each illustrated with short code examples.

Python Crawling & Data Mining

Pandas Overview

Pandas is a third‑party Python library designed for flexible data processing and analysis, especially for numeric and time‑series data, but also capable of handling textual data. It was created by Wes McKinney in 2008 and its name derives from the econometrics term “panel data”.

Python Introduction

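Python is a powerful, easy‑to‑learn interpreted language with rich data structures and cross‑platform support, and it is widely used in data science, machine learning, and AI. Beginners are advised to start with Python 3.6 or later; recent pandas releases require a newer Python 3 version, so a currently supported release is the safer choice.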

Installation and Import

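pip install pandas matplotlib openpyxl   # openpyxl lets pandas read and write .xlsx files

On a slow connection, install from a mirror closer to you, for example the Tsinghua University mirror in China:

pip install pandas matplotlib openpyxl -i https://pypi.tuna.tsinghua.edu.cn/simple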

Import the library in a Jupyter Notebook:

import pandas as pd

Dataset Preparation

The tutorial uses a sample Excel file, team.xlsx, containing students' quarterly scores. It can be downloaded from https://www.gairuo.com/file/data/dataset/team.xlsx and has the columns name, team, and Q1 to Q4.

Figure: sample of team.xlsx

Reading Data

df = pd.read_excel('https://www.gairuo.com/file/data/dataset/team.xlsx')
# or df = pd.read_excel('team.xlsx')
# For CSV files use pd.read_csv()

The DataFrame df now holds the data.
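
If the download is not available, a small stand-in table is enough to follow along. The sketch below builds one from CSV text; the four rows and the name toy are made up for illustration and only mimic the layout of team.xlsx (name, team, Q1 to Q4).

```python
import io

import pandas as pd

# Hypothetical rows that copy the column layout of team.xlsx.
csv_text = """name,team,Q1,Q2,Q3,Q4
Ann,A,90,80,70,60
Bob,B,50,60,70,80
Cid,A,30,40,50,60
Dee,C,95,85,75,65
"""

# pd.read_csv accepts a file path, a URL, or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```

The same call works with a file name such as 'team.csv' in place of the StringIO object.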

Viewing Data

df.head()      # first 5 rows
df.tail()      # last 5 rows
df.sample(5)   # random 5 rows

Figure: output of df.head()

Data Verification

df.shape        # (rows, columns)
df.info()       # index, dtypes, memory usage
df.describe()   # statistical summary
df.dtypes       # column types
df.columns      # column names
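
As a minimal sketch of what these checks return, the example below runs them on a made-up four-row frame (the data and the name toy are assumptions, not the contents of team.xlsx).

```python
import pandas as pd

# Hypothetical data with the same columns as team.xlsx.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "team": ["A", "B", "A", "C"],
    "Q1": [90, 50, 30, 95],
    "Q2": [80, 60, 40, 85],
    "Q3": [70, 70, 50, 75],
    "Q4": [60, 80, 60, 65],
})

n_rows, n_cols = toy.shape                       # (4, 6)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
summary = toy.describe()                         # statistics for numeric columns only

print(n_rows, n_cols)
print(numeric_cols)
print(summary.loc["mean"])
```

By default describe() summarizes only the numeric columns, so name and team do not appear in its output.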

Setting Index

df.set_index('name', inplace=True)

Figure: name set as index
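
With inplace=True the frame is modified directly. Without it, set_index returns a new frame and leaves the original untouched, which can be easier to reason about. A small sketch on made-up data (toy and its rows are assumptions):

```python
import pandas as pd

# Hypothetical data with the same columns as team.xlsx.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "team": ["A", "B", "A", "C"],
    "Q1": [90, 50, 30, 95],
})

indexed = toy.set_index("name")    # new frame; toy still has a "name" column
restored = indexed.reset_index()   # moves the index back into a regular column

print(indexed.loc["Bob", "Q1"])    # look up a row by its label
```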

Data Selection

Column selection

df['Q1']          # single column
df[['team','Q1']] # multiple columns
df.loc[:, ['team','Q1']]

Row selection

df[df.index == 'Liver']   # by index value
df[0:3]                   # first three rows
df.iloc[:10, :]           # first ten rows

Label‑based selection

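df.loc['Ben', 'Q1':'Q4']                   # Ben's scores, Q1 through Q4 (both ends included)
df.loc['Eorge':'Alexander', 'team':'Q4']   # rows Eorge through Alexander, columns team through Q4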
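
One detail worth seeing in action: label slices with loc include both endpoints, while position slices with iloc exclude the stop position. The frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical data, indexed by name as in the tutorial.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "team": ["A", "B", "A", "C"],
    "Q1": [90, 50, 30, 95],
    "Q2": [80, 60, 40, 85],
    "Q3": [70, 70, 50, 75],
}).set_index("name")

by_label = toy.loc["Ann":"Cid", "Q1":"Q3"]   # Ann, Bob, Cid and Q1, Q2, Q3
by_position = toy.iloc[0:2, 1:3]             # rows 0-1 and columns 1-2 only

print(by_label.shape)      # (3, 3)
print(by_position.shape)   # (2, 2)
```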

Conditional filtering

df[df.Q1 > 90]
df[df.team == 'C']
df[(df['Q1'] > 90) & (df['team'] == 'C')]
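
Each condition produces a boolean Series, and indexing with it keeps the rows marked True. When combining conditions with & or |, each comparison needs its own parentheses because those operators bind more tightly than > and ==. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical data with the same columns as team.xlsx.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "team": ["A", "B", "A", "C"],
    "Q1": [90, 50, 30, 95],
})

mask = toy["Q1"] > 80                              # boolean Series, one value per row
high_q1 = toy[mask]                                # Ann and Dee
a_or_c = toy[toy["team"].isin(["A", "C"])]         # membership test for several values
both = toy[(toy["Q1"] > 80) & (toy["team"] == "A")]

print(both["name"].tolist())
```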

Sorting

df.sort_values(by='Q1')                                  # ascending by Q1
df.sort_values(by='Q1', ascending=False)                 # descending by Q1
df.sort_values(['team','Q1'], ascending=[True,False])    # team ascending, then Q1 descending within each team
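
sort_values returns a sorted copy, so the result has to be assigned if it is needed later. The sketch below sorts a made-up frame by team and then by Q1 within each team.

```python
import pandas as pd

# Hypothetical data with the same columns as team.xlsx.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "team": ["A", "B", "A", "C"],
    "Q1": [90, 50, 30, 95],
})

# Team A first; inside each team the highest Q1 comes first.
ranked = toy.sort_values(["team", "Q1"], ascending=[True, False])

print(ranked["name"].tolist())   # ['Ann', 'Cid', 'Bob', 'Dee']
print(toy["name"].tolist())      # original order is unchanged
```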

Group‑by Aggregation

df.groupby('team').sum()
df.groupby('team').mean()
df.groupby('team').agg({'Q1':'sum','Q2':'count','Q3':'mean','Q4':'max'})

Figures: output of groupby sum; output of groupby agg
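
To make the grouping concrete, here is a sketch on made-up data. It selects the score columns before summing, and uses named aggregation, an alternative spelling of agg in which each output column gets an explicit name. The frame and the output names q1_sum, members, and q4_max are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with the same columns as team.xlsx.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "team": ["A", "B", "A", "C"],
    "Q1": [90, 50, 30, 95],
    "Q4": [60, 80, 60, 65],
})

# One row per team, summing only the chosen numeric columns.
totals = toy.groupby("team")[["Q1", "Q4"]].sum()

# Named aggregation: output_name=(source_column, function).
stats = toy.groupby("team").agg(
    q1_sum=("Q1", "sum"),
    members=("name", "count"),
    q4_max=("Q4", "max"),
)

print(totals)
print(stats)
```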

Data Transformation

df.groupby('team').sum().T

Figure: transposed result

Adding Columns

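df['one'] = 1                                             # constant column
# Three equivalent ways to compute the four-quarter total:
df['total'] = df['Q1'] + df['Q2'] + df['Q3'] + df['Q4']
df['total'] = df.loc[:, 'Q1':'Q4'].apply(lambda x: sum(x), axis=1)
df['total'] = df.loc[:, 'Q1':'Q4'].sum(axis=1)            # restrict to Q1-Q4; df.sum(axis=1) would also add 'one' and any existing 'total'
df['avg'] = df['total'] / 4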
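
A quick way to check derived columns is to compute them on a tiny frame where the answers are easy to verify by hand. The data below is made up; the list quarters is a helper name introduced here so that only the score columns enter the sums.

```python
import pandas as pd

# Hypothetical data with the same columns as team.xlsx.
toy = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "team": ["A", "B"],
    "Q1": [90, 50],
    "Q2": [80, 60],
    "Q3": [70, 70],
    "Q4": [60, 80],
})

quarters = ["Q1", "Q2", "Q3", "Q4"]
toy["total"] = toy[quarters].sum(axis=1)   # row-wise sum of the four scores
toy["avg"] = toy[quarters].mean(axis=1)    # row-wise mean, no hard-coded 4

print(toy[["name", "total", "avg"]])
```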

Statistical Functions

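df.mean(numeric_only=True)           # column means; numeric_only skips the text column team
df.mean(axis=1, numeric_only=True)   # row means
df.corr(numeric_only=True)           # pairwise correlation of the numeric columns
df.count()                           # non-null values per column
df.max()
df.min()
df.median(numeric_only=True)
df.std(numeric_only=True)
df.var(numeric_only=True)
df.mode()                            # most frequent value(s) per column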
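
Statistics such as the mean and the correlation are only defined for numbers, so a frame that also holds text columns needs them excluded first. One way to do that, sketched here on made-up data, is to select the numeric columns explicitly and then call the functions on that subset.

```python
import pandas as pd

# Hypothetical data with the same columns as team.xlsx.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "team": ["A", "B", "A", "C"],
    "Q1": [90, 50, 30, 95],
    "Q2": [80, 60, 40, 85],
})

numeric = toy.select_dtypes(include="number")   # drops name and team

col_means = numeric.mean()         # one value per column
row_means = numeric.mean(axis=1)   # one value per row
corr = numeric.corr()              # square matrix of pairwise correlations

print(col_means)
print(corr)
```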

Visualization

# line plot of Q1
df['Q1'].plot()
# line plot for a specific student
df.loc['Ben','Q1':'Q4'].plot()
# bar and horizontal bar
df.loc['Ben','Q1':'Q4'].plot.bar()
df.loc['Ben','Q1':'Q4'].plot.barh()
# multiple lines for each team
df.groupby('team').sum().T.plot()
# pie chart of team sizes
df.groupby('team').count().Q1.plot.pie()

Figures: line plot, bar chart, horizontal bar chart, multiple-line plot, pie chart
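
In a notebook the charts render inline. In a plain script nothing is displayed unless the figure is shown or saved, so the sketch below saves one bar chart to a PNG file. The data, the file name ann_scores.png, and the temporary output folder are assumptions for illustration; the Agg backend is chosen so the code runs without a display.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, must be set before pyplot is imported
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, indexed by name as in the tutorial.
toy = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "team": ["A", "B"],
    "Q1": [90, 50],
    "Q2": [80, 60],
    "Q3": [70, 70],
    "Q4": [60, 80],
}).set_index("name")

# One student's four scores as a Series; cast to float so the values are numeric.
scores = toy.loc["Ann", "Q1":"Q4"].astype(float)

ax = scores.plot.bar(title="Ann's quarterly scores")   # returns a matplotlib Axes
ax.set_ylabel("score")

out_path = Path(tempfile.mkdtemp()) / "ann_scores.png"
ax.figure.savefig(out_path)
plt.close(ax.figure)
```

Because plot returns a regular matplotlib Axes, labels, titles, and limits can be adjusted with the usual matplotlib methods before saving.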

Exporting Data

df.to_excel('team-done.xlsx')
df.to_csv('team-done.csv')

Relative file names are resolved against the current working directory, which for a notebook is normally the folder that contains the notebook.
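
A simple check after exporting is to read the file back and compare. The sketch below does a CSV round trip on made-up data in a temporary folder; the folder and the frame are assumptions, while the file name team-done.csv follows the example above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data, indexed by name as in the tutorial.
toy = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "team": ["A", "B"],
    "Q1": [90, 50],
}).set_index("name")

csv_path = Path(tempfile.mkdtemp()) / "team-done.csv"
toy.to_csv(csv_path)                             # the index is written as the first column

back = pd.read_csv(csv_path, index_col="name")   # restore name as the index
print(back)
```

Passing index=False to to_csv omits the index column, which is useful when the index is just the default row numbers.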
