Fundamentals 19 min read

Master 50 Essential Pandas Exercises to Boost Your Data Skills

This article presents a comprehensive collection of 50 pandas practice problems that guide you through creating Series and DataFrames, performing basic and advanced indexing, grouping, aggregation, data cleaning, hierarchical indexing, and visualisation, each illustrated with clear Python code examples.

Python Crawling & Data Mining

Apr 7, 2020

Master 50 Essential Pandas Exercises to Boost Your Data Skills

Basic Operations

Import pandas as pd and display the version.

import pandas as pd
pd.__version__

Create a Series from a list.

arr = [0, 1, 2, 3, 4]
df = pd.Series(arr)  # default index starts at 0
df

Create a Series from a dictionary.

d = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
df = pd.Series(d)
df

Create a DataFrame from a NumPy array.

dates = pd.date_range('today', periods=6)  # index
num_arr = np.random.randn(6, 4)
columns = ['A', 'B', 'C', 'D']
df = pd.DataFrame(num_arr, index=dates, columns=columns)
df

Read a CSV file with semicolon separator and GBK encoding.

df = pd.read_csv('test.csv', encoding='gbk', sep=';')

Create a DataFrame from a dictionary and set a custom index.

import numpy as np

data = {
    'animal': ['cat','cat','snake','dog','dog','cat','snake','cat','dog','dog'],
    'age': [2.5,3,0.5,np.nan,5,2,4.5,np.nan,7,3],
    'visits': [1,3,2,3,2,3,1,1,2,1],
    'priority': ['yes','yes','no','yes','no','no','no','yes','no','no']
}
labels = ['a','b','c','d','e','f','g','h','i','j']

df = pd.DataFrame(data, index=labels)
df

Show basic information of df (row count, column names, data types).

df.info()
# alternative: df.describe()

Display the first three rows of df.

df.iloc[:3]
# alternative: df.head(3)

Select the animal and age columns.

df.loc[:, ['animal', 'age']]
# alternative: df[['animal', 'age']]

Select rows with index [3,4,8] and the animal and age columns.

df.loc[df.index[[3,4,8]], ['animal', 'age']]

Filter rows where age > 3.

df[df['age'] > 3]

Filter rows where age is missing.

df[df['age'].isnull()]

Filter rows where age is between 2 and 4 (exclusive).

df[(df['age'] > 2) & (df['age'] < 4)]
# alternative: df[df['age'].between(2,4)]

Set the age of row label 'f' to 1.5.

df.loc['f', 'age'] = 1.5

Calculate the sum of the visits column.

df['visits'].sum()

Compute the average age for each distinct animal.

df.groupby('animal')['age'].mean()

Insert a new row with label 'k' and then delete it.

# insert
df.loc['k'] = [5.5, 'dog', 'no', 2]
# delete
df = df.drop('k')
df

Count occurrences of each animal.

df['animal'].value_counts()

Sort by age descending then visits ascending.

df.sort_values(by=['age','visits'], ascending=[False, True])

Map priority values "yes"/"no" to booleans.

df['priority'] = df['priority'].map({'yes': True, 'no': False})
df

Replace "snake" with "python" in the animal column.

df['animal'] = df['animal'].replace('snake', 'python')
df

Pivot to get average age for each animal ‑ visits combination.

df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')

Advanced Operations

Remove duplicate rows based on integer column A.

df = pd.DataFrame({'A':[1,2,2,3,4,5,5,5,6,7,7]})
df1 = df.loc[df['A'].shift() != df['A']]
# alternative: df.drop_duplicates(subset='A')

Subtract the row mean from each element of a numeric DataFrame.

df = pd.DataFrame(np.random.random(size=(5,3)))
df1 = df.sub(df.mean(axis=1), axis=0)

Find the column with the smallest sum in a 5‑column DataFrame.

df = pd.DataFrame(np.random.random(size=(5,5)), columns=list('abcde'))
df.sum().idxmin()

For each value in column A, sum the top‑3 corresponding B values.

df = pd.DataFrame({'A':list('aaabbcaabcccbbc'),
                   'B':[12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
result = df.groupby('A')['B'].nlargest(3).sum(level=0)

Group column A into bins of size 10 (0‑100) and sum B per bin.

df = pd.DataFrame({'A':[1,2,11,11,33,34,35,40,79,99],
                   'B':[1,2,11,11,33,34,35,40,79,99]})
result = df.groupby(pd.cut(df['A'], np.arange(0,101,10)))['B'].sum()

Compute distance to the nearest preceding zero in column X and store in Y.

df = pd.DataFrame({'X':[7,2,0,3,4,2,5,0,3,4]})
izero = np.r_[-1, (df['X'] == 0).to_numpy().nonzero()[0]]
idx = np.arange(len(df))
df['Y'] = idx - izero[np.searchsorted(izero-1, idx)-1]

Return the coordinates of the three largest values in a numeric DataFrame.

df = pd.DataFrame(np.random.random(size=(5,3)))
coords = df.unstack().sort_values()[-3:].index.tolist()

Replace negative values in column vals with the mean of their group.

df = pd.DataFrame({'grps':list('aaabbcaabcccbbc'),
                   'vals':[-12,345,3,1,45,14,4,-52,54,23,-235,21,57,3,87]})

def replace(group):
    mask = group < 0
    group[mask] = group[~mask].mean()
    return group

df['vals'] = df.groupby('grps')['vals'].transform(replace)

Calculate a 3‑point rolling average per group, ignoring NaN.

df = pd.DataFrame({'group':list('aabbabbbabab'),
                    'value':[1,2,3,np.nan,2,3,np.nan,1,7,3,np.nan,8]})
# numerator: sum of non‑NaN values over a 3‑point window per group
num = df.fillna(0).groupby('group')['value'].rolling(3, min_periods=1).sum()
# denominator: count of non‑NaN values over the same window
den = df.groupby('group')['value'].rolling(3, min_periods=1).count()
result = (num / den).reset_index(level=0, drop=True).sort_index()

Series and Datetime Index

Create a Series indexed by all business days of 2015 with random values.

dti = pd.date_range(start='2015-01-01', end='2015-12-31', freq='B')
s = pd.Series(np.random.rand(len(dti)), index=dti)
s.head(10)

Sum values that fall on Wednesdays.

s[s.index.weekday == 2].sum()

Compute the monthly mean.

s.resample('M').mean()

Group every four months and return the date of the maximum value.

s.groupby(pd.Grouper(freq='4M')).idxmax()

Generate the sequence of the third Thursday of each month from 2015 to 2016.

pd.date_range('2015-01-01', '2016-12-31', freq='WOM-3THU')

Data Cleaning

df = pd.DataFrame({
    'From_To': ['LoNDon_paris','MAdrid_miLAN','londON_StockhOlm','Budapest_PaRis','Brussels_londOn'],
    'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
    'RecentDelays': [[23,47], [], [24,43,87], [13], [67,32]],
    'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', '12. Air France', '"Swiss Air"']
})

Interpolate missing FlightNumber values and cast to integer.

df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)

Split From_To into separate From and To columns.

temp = df.From_To.str.split('_', expand=True)
temp.columns = ['From', 'To']
df = df.join(temp).drop('From_To', axis=1)

Capitalize the first letter of From and To.

df['From'] = df['From'].str.capitalize()
df['To'] = df['To'].str.capitalize()

Extract clean airline names by removing punctuation.

df['Airline'] = df['Airline'].str.extract('([a-zA-Z\s]+)', expand=False).str.strip()

Expand the list of recent delays into separate columns delay_1, delay_2, …

delays = df['RecentDelays'].apply(pd.Series)
delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns)+1)]
df = df.drop('RecentDelays', axis=1).join(delays)

Hierarchical Index

letters = ['A','B','C']
numbers = list(range(4))
mi = pd.MultiIndex.from_product([letters, numbers])
s = pd.Series(np.random.rand(12), index=mi)

Check whether the Series index is lexicographically sorted.

s.index.is_lexsorted()
# alternative: s.index.lexsort_depth == s.index.nlevels

Select rows where the second level index is 1 or 3.

s.loc[:, [1,3]]

Slice the Series up to level 'B' and from second‑level index 2 onward.

s.loc[pd.IndexSlice[:'B', 2:]]
# alternative: s.loc[slice(None, 'B'), slice(2, None)]

Compute the sum for each top‑level label (A, B, C).

s.sum(level=0)
# alternative: s.unstack().sum(axis=0)

Swap index levels, verify sorting, and sort if necessary.

new_s = s.swaplevel(0,1)
new_s.index.is_lexsorted()
new_s = new_s.sort_index()

Visualization

import matplotlib.pyplot as plt
df = pd.DataFrame({"xs":[1,5,2,8,1], "ys":[4,2,1,9,6]})
plt.style.use('ggplot')

Scatter plot of the DataFrame.

df.plot.scatter("xs", "ys", color="black", marker="x")

Scatter plot with size and colour encoding.

df = pd.DataFrame({
    "productivity":[5,2,3,1,4,5,6,7,8,3,4,8,9],
    "hours_in":[1,9,6,5,3,9,2,9,1,7,4,2,2],
    "happiness":[2,1,3,2,3,1,2,3,1,2,2,1,3],
    "caffienated":[0,0,1,1,0,0,0,0,1,1,0,1,0]
})
df.plot.scatter("hours_in", "productivity", s=df.happiness*100, c=df.caffienated)

Combine a bar chart (revenue) and a line chart (advertising) sharing the X‑axis but with a secondary Y‑axis.

df = pd.DataFrame({
    "revenue":[57,68,63,71,72,90,80,62,59,51,47,52],
    "advertising":[2.1,1.9,2.7,3.0,3.6,3.2,2.7,2.4,1.8,1.6,1.3,1.9],
    "month":range(12)
})
ax = df.plot.bar("month", "revenue", color="green")
df.plot.line("month", "advertising", secondary_y=True, ax=ax)
ax.set_xlim((-1,12))

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

data cleaning dataframe series

Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.