Master 50 Essential Pandas Exercises to Boost Your Data Skills
This article presents a comprehensive collection of 50 pandas practice problems that guide you through creating Series and DataFrames, performing basic and advanced indexing, grouping, aggregation, data cleaning, hierarchical indexing, and visualisation, each illustrated with clear Python code examples.
Basic Operations
Import pandas as pd and display the version.
import pandas as pd
pd.__version__Create a Series from a list.
arr = [0, 1, 2, 3, 4]
df = pd.Series(arr) # default index starts at 0
dfCreate a Series from a dictionary.
d = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
df = pd.Series(d)
dfCreate a DataFrame from a NumPy array.
dates = pd.date_range('today', periods=6) # index
num_arr = np.random.randn(6, 4)
columns = ['A', 'B', 'C', 'D']
df = pd.DataFrame(num_arr, index=dates, columns=columns)
dfRead a CSV file with semicolon separator and GBK encoding.
df = pd.read_csv('test.csv', encoding='gbk', sep=';')Create a DataFrame from a dictionary and set a custom index.
import numpy as np
data = {
'animal': ['cat','cat','snake','dog','dog','cat','snake','cat','dog','dog'],
'age': [2.5,3,0.5,np.nan,5,2,4.5,np.nan,7,3],
'visits': [1,3,2,3,2,3,1,1,2,1],
'priority': ['yes','yes','no','yes','no','no','no','yes','no','no']
}
labels = ['a','b','c','d','e','f','g','h','i','j']
df = pd.DataFrame(data, index=labels)
dfShow basic information of df (row count, column names, data types).
df.info()
# alternative: df.describe()Display the first three rows of df.
df.iloc[:3]
# alternative: df.head(3)Select the animal and age columns.
df.loc[:, ['animal', 'age']]
# alternative: df[['animal', 'age']]Select rows with index [3,4,8] and the animal and age columns.
df.loc[df.index[[3,4,8]], ['animal', 'age']]Filter rows where age > 3.
df[df['age'] > 3]Filter rows where age is missing.
df[df['age'].isnull()]Filter rows where age is between 2 and 4 (exclusive).
df[(df['age'] > 2) & (df['age'] < 4)]
# alternative: df[df['age'].between(2,4)]Set the age of row label 'f' to 1.5.
df.loc['f', 'age'] = 1.5Calculate the sum of the visits column.
df['visits'].sum()Compute the average age for each distinct animal.
df.groupby('animal')['age'].mean()Insert a new row with label 'k' and then delete it.
# insert
df.loc['k'] = [5.5, 'dog', 'no', 2]
# delete
df = df.drop('k')
dfCount occurrences of each animal.
df['animal'].value_counts()Sort by age descending then visits ascending.
df.sort_values(by=['age','visits'], ascending=[False, True])Map priority values "yes"/"no" to booleans.
df['priority'] = df['priority'].map({'yes': True, 'no': False})
dfReplace "snake" with "python" in the animal column.
df['animal'] = df['animal'].replace('snake', 'python')
dfPivot to get average age for each animal ‑ visits combination.
df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')Advanced Operations
Remove duplicate rows based on integer column A.
df = pd.DataFrame({'A':[1,2,2,3,4,5,5,5,6,7,7]})
df1 = df.loc[df['A'].shift() != df['A']]
# alternative: df.drop_duplicates(subset='A')Subtract the row mean from each element of a numeric DataFrame.
df = pd.DataFrame(np.random.random(size=(5,3)))
df1 = df.sub(df.mean(axis=1), axis=0)Find the column with the smallest sum in a 5‑column DataFrame.
df = pd.DataFrame(np.random.random(size=(5,5)), columns=list('abcde'))
df.sum().idxmin()For each value in column A, sum the top‑3 corresponding B values.
df = pd.DataFrame({'A':list('aaabbcaabcccbbc'),
'B':[12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
result = df.groupby('A')['B'].nlargest(3).sum(level=0)Group column A into bins of size 10 (0‑100) and sum B per bin.
df = pd.DataFrame({'A':[1,2,11,11,33,34,35,40,79,99],
'B':[1,2,11,11,33,34,35,40,79,99]})
result = df.groupby(pd.cut(df['A'], np.arange(0,101,10)))['B'].sum()Compute distance to the nearest preceding zero in column X and store in Y.
df = pd.DataFrame({'X':[7,2,0,3,4,2,5,0,3,4]})
izero = np.r_[-1, (df['X'] == 0).to_numpy().nonzero()[0]]
idx = np.arange(len(df))
df['Y'] = idx - izero[np.searchsorted(izero-1, idx)-1]Return the coordinates of the three largest values in a numeric DataFrame.
df = pd.DataFrame(np.random.random(size=(5,3)))
coords = df.unstack().sort_values()[-3:].index.tolist()Replace negative values in column vals with the mean of their group.
df = pd.DataFrame({'grps':list('aaabbcaabcccbbc'),
'vals':[-12,345,3,1,45,14,4,-52,54,23,-235,21,57,3,87]})
def replace(group):
mask = group < 0
group[mask] = group[~mask].mean()
return group
df['vals'] = df.groupby('grps')['vals'].transform(replace)Calculate a 3‑point rolling average per group, ignoring NaN.
df = pd.DataFrame({'group':list('aabbabbbabab'),
'value':[1,2,3,np.nan,2,3,np.nan,1,7,3,np.nan,8]})
# numerator: sum of non‑NaN values over a 3‑point window per group
num = df.fillna(0).groupby('group')['value'].rolling(3, min_periods=1).sum()
# denominator: count of non‑NaN values over the same window
den = df.groupby('group')['value'].rolling(3, min_periods=1).count()
result = (num / den).reset_index(level=0, drop=True).sort_index()Series and Datetime Index
Create a Series indexed by all business days of 2015 with random values.
dti = pd.date_range(start='2015-01-01', end='2015-12-31', freq='B')
s = pd.Series(np.random.rand(len(dti)), index=dti)
s.head(10)Sum values that fall on Wednesdays.
s[s.index.weekday == 2].sum()Compute the monthly mean.
s.resample('M').mean()Group every four months and return the date of the maximum value.
s.groupby(pd.Grouper(freq='4M')).idxmax()Generate the sequence of the third Thursday of each month from 2015 to 2016.
pd.date_range('2015-01-01', '2016-12-31', freq='WOM-3THU')Data Cleaning
df = pd.DataFrame({
'From_To': ['LoNDon_paris','MAdrid_miLAN','londON_StockhOlm','Budapest_PaRis','Brussels_londOn'],
'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
'RecentDelays': [[23,47], [], [24,43,87], [13], [67,32]],
'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', '12. Air France', '"Swiss Air"']
})Interpolate missing FlightNumber values and cast to integer.
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)Split From_To into separate From and To columns.
temp = df.From_To.str.split('_', expand=True)
temp.columns = ['From', 'To']
df = df.join(temp).drop('From_To', axis=1)Capitalize the first letter of From and To.
df['From'] = df['From'].str.capitalize()
df['To'] = df['To'].str.capitalize()Extract clean airline names by removing punctuation.
df['Airline'] = df['Airline'].str.extract('([a-zA-Z\s]+)', expand=False).str.strip()Expand the list of recent delays into separate columns delay_1, delay_2, …
delays = df['RecentDelays'].apply(pd.Series)
delays.columns = ['delay_{}'.format(n) for n in range(1, len(delays.columns)+1)]
df = df.drop('RecentDelays', axis=1).join(delays)Hierarchical Index
letters = ['A','B','C']
numbers = list(range(4))
mi = pd.MultiIndex.from_product([letters, numbers])
s = pd.Series(np.random.rand(12), index=mi)Check whether the Series index is lexicographically sorted.
s.index.is_lexsorted()
# alternative: s.index.lexsort_depth == s.index.nlevelsSelect rows where the second level index is 1 or 3.
s.loc[:, [1,3]]Slice the Series up to level 'B' and from second‑level index 2 onward.
s.loc[pd.IndexSlice[:'B', 2:]]
# alternative: s.loc[slice(None, 'B'), slice(2, None)]Compute the sum for each top‑level label (A, B, C).
s.sum(level=0)
# alternative: s.unstack().sum(axis=0)Swap index levels, verify sorting, and sort if necessary.
new_s = s.swaplevel(0,1)
new_s.index.is_lexsorted()
new_s = new_s.sort_index()Visualization
import matplotlib.pyplot as plt
df = pd.DataFrame({"xs":[1,5,2,8,1], "ys":[4,2,1,9,6]})
plt.style.use('ggplot')Scatter plot of the DataFrame.
df.plot.scatter("xs", "ys", color="black", marker="x")Scatter plot with size and colour encoding.
df = pd.DataFrame({
"productivity":[5,2,3,1,4,5,6,7,8,3,4,8,9],
"hours_in":[1,9,6,5,3,9,2,9,1,7,4,2,2],
"happiness":[2,1,3,2,3,1,2,3,1,2,2,1,3],
"caffienated":[0,0,1,1,0,0,0,0,1,1,0,1,0]
})
df.plot.scatter("hours_in", "productivity", s=df.happiness*100, c=df.caffienated)Combine a bar chart (revenue) and a line chart (advertising) sharing the X‑axis but with a secondary Y‑axis.
df = pd.DataFrame({
"revenue":[57,68,63,71,72,90,80,62,59,51,47,52],
"advertising":[2.1,1.9,2.7,3.0,3.6,3.2,2.7,2.4,1.8,1.6,1.3,1.9],
"month":range(12)
})
ax = df.plot.bar("month", "revenue", color="green")
df.plot.line("month", "advertising", secondary_y=True, ax=ax)
ax.set_xlim((-1,12))Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
