Pandas Tips Summary: 13 Practical Techniques for Efficient Data Analysis
This article compiles thirteen concise pandas techniques—including missing‑value calculation, group‑by max row extraction, multi‑column aggregation, string filtering, memory optimization, and more—to help Python users perform data analysis more efficiently and effectively.
1. Calculate Variable Missing Rate
Read a CSV file and define a function that returns each column's missing‑value percentage.
import pandas as pd

df = pd.read_csv('titanic_train.csv')

def missing_cal(df):
    """Calculate the missing rate for each column of the DataFrame."""
    missing_series = df.isnull().sum() / df.shape[0]
    missing_df = pd.DataFrame(missing_series).reset_index()
    missing_df = missing_df.rename(columns={'index': 'col', 0: 'missing_pct'})
    missing_df = missing_df.sort_values('missing_pct', ascending=False).reset_index(drop=True)
    return missing_df
missing_cal(df)

To compute missing rates per row instead, pass axis=1 to sum().
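The per-row variant can be sketched on a small toy frame (hypothetical data, not the article's Titanic file):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the Titanic data
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, 6.0]})

# Count NaNs across columns (axis=1) and divide by the number of columns
row_missing_pct = df.isnull().sum(axis=1) / df.shape[1]
print(row_missing_pct)  # 0.5, 1.0, 0.0
```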
2. Retrieve Row with Maximum Value in Each Group
Two scenarios are shown: groups without duplicate maximums and groups with duplicates.
# Example without duplicates
import pandas as pd
df = pd.DataFrame({
    'Sp': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Mt': ['s1', 's1', 's2', 's2', 's2', 's3'],
    'Value': [1, 2, 3, 4, 5, 6],
    'Count': [3, 2, 5, 10, 10, 6]
})
# Get rows where 'Count' is maximal within each 'Mt' group
result = df.loc[df.groupby('Mt')['Count'].idxmax()]
print(result)

For groups with duplicate maximums, rank within each group and keep every row with rank 1:
# Assumes a DataFrame with 'ID', 'score' and 'class' columns, and numpy imported as np
df['rank'] = df.groupby('ID')['score'].rank(method='min', ascending=False).astype(np.int64)
filtered = df[df['rank'] == 1][['ID', 'class']]
print(filtered)

3. Merge Multiple Columns into One Row
df = pd.DataFrame({
    'id_part': ['a', 'b', 'c', 'd'],
    'pred': [0.1, 0.2, 0.3, 0.4],
    'pred_class': ['women', 'man', 'cat', 'dog'],
    'v_id': ['d1', 'd2', 'd3', 'd1']
})
merged = df.groupby(['v_id']).agg({
'pred_class': [', '.join],
'pred': lambda x: list(x),
'id_part': 'first'
}).reset_index()
print(merged)

4. Delete Rows Containing a Specific Substring
df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': ['s1', 'exp_s2', 's3', 'exps4'],
    'c': [5, 6, 7, 8],
    'd': [3, 2, 5, 10]
})
# Negate str.contains with ~ to drop rows whose 'b' contains 'exp'
filtered = df[~df['b'].str.contains('exp')]
print(filtered)

5. In‑Group Sorting
Two efficient methods for sorting within groups are demonstrated.
# Method 1: sort by multiple columns
sorted_df = df.sort_values(['name','score'], ascending=[True, False])
# Method 2: apply sort within each group
sorted_df = df.groupby('name').apply(lambda x: x.sort_values('score', ascending=False)).reset_index(drop=True)
print(sorted_df)

6. Select Columns of Specific Data Types
drinks = pd.read_csv('data/drinks.csv')
# Numeric columns
numeric = drinks.select_dtypes(include=['number']).head()
# Object (string) columns
objects = drinks.select_dtypes(include=['object']).head()
# Multiple types
mixed = drinks.select_dtypes(include=['number','object','category','datetime']).head()
# Exclude numeric columns
excluded = drinks.select_dtypes(exclude=['number']).head()
print(numeric, objects, mixed, excluded)

7. Convert String Columns to Numeric
df = pd.DataFrame({
    '列1': ['1.1', '2.2', '3.3'],
    '列2': ['4.4', '5.5', '6.6'],
    '列3': ['7.7', '8.8', '-']
})
# Direct conversion (fails for column with '-')
print(df.astype({'列1':'float','列2':'float'}).dtypes)
# Use to_numeric with coercion
df = df.apply(pd.to_numeric, errors='coerce').fillna(0)
print(df)

8. Optimize DataFrame Memory Usage
Method 1: Load only the required columns using usecols.
cols = ['beer_servings','continent']
small_drinks = pd.read_csv('data/drinks.csv', usecols=cols)
print(small_drinks.head())

Method 2: Convert object columns with categorical data to the category dtype.
dtypes = {'continent': 'category'}
smaller_drinks = pd.read_csv('data/drinks.csv', usecols=cols, dtype=dtypes)
print(smaller_drinks.info())

9. Filter DataFrame by Top N Categories
movies = pd.read_csv('data/imdb_1000.csv')
counts = movies.genre.value_counts()
top_movies = movies[movies.genre.isin(counts.nlargest(3).index)].head()
print(top_movies)

10. Split a String Column into Multiple Columns
df = pd.DataFrame({
    '姓名': ['张 三', '李 四', '王 五'],
    '所在地': ['北京-东城区', '上海-黄浦区', '广州-白云区']
})
# Split name by space
name_split = df.姓名.str.split(' ', expand=True)
print(name_split)
# Split location into city and district on '-'
location_split = df.所在地.str.split('-', expand=True)
print(location_split)

11. Convert List Elements in a Series to a DataFrame
df = pd.DataFrame({
    '列1': ['a', 'b', 'c'],
    '列2': [[10, 20], [20, 30], [30, 40]]
})
expanded = df.列2.apply(pd.Series)
result = pd.concat([df, expanded], axis='columns')
print(result)

12. Aggregate with Multiple Functions
orders = pd.read_csv('data/chipotle.tsv', sep='\t')
agg = orders.groupby('order_id').item_price.agg(['sum','count']).head()
print(agg)

13. Groupby Aggregation Examples
import pandas as pd, numpy as np
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})
# Iterate groups
for name, group in df.groupby('key1'):
    print(name)
    print(group)
# Convert groups to dictionary
group_dict = dict(list(df.groupby('key1')))
print(group_dict)
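A single group can also be pulled out directly with get_group, without iterating; a small supplementary sketch (not from the original article):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': np.random.randn(5),
})

# Extract the sub-DataFrame for one group by its key
group_a = df.groupby('key1').get_group('a')
print(group_a)
```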
# Group by column mapping
people = pd.DataFrame(np.random.randn(5,5), columns=list('abcde'), index=['Joe','Steve','Wes','Jim','Travis'])
mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
# axis=1 groups the columns; deprecated in recent pandas
# (equivalently: people.T.groupby(mapping).sum().T)
by_column = people.groupby(mapping, axis=1)
print(by_column.sum())
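agg also accepts a per-column mapping, so different columns can get different aggregations in one call; a supplementary sketch with fixed values in place of the random data above:

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Mean of data1 and sum of data2, per key1 group
summary = df.groupby('key1').agg({'data1': 'mean', 'data2': 'sum'})
print(summary)
```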