Pandas Tips Summary: 13 Practical Techniques for Efficient Data Analysis
This article compiles thirteen concise pandas techniques—including missing‑value calculation, group‑by max row extraction, multi‑column aggregation, string filtering, memory optimization, and more—to help Python users perform data analysis more efficiently and effectively.
1. Calculate Variable Missing Rate
Read a CSV file and define a function that returns each column's missing‑value percentage.
import pandas as pd

df = pd.read_csv('titanic_train.csv')

def missing_cal(df):
    """Calculate the missing rate for each column of the DataFrame."""
    missing_series = df.isnull().sum() / df.shape[0]
    missing_df = pd.DataFrame(missing_series).reset_index()
    missing_df = missing_df.rename(columns={'index': 'col', 0: 'missing_pct'})
    missing_df = missing_df.sort_values('missing_pct', ascending=False).reset_index(drop=True)
    return missing_df
missing_cal(df)

To compute missing rates per row instead, pass axis=1 to sum().
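The per-row variant can be sketched on a small toy frame (hypothetical data, not the article's Titanic file):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the Titanic data
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, 6.0]})

# Count NaNs across columns (axis=1) and divide by the number of columns
row_missing_pct = df.isnull().sum(axis=1) / df.shape[1]
print(row_missing_pct)  # 0.5, 1.0, 0.0
```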
2. Retrieve Row with Maximum Value in Each Group
Two scenarios are shown: groups without duplicate maximums and groups with duplicates.
# Example without duplicates
import pandas as pd
df = pd.DataFrame({
    'Sp': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Mt': ['s1', 's1', 's2', 's2', 's2', 's3'],
    'Value': [1, 2, 3, 4, 5, 6],
    'Count': [3, 2, 5, 10, 10, 6]
})
# Get rows where 'Count' is maximal within each 'Mt' group
result = df.loc[df.groupby('Mt')['Count'].idxmax()]
print(result)

For groups with duplicate maximums, rank within each group and keep every row with rank 1:
# Assumes a DataFrame with 'ID', 'score' and 'class' columns, and numpy imported as np
df['rank'] = df.groupby('ID')['score'].rank(method='min', ascending=False).astype(np.int64)
filtered = df[df['rank'] == 1][['ID', 'class']]
print(filtered)

3. Merge Multiple Columns into One Row
df = pd.DataFrame({
    'id_part': ['a', 'b', 'c', 'd'],
    'pred': [0.1, 0.2, 0.3, 0.4],
    'pred_class': ['women', 'man', 'cat', 'dog'],
    'v_id': ['d1', 'd2', 'd3', 'd1']
})
merged = df.groupby(['v_id']).agg({
'pred_class': [', '.join],
'pred': lambda x: list(x),
'id_part': 'first'
}).reset_index()
print(merged)

4. Delete Rows Containing a Specific Substring
df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': ['s1', 'exp_s2', 's3', 'exps4'],
    'c': [5, 6, 7, 8],
    'd': [3, 2, 5, 10]
})
# Negate str.contains with ~ to drop rows whose 'b' contains 'exp'
filtered = df[~df['b'].str.contains('exp')]
print(filtered)

5. In‑Group Sorting
Two efficient methods for sorting within groups are demonstrated.
# Method 1: sort by multiple columns
sorted_df = df.sort_values(['name','score'], ascending=[True, False])
# Method 2: apply sort within each group
sorted_df = df.groupby('name').apply(lambda x: x.sort_values('score', ascending=False)).reset_index(drop=True)
print(sorted_df)

6. Select Columns of Specific Data Types
drinks = pd.read_csv('data/drinks.csv')
# Numeric columns
numeric = drinks.select_dtypes(include=['number']).head()
# Object (string) columns
objects = drinks.select_dtypes(include=['object']).head()
# Multiple types
mixed = drinks.select_dtypes(include=['number','object','category','datetime']).head()
# Exclude numeric columns
excluded = drinks.select_dtypes(exclude=['number']).head()
print(numeric, objects, mixed, excluded)

7. Convert String Columns to Numeric
df = pd.DataFrame({
    '列1': ['1.1', '2.2', '3.3'],
    '列2': ['4.4', '5.5', '6.6'],
    '列3': ['7.7', '8.8', '-']
})
# Direct conversion (fails for column with '-')
print(df.astype({'列1':'float','列2':'float'}).dtypes)
# Use to_numeric with coercion
df = df.apply(pd.to_numeric, errors='coerce').fillna(0)
print(df)

8. Optimize DataFrame Memory Usage
Method 1: Load only the required columns using usecols.
cols = ['beer_servings','continent']
small_drinks = pd.read_csv('data/drinks.csv', usecols=cols)
print(small_drinks.head())

Method 2: Convert object columns with categorical data to the category dtype.
dtypes = {'continent': 'category'}
smaller_drinks = pd.read_csv('data/drinks.csv', usecols=cols, dtype=dtypes)
print(smaller_drinks.info())

9. Filter DataFrame by Top N Categories
movies = pd.read_csv('data/imdb_1000.csv')
counts = movies.genre.value_counts()
top_movies = movies[movies.genre.isin(counts.nlargest(3).index)].head()
print(top_movies)

10. Split a String Column into Multiple Columns
df = pd.DataFrame({
    '姓名': ['张 三', '李 四', '王 五'],
    '所在地': ['北京-东城区', '上海-黄浦区', '广州-白云区']
})
# Split name by space
name_split = df.姓名.str.split(' ', expand=True)
print(name_split)
# Split location into city and district on '-'
location_split = df.所在地.str.split('-', expand=True)
print(location_split)

11. Convert List Elements in a Series to a DataFrame
df = pd.DataFrame({
    '列1': ['a', 'b', 'c'],
    '列2': [[10, 20], [20, 30], [30, 40]]
})
expanded = df.列2.apply(pd.Series)
result = pd.concat([df, expanded], axis='columns')
print(result)

12. Aggregate with Multiple Functions
orders = pd.read_csv('data/chipotle.tsv', sep='\t')
agg = orders.groupby('order_id').item_price.agg(['sum','count']).head()
print(agg)

13. Groupby Aggregation Examples
import pandas as pd, numpy as np
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})
# Iterate groups
for name, group in df.groupby('key1'):
    print(name)
    print(group)
# Convert groups to dictionary
group_dict = dict(list(df.groupby('key1')))
print(group_dict)
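A single group can also be pulled out directly with get_group, without iterating; a small supplementary sketch (not from the original article):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': np.random.randn(5),
})

# Extract the sub-DataFrame for one group by its key
group_a = df.groupby('key1').get_group('a')
print(group_a)
```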
# Group by column mapping
people = pd.DataFrame(np.random.randn(5,5), columns=list('abcde'), index=['Joe','Steve','Wes','Jim','Travis'])
mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
# axis=1 groups the columns; deprecated in recent pandas
# (equivalently: people.T.groupby(mapping).sum().T)
by_column = people.groupby(mapping, axis=1)
print(by_column.sum())
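agg also accepts a per-column mapping, so different columns can get different aggregations in one call; a supplementary sketch with fixed values in place of the random data above:

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Mean of data1 and sum of data2, per key1 group
summary = df.groupby('key1').agg({'data1': 'mean', 'data2': 'sum'})
print(summary)
```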