How to Efficiently Retrieve Top‑N Rows per Group in a Pandas DataFrame
This article demonstrates several practical techniques for extracting the top N rows of each group in a pandas DataFrame (sorting before grouping, using drop_duplicates, and NumPy stacking), compares their performance, and wraps the NumPy approach in a reusable function.
Rescuing pandas (5) – Getting Top‑N rows per DataFrame group
Many users avoid pandas for group‑by top‑N tasks; this series shows how to make pandas attractive again.
Data requirement
Given a DataFrame, we need to group by name, sort by date_col, and retrieve the first N rows of each group.
We use a small‑scale dataset to compare execution times of different approaches.
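As a stand-in for the article's dataset (whose exact contents are not shown), a small hypothetical frame with a name column, a date column, and a value column:

```python
import pandas as pd

# Hypothetical small dataset: three names with one or more dated rows each
df = pd.DataFrame({
    'name': ['a', 'a', 'a', 'b', 'b', 'c'],
    'date_col': pd.to_datetime(['2023-01-03', '2023-01-01', '2023-01-02',
                                '2023-02-01', '2023-01-15', '2023-03-01']),
    'value': [10, 20, 30, 40, 50, 60],
})
print(df.shape)  # (6, 3)
```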
Requirement breakdown
DataFrames provide head() and tail() for the first/last rows. After groupby, applying head() yields the first rows of each group, but to get top‑N we must sort the DataFrame first.
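A minimal sketch (toy data, hypothetical values) of why the sort is needed: groupby(...).head(1) returns each group's first stored row, which without sorting is not necessarily the earliest date:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'a', 'b'],
    'date_col': pd.to_datetime(['2023-01-03', '2023-01-01', '2023-02-01']),
})

# head(1) after groupby keeps each group's first row in storage order...
unsorted_first = df.groupby('name').head(1)
# ...so group 'a' yields 2023-01-03, not its earliest date 2023-01-01
print(unsorted_first)
```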
Solution approaches
Method 1 – Sort then group
Sort the entire DataFrame before grouping, then use groupby(...).head(N). This can be done either by sorting first or by sorting within each group using apply.
Sort first, then group
df.sort_values(['name', 'date_col'], inplace=True)
df.groupby(['name']).head(1)

Group first, then sort
A GroupBy object has no sort_values method, so we use apply to sort each group instead.
df.groupby(['name']).apply(lambda x: x.sort_values('date_col').head(1)).reset_index(drop=True)

The timing difference is an order of magnitude; the slowdown is caused mainly by apply, not by sort_values.
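The relative cost can be checked with a rough timing sketch; the synthetic data, row count, and group count below are assumptions, not the article's benchmark:

```python
import time

import numpy as np
import pandas as pd

# Synthetic benchmark data: 50k rows spread over 500 groups
rng = np.random.default_rng(0)
df = pd.DataFrame({'name': rng.integers(0, 500, 50_000),
                   'date_col': rng.integers(0, 10**6, 50_000)})

def timed(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

# Method 1a: sort the whole frame once, then take each group's head
t_sort_first = timed(lambda: df.sort_values(['name', 'date_col'])
                               .groupby('name').head(1))
# Method 1b: sort inside apply, once per group
t_apply = timed(lambda: df.groupby('name')
                          .apply(lambda g: g.sort_values('date_col').head(1))
                          .reset_index(drop=True))
print(f'sort-then-group: {t_sort_first:.4f}s, apply: {t_apply:.4f}s')
```

On typical hardware the apply variant is dramatically slower because it pays per-group Python overhead.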
Method 2 – Use drop_duplicates
When only the first row per group is needed, drop_duplicates(['name']) (default keeps the first) works efficiently after sorting.
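On an already-sorted frame, drop_duplicates gives exactly the same rows as groupby(...).head(1); a quick sanity check with toy data (values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['b', 'a', 'a', 'b'],
    'date_col': pd.to_datetime(['2023-02-01', '2023-01-02',
                                '2023-01-01', '2023-01-15']),
})
df = df.sort_values(['name', 'date_col'])

via_head = df.groupby('name').head(1)
via_dedup = df.drop_duplicates('name')  # keep='first' is the default

# Both keep the earliest-dated row per name
print(via_head.equals(via_dedup))  # True
```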
df.sort_values(['name', 'date_col'], inplace=True)
df.drop_duplicates(['name'])  # keeps first by default

Method 3 – No groupby (using counts and NumPy)
We can achieve the same result by counting occurrences with value_counts, generating a range for each group, and stacking the results with np.hstack. The generated rank array is then used to filter the original sorted DataFrame.
import numpy as np

k = 1  # number of rows to keep per group
df.sort_values(['name', 'date_col'], inplace=True)
name_count = df.value_counts('name', sort=False)
# One range(count) per name, concatenated into a within-group rank array
rank = np.hstack(name_count.map(lambda x: range(x)).values)
df = df[rank < k]

Encapsulated as a reusable function:
import numpy as np
import pandas as pd
def get_data_top(data: pd.DataFrame, group_cols: list, val_cols: list,
                 ascending: bool = True, k: int = 1):
    """Return top-k rows per group.

    Parameters
    ----------
    data : DataFrame
    group_cols : columns to group by
    val_cols : columns to sort by within each group
    ascending : sort order for val_cols
    k : number of rows to keep per group
    """
    datac = data.reset_index().copy()
    index_colname = datac.columns[0]
    # Group columns must stay ascending so that row order matches
    # value_counts(group_cols, sort=False), which returns groups in
    # ascending key order; only val_cols follow `ascending`.
    datac.sort_values(group_cols + val_cols,
                      ascending=[True] * len(group_cols) + [ascending] * len(val_cols),
                      inplace=True)
    rank0 = np.hstack(
        datac.value_counts(group_cols, sort=False).map(lambda x: range(x)).values)
    datac = datac[rank0 < k]
    datac.index = datac[index_colname].values
    del datac[index_colname]
    return datac

Conclusion
Three methods achieve the same top‑N extraction; sorting before grouping is generally the fastest, while apply introduces significant overhead. The NumPy‑based approach offers comparable performance and can be wrapped into a convenient function.
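As a quick end-to-end check, the helper can be exercised on toy data (restated here so the sketch runs on its own; names and values are hypothetical):

```python
import numpy as np
import pandas as pd

def get_data_top(data, group_cols, val_cols, ascending=True, k=1):
    """Top-k rows per group via sorting plus a NumPy rank array."""
    datac = data.reset_index().copy()
    index_colname = datac.columns[0]
    # Keep group columns ascending so row order matches value_counts order
    datac.sort_values(group_cols + val_cols,
                      ascending=[True] * len(group_cols) + [ascending] * len(val_cols),
                      inplace=True)
    rank0 = np.hstack(
        datac.value_counts(group_cols, sort=False).map(lambda x: range(x)).values)
    datac = datac[rank0 < k]
    datac.index = datac[index_colname].values
    del datac[index_colname]
    return datac

df = pd.DataFrame({'name': ['a', 'a', 'a', 'b', 'b'],
                   'date_col': [3, 1, 2, 5, 4]})
# Earliest date per name, original row index preserved
top1 = get_data_top(df, ['name'], ['date_col'], ascending=True, k=1)
print(top1)
```

Because the original row index is preserved, the result can be joined straight back to the source frame.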
Embrace experimentation and create efficient data‑processing pipelines.
Python Crawling & Data Mining