How to Filter Duplicate Rows Within 20 Seconds Using Pandas

This article walks through a Python pandas solution that groups a dataset by multiple fields, orders rows by end time, and retains only the first record when successive end times are within a 20‑second window, complete with optimized code snippets and explanations.

Python Crawling & Data Mining

Jan 1, 2024

Hello, I'm PiPi. This article addresses a Python automation task where a dataset needs to be grouped by ID, step, reviewer, and amount, sorted by end time, and within each group only the first record is kept when consecutive end times differ by less than 20 seconds.

Problem Description

The source table contains columns: 编号 (ID), 环节 (step), 审核人 (reviewer), 金额 (amount), and 结束时间 (end time). The requirement is to group by the first four columns, sort each group by 结束时间 in ascending order, and if two rows in the same group have end times within 20 seconds, keep only the earliest row.
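To make the rule concrete, here is a minimal sketch on a made-up sample. The values below are hypothetical and are not taken from 工作量计算.xlsx; only the column names come from the problem description. It computes the gap to the previous row inside each group and, under the reading "drop a row that ends less than 20 seconds after the row before it", shows which rows would survive.

```python
import pandas as pd

# Hypothetical sample data; the real workbook is not reproduced here.
sample = pd.DataFrame({
    "编号": ["A001", "A001", "A001", "A002"],
    "环节": ["初审", "初审", "初审", "初审"],
    "审核人": ["张三", "张三", "张三", "李四"],
    "金额": [100, 100, 100, 250],
    "结束时间": pd.to_datetime([
        "2024-01-01 09:00:00",
        "2024-01-01 09:00:05",
        "2024-01-01 09:01:00",
        "2024-01-01 09:00:03",
    ]),
})

keys = ["编号", "环节", "审核人", "金额"]
sample = sample.sort_values(keys + ["结束时间"])

# Seconds since the previous row of the same group (NaN for each group's first row).
gap = sample.groupby(keys)["结束时间"].diff().dt.total_seconds()

# NaN < 20 evaluates to False, so the first row of every group is always kept.
kept = sample[~(gap < 20)]
print(kept.index.tolist())  # [0, 2, 3]
```

Row 1 ends 5 seconds after row 0 in the same group, so it is dropped. Row 3 belongs to a different group and is unaffected.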

Initial Implementation

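import pandas as pd

def func(df_split):
    """Keep a row only if it ends more than 20 s after the last kept row."""
    last_time = None  # end time of the most recently kept row
    idx = []          # index labels of the rows to keep
    for row in df_split.itertuples():
        if last_time is None or (row.结束时间 - last_time).total_seconds() > 20:
            idx.append(row.Index)
            last_time = row.结束时间
    return df_split.loc[idx]

df = pd.read_excel("工作量计算.xlsx")
# Sort first so that each group reaches func in ascending end-time order.
result = df.sort_values(["编号", "环节", "审核人", "金额", "结束时间"]).groupby(
    ["编号", "环节", "审核人", "金额"], as_index=False).apply(func)
result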

The result matches the reference solution and leaves 3394 rows.

Optimized Version

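import pandas as pd

def func(df_split):
    """Keep a row only if it ends more than 20 s after the last kept row."""
    last_time = None  # end time of the most recently kept row
    idx = []          # index labels of the rows to keep
    for row in df_split.itertuples():
        if last_time is None or (row.结束时间 - last_time).total_seconds() > 20:
            idx.append(row.Index)
            last_time = row.结束时间
    return df_split.loc[idx]

df = pd.read_excel("工作量计算.xlsx")
result = (
    df.sort_values(["编号", "环节", "审核人", "金额", "结束时间"])
      .groupby(["编号", "环节", "审核人", "金额"], as_index=False)
      .apply(func)
      .droplevel(0)  # drop the outer group-number level, keep the original row labels
)
result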

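The filtering logic is unchanged. The only difference is the index: groupby(..., as_index=False).apply(func) returns a two-level index, with the group number as the outer level and the original row label as the inner level. Adding .droplevel(0) removes the outer level, so the result is indexed by the original row labels again.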
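The effect of .droplevel(0) can be seen in isolation with a small sketch. To keep it independent of the workbook, the two-level index is built here with pd.concat(keys=...), which is a stand-in for the shape returned by the groupby call, not the article's own code. The frames group_a and group_b are hypothetical.

```python
import pandas as pd

# Hypothetical rows that survived filtering in two groups,
# still carrying their original row labels (0, 2 and 3).
group_a = pd.DataFrame({"金额": [100, 100]}, index=[0, 2])
group_b = pd.DataFrame({"金额": [250]}, index=[3])

# keys= adds an outer level holding the group number,
# giving a (group number, original label) MultiIndex.
stacked = pd.concat([group_a, group_b], keys=[0, 1])
print(stacked.index.tolist())  # [(0, 0), (0, 2), (1, 3)]

# droplevel(0) removes the outer level and leaves the original labels.
flat = stacked.droplevel(0)
print(flat.index.tolist())     # [0, 2, 3]
```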

Further Simplification

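import pandas as pd

def filter_rows(group):
    # Gap between each row and the row immediately before it (NaT for the first row).
    diff = group.结束时间.diff()
    # True where a row ends less than 20 s after the previous row; NaN compares as False.
    mask = diff.dt.total_seconds() < 20
    # Keep the unmasked rows; drop_duplicates also removes any fully identical rows.
    return group[~mask].drop_duplicates(keep='first')

df = pd.read_excel('工作量计算.xlsx')
# Sort so that diff() sees each group's rows in ascending end-time order.
df = df.sort_values(["编号", "环节", "审核人", "金额", "结束时间"])
result = df.groupby(['编号', '环节', '审核人', '金额'], as_index=False).apply(filter_rows).droplevel(0)
result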

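This version replaces the explicit loop with diff() and a boolean mask, which is more concise. It is not strictly equivalent to the loop, though. diff() compares each row with the row immediately before it, whereas the loop compares each row with the last row that was kept. For end times at 0, 15, and 30 seconds, the loop keeps the rows at 0 and 30 seconds, while the diff() version keeps only the row at 0 seconds. The boundary also differs: the loop drops a row that ends exactly 20 seconds after the last kept row, and the mask keeps it. On data without such chains or exact 20-second gaps, both versions return the same rows.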
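The two comparison rules can be put side by side on a single group. The sketch below is an illustration only: keep_vs_last_kept and keep_vs_previous are hypothetical helper names, and the three timestamps are made up. The first helper mirrors the loop in func, the second mirrors the diff() mask.

```python
import pandas as pd

def keep_vs_last_kept(times, window=20):
    """Loop rule: keep a row if it ends more than `window` s after the last kept row."""
    kept, last = [], None
    for idx, t in times.items():
        if last is None or (t - last).total_seconds() > window:
            kept.append(idx)
            last = t
    return kept

def keep_vs_previous(times, window=20):
    """diff() rule: drop a row if it ends less than `window` s after the previous row."""
    gap = times.diff().dt.total_seconds()
    return times[~(gap < window)].index.tolist()

# One group, already sorted, with end times 15 seconds apart.
times = pd.Series(pd.to_datetime([
    "2024-01-01 09:00:00",
    "2024-01-01 09:00:15",
    "2024-01-01 09:00:30",
]))

print(keep_vs_last_kept(times))  # [0, 2]
print(keep_vs_previous(times))   # [0]
```

Which rule is right depends on how the 20-second requirement is meant to apply to chains of close records, so it is worth checking both against a few known cases before picking one.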

Conclusion

The article demonstrates how to solve a real‑world data‑processing problem with pandas, offering three progressively refined implementations that filter rows based on a 20‑second time gap within grouped data.
