How to Filter Duplicate Rows Within 20 Seconds Using Pandas
This article walks through a Python pandas solution that groups a dataset by multiple fields, orders rows by end time, and retains only the first record when successive end times are within a 20‑second window, complete with optimized code snippets and explanations.
Hello, I'm PiPi. This article addresses a Python automation task: a dataset needs to be grouped by ID, step, reviewer, and amount, then sorted by end time, and within each group only the first record is kept when consecutive end times are no more than 20 seconds apart.
Problem Description
The source table contains columns: 编号 (ID), 环节 (step), 审核人 (reviewer), 金额 (amount), and 结束时间 (end time). The requirement is to group by the first four columns, sort each group by 结束时间 in ascending order, and if two rows in the same group have end times within 20 seconds, keep only the earliest row.
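To make the rule concrete, here is a small sketch with invented sample rows (the values are assumptions for illustration, not taken from the source workbook). It prints the gap, in seconds, between each row and the previous row of the same group; rows whose gap is 20 seconds or less are the ones the requirement asks to drop.

```python
import pandas as pd

# Hypothetical sample data; every value here is invented for illustration.
sample = pd.DataFrame({
    "编号": ["A001", "A001", "A001", "A001", "A002"],
    "环节": ["初审", "初审", "初审", "初审", "初审"],
    "审核人": ["张三", "张三", "张三", "张三", "李四"],
    "金额": [100, 100, 100, 100, 200],
    "结束时间": pd.to_datetime([
        "2023-01-01 09:00:00",  # first row of its group -> kept
        "2023-01-01 09:00:05",  # 5 s after the previous row -> dropped
        "2023-01-01 09:00:15",  # 10 s after the previous row -> dropped
        "2023-01-01 09:01:00",  # 45 s after the previous row -> kept
        "2023-01-01 09:00:03",  # first row of a different group -> kept
    ]),
})

keys = ["编号", "环节", "审核人", "金额"]

# Gap in seconds to the previous row of the same group (NaN for a group's first row).
gap = (
    sample.sort_values(keys + ["结束时间"])
    .groupby(keys)["结束时间"]
    .diff()
    .dt.total_seconds()
)

print(sample.assign(gap_seconds=gap))
```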
Initial Implementation
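import pandas as pd

def func(df_split):
    # Walk one group in ascending end-time order and collect the rows to keep.
    last_time = None
    idx = []
    for row in df_split.itertuples():
        # Keep the row if it is the first in its group, or if it ends more
        # than 20 seconds after the previous row.
        if last_time is None or (row.结束时间 - last_time).total_seconds() > 20:
            idx.append(row.Index)
        # Always compare against the immediately preceding row.
        last_time = row.结束时间
    return df_split.loc[idx]

df = pd.read_excel("工作量计算.xlsx")
df.sort_values(["编号", "环节", "审核人", "金额", "结束时间"]).groupby(
    ["编号", "环节", "审核人", "金额"], as_index=False).apply(func)

The result matches the reference solution and leaves 3394 rows.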
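One consequence of comparing each row with its immediate predecessor is that a run of closely spaced rows collapses to its first row, even when the run as a whole spans more than 20 seconds. The sketch below assumes that reading of the loop (last_time is updated on every row, kept or not) and uses invented timestamps in a single group.

```python
import pandas as pd

def func(df_split):
    last_time = None
    idx = []
    for row in df_split.itertuples():
        if last_time is None or (row.结束时间 - last_time).total_seconds() > 20:
            idx.append(row.Index)
        # Assumption: last_time is updated for every row, kept or not.
        last_time = row.结束时间
    return df_split.loc[idx]

# Invented single-group data: rows end 0 s, 15 s, 30 s and 60 s after 09:00:00.
group = pd.DataFrame({
    "结束时间": pd.to_datetime([
        "2023-01-01 09:00:00",
        "2023-01-01 09:00:15",
        "2023-01-01 09:00:30",
        "2023-01-01 09:01:00",
    ]),
})

kept = func(group)
print(kept)
```

Here the row at 09:00:30 is dropped although it ends 30 seconds after the first kept row, because it ends only 15 seconds after its predecessor. If the intended rule were "more than 20 seconds after the last kept row", last_time would have to be updated only when a row is kept.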
Optimized Version
import pandas as pd

def func(df_split):
    # Walk one group in ascending end-time order and collect the rows to keep.
    last_time = None
    idx = []
    for row in df_split.itertuples():
        if last_time is None or (row.结束时间 - last_time).total_seconds() > 20:
            idx.append(row.Index)
        # Always compare against the immediately preceding row.
        last_time = row.结束时间
    return df_split.loc[idx]

df = pd.read_excel("工作量计算.xlsx")
result = (
    df.sort_values(["编号", "环节", "审核人", "金额", "结束时间"])
    .groupby(["编号", "环节", "审核人", "金额"], as_index=False)
    .apply(func)
    .droplevel(0)
)
result

Adding .droplevel(0) removes the extra outer index level that groupby(...).apply(...) places in front of the original row index.
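To see what the extra level looks like, the sketch below runs the same pipeline on a few invented rows and inspects the index before and after droplevel(0). It also shows group_keys=False, a standard groupby parameter that is not used in the original article, as one possible way to avoid creating the extra level in the first place. Treat this as an illustration under those assumptions; the exact behavior of groupby().apply() has shifted between pandas releases, so it is worth checking on your own version.

```python
import pandas as pd

def func(df_split):
    last_time = None
    idx = []
    for row in df_split.itertuples():
        if last_time is None or (row.结束时间 - last_time).total_seconds() > 20:
            idx.append(row.Index)
        last_time = row.结束时间
    return df_split.loc[idx]

# Hypothetical sample data; every value here is invented for illustration.
sample = pd.DataFrame({
    "编号": ["A001", "A001", "A001", "A001", "A002"],
    "环节": ["初审"] * 5,
    "审核人": ["张三", "张三", "张三", "张三", "李四"],
    "金额": [100, 100, 100, 100, 200],
    "结束时间": pd.to_datetime([
        "2023-01-01 09:00:00",
        "2023-01-01 09:00:05",
        "2023-01-01 09:00:15",
        "2023-01-01 09:01:00",
        "2023-01-01 09:00:03",
    ]),
})

keys = ["编号", "环节", "审核人", "金额"]
ordered = sample.sort_values(keys + ["结束时间"])

# As in the article: apply() returns a two-level index (group position, original label).
nested = ordered.groupby(keys, as_index=False).apply(func)
print(nested.index.nlevels)

# Dropping the outer level restores the original row labels.
flat_a = nested.droplevel(0)

# Alternative (not from the article): ask groupby not to add the group level.
flat_b = ordered.groupby(keys, group_keys=False).apply(func)

print(flat_a.index.tolist())
print(flat_b.index.tolist())
```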
Further Simplification
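import pandas as pd

def filter_rows(group):
    # Time difference to the previous row in the group; NaT for the first row.
    diff = group.结束时间.diff()
    # True where a row ends within 20 seconds of its predecessor. Using <= matches
    # the loop version, which keeps a row only when the gap exceeds 20 seconds.
    # The first row's NaT compares as False, so it is always kept.
    mask = diff.dt.total_seconds() <= 20
    # drop_duplicates is a safeguard; an exact duplicate has a 0-second gap
    # and is normally removed by the mask already.
    return group[~mask].drop_duplicates(keep='first')

df = pd.read_excel('工作量计算.xlsx')
df = df.sort_values(["编号", "环节", "审核人", "金额", "结束时间"])
result = df.groupby(['编号', '环节', '审核人', '金额'], as_index=False).apply(filter_rows).droplevel(0)
result

This version uses diff() and a boolean mask to achieve the same filtering more concisely.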
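The same idea can also be written without apply() at all, by taking the per-group diff() directly on the end-time column. This is a sketch of an alternative, not the article's own code: the helper name filter_close_rows and its parameters are hypothetical, the sample rows are invented, and a gap of exactly 20 seconds is treated as "within 20 seconds", matching the loop version.

```python
import pandas as pd

def filter_close_rows(df, keys, time_col="结束时间", window_seconds=20):
    """Keep a row only if it is first in its group or ends more than
    window_seconds after the previous row of the same group.
    (Hypothetical helper, not part of the original article.)"""
    ordered = df.sort_values(keys + [time_col])
    # Per-group gap to the previous row, in seconds; NaN for each group's first row.
    gap = ordered.groupby(keys)[time_col].diff().dt.total_seconds()
    # NaN <= window_seconds evaluates to False, so first rows survive the filter.
    return ordered[~(gap <= window_seconds)]

# Hypothetical, deliberately unsorted sample data.
sample = pd.DataFrame({
    "编号": ["A001", "B001", "A001", "A001"],
    "环节": ["初审", "初审", "初审", "初审"],
    "审核人": ["张三", "李四", "张三", "张三"],
    "金额": [100, 200, 100, 100],
    "结束时间": pd.to_datetime([
        "2023-01-01 09:00:45",  # 25 s after the 09:00:20 row -> kept
        "2023-01-01 09:00:10",  # only row in its group -> kept
        "2023-01-01 09:00:00",  # first row of group A001 -> kept
        "2023-01-01 09:00:20",  # exactly 20 s after 09:00:00 -> dropped
    ]),
})

result = filter_close_rows(sample, ["编号", "环节", "审核人", "金额"])
print(result)
```

Because no function is applied per group, the result keeps the original single-level row index, so no droplevel(0) step is needed.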
Conclusion
The article demonstrates how to solve a real‑world data‑processing problem with pandas, offering three progressively refined implementations that filter rows based on a 20‑second time gap within grouped data.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!