5 Practical Pandas Techniques to De‑duplicate Time‑Stamped Excel Data
This article demonstrates five pandas‑based methods for extracting unique hourly records from an Excel file, showing how to manipulate datetime columns, drop duplicates, and save the cleaned data, while comparing the strengths and pitfalls of each approach.
Introduction
In a Python community chat, a user asked how to automate office tasks by extracting unique hourly entries from an Excel dataset: a spreadsheet with a datetime column named SampleTime containing several readings per hour, from which only one record per hour should be kept. (The original data screenshots are not reproduced here.)
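Since the screenshots are unavailable, readers who want to follow along can build a minimal synthetic stand-in with the same shape. The column name SampleTime comes from the code below; the timestamps and values are invented for illustration:

```python
import pandas as pd

# Invented sample data: a SampleTime column with several readings per hour.
df = pd.DataFrame({
    'SampleTime': pd.to_datetime([
        '2021-09-01 08:05:00', '2021-09-01 08:35:00',
        '2021-09-01 09:10:00', '2021-09-01 09:50:00',
        '2021-09-02 08:20:00',
    ]),
    'Value': [1.2, 1.3, 0.9, 1.1, 1.0],
})
```

Saving this with `df.to_excel('数据.xlsx', index=False)` (which requires openpyxl) lets the snippets below run verbatim.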
Implementation Process
Two contributors offered five pandas-based solutions, plus a sixth approach using openpyxl.
Method 1: Extract date and hour, then drop duplicates
import pandas as pd
excel_filename = '数据.xlsx'
df = pd.read_excel(excel_filename)
# Method 1: extract the date and hour into separate columns, then drop duplicates on both
df['day'] = df['SampleTime'].dt.date   # extract the full date; .dt.day alone would merge, e.g., Jan 1 and Feb 1
df['hour'] = df['SampleTime'].dt.hour  # extract the hour
df = df.drop_duplicates(subset=['day', 'hour'])  # keep one row per date/hour pair
# Save the filtered result to an Excel file
df.to_excel('数据筛选结果2.xlsx')
Method 2: Replace minutes and seconds with zero
import pandas as pd
excel_filename = '数据.xlsx'
df = pd.read_excel(excel_filename)
# Method 2: zero out the minutes and seconds of each timestamp
SampleTime_new = df['SampleTime'].map(lambda x: x.replace(minute=0, second=0))
df = df[~SampleTime_new.duplicated()]  # keep the first record of each hour
print(df)
# Save the filtered result to an Excel file
df.to_excel('数据筛选结果2.xlsx')
Method 3: Floor datetime to the hour
import pandas as pd
excel_filename = '数据.xlsx'
df = pd.read_excel(excel_filename)
# Method 3: floor each timestamp down to hourly resolution
SampleTime_new = df['SampleTime'].dt.floor(freq='H')
df = df[~SampleTime_new.duplicated()]  # keep the first record of each hour
print(df)
# Save the filtered result to an Excel file
df.to_excel('数据筛选结果2.xlsx')
Method 4: Convert datetime to hourly period
import pandas as pd
excel_filename = '数据.xlsx'
df = pd.read_excel(excel_filename)
# Method 4: convert each timestamp to an hourly period
SampleTime_new = df['SampleTime'].dt.to_period(freq='H')
df = df[~SampleTime_new.duplicated()]  # keep the first record of each hour
print(df)
# Save the filtered result to an Excel file
df.to_excel('数据筛选结果2.xlsx')
Method 5: Reformat datetime and drop duplicates on the new column
import pandas as pd
excel_filename = '数据.xlsx'
df = pd.read_excel(excel_filename)
# Method 5: reformat the datetime as a string and drop duplicates on it (adds a new column)
df['new'] = df['SampleTime'].dt.strftime('%Y-%m-%d %H')
df = df.drop_duplicates(subset=['new'])
print(df)
# Save the filtered result to an Excel file
df.to_excel('数据筛选结果2.xlsx')
Method 6: Use openpyxl to process the file
from openpyxl import load_workbook, Workbook

# Open the source workbook and its active worksheet
workbook = load_workbook('数据.xlsx')
sheet = workbook.active
time_column = sheet['C']
row_lst = []
date_lst = []
hour_lst = []
# Collect every distinct date that appears in the time column
for cell in time_column:
    if cell.value != "SampleTime" and cell.value is not None:
        if cell.value.date() not in date_lst:
            date_lst.append(cell.value.date())
# For each date, record the row number of the first cell seen for each hour
for date in date_lst:
    for cell in time_column:
        if cell.value != "SampleTime" and cell.value is not None:
            if cell.value.date() == date:
                if cell.value.hour not in hour_lst:
                    hour_lst.append(cell.value.hour)
                    row_lst.append(cell.row)
    hour_lst = []
# Recreate the original header (row 1)
header = sheet[1]
header_lst = []
for cell in header:
    header_lst.append(cell.value)
new_workbook = Workbook()
new_sheet = new_workbook.active
new_sheet.append(header_lst)
# Copy each matching row from the old sheet into the new one by row number
for row in row_lst:
    data_lst = []
    for cell in sheet[row]:
        data_lst.append(cell.value)
    new_sheet.append(data_lst)
# Finally, remember to save
new_workbook.save('新表.xlsx')
print("New workbook with the matching rows saved!")
Brief Summary
The first five methods share the same underlying idea: map each timestamp to its calendar hour, then keep the first record per hour. Methods 1, 2, and 5 do this by discarding or zeroing out the minutes and seconds, while methods 3 and 4 map each timestamp to its containing hour with floor and to_period respectively. The snippet
df['new'] = df['SampleTime'].dt.strftime('%Y-%m-%d %H:00:00')
illustrates the equivalence: formatting with the minutes and seconds forced to zero yields the same hourly keys as flooring.
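The equivalence can be checked directly on a small synthetic series (the timestamps below are invented; they are not the original data):

```python
import pandas as pd

# Hypothetical timestamps with two readings in the same hour.
s = pd.to_datetime(pd.Series([
    '2021-09-01 08:05:00',
    '2021-09-01 08:35:00',
    '2021-09-01 09:10:00',
]))

mask_floor = ~s.dt.floor('h').duplicated()                  # method 3's key
mask_period = ~s.dt.to_period('h').duplicated()             # method 4's key
mask_strftime = ~s.dt.strftime('%Y-%m-%d %H').duplicated()  # method 5's key

# All three keep the same rows: the first record of each hour.
assert mask_floor.equals(mask_period) and mask_floor.equals(mask_strftime)
print(mask_floor.tolist())  # [True, False, True]
```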
Methods 2 and 3 were contributed by "月神", while methods 1, 4, and 5 came from "瑜亮老师". "月神" used floor, which truncates toward the start of the hour; using ceil instead would round timestamps up into the next hour, merging them with genuine next-hour records and losing data.
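A quick sketch of that difference, using an illustrative timestamp:

```python
import pandas as pd

t = pd.Timestamp('2021-09-01 08:35:00')  # illustrative timestamp
print(t.floor('h'))  # 2021-09-01 08:00:00 -- stays within hour 8
print(t.ceil('h'))   # 2021-09-01 09:00:00 -- rounded up into hour 9
```

After ceil, the 08:35 reading carries the same hourly key as a genuine 09:xx reading, so duplicate removal would keep only one of the two.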
Method 6 demonstrates a manual approach using openpyxl: iterate over the dates, record the first row seen for each hour of each day, and write those rows to a new workbook.
Conclusion
This article shares five effective pandas techniques for extracting unique hourly records from an Excel file and an additional openpyxl method, encouraging readers to experiment with alternative implementations and contribute their own solutions.
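One such alternative, not from the original article, is a groupby-based sketch: group the frame on the floored hour and keep the first row of each group (the sample data here is invented):

```python
import pandas as pd

# Invented sample data, a stand-in for the original spreadsheet.
df = pd.DataFrame({
    'SampleTime': pd.to_datetime([
        '2021-09-01 08:05:00', '2021-09-01 08:35:00', '2021-09-01 09:10:00',
    ]),
    'Value': [1.2, 1.3, 0.9],
})

# Keep the first record of each calendar hour.
hourly = df.groupby(df['SampleTime'].dt.floor('h')).first().reset_index(drop=True)
print(hourly)
```

Unlike drop_duplicates, groupby sorts the output by the hourly key; for data already sorted by time, the result is the same.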
