
5 Efficient Pandas Techniques to Remove Duplicate Dates and Hours from Excel

This article walks through five pandas-based methods—and an optional openpyxl approach—to extract unique date‑hour entries from an Excel file, explaining each technique, its code, and the nuances of handling datetime rounding and duplicate removal.


1. Introduction

This article answers a question raised in a Python automation group: how to extract unique date‑hour records from an Excel file using pandas.

2. Implementation

Five pandas‑based methods are presented:

Method 1: Extract day and hour, then drop duplicates

import pandas as pd

excel_filename = '数据.xlsx'
df = pd.read_excel(excel_filename)
# Extract day and hour
df['day'] = df['SampleTime'].dt.day
df['hour'] = df['SampleTime'].dt.hour
# Drop duplicates based on day and hour
df = df.drop_duplicates(subset=['day', 'hour'])
# Save result
df.to_excel('数据筛选结果2.xlsx')
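One caveat with this method: deduplicating on day and hour alone ignores the month and year, so identical day‑hour pairs from different months collide. A minimal illustration with hypothetical data (including the full date in the subset avoids the problem):

```python
import pandas as pd

# Hypothetical sample standing in for the Excel file's SampleTime column
df = pd.DataFrame({
    'SampleTime': pd.to_datetime(['2023-01-05 10:15', '2023-02-05 10:45'])
})
df['day'] = df['SampleTime'].dt.day
df['hour'] = df['SampleTime'].dt.hour

# Both rows have day=5 and hour=10, so one is dropped even though
# they fall in different months
deduped = df.drop_duplicates(subset=['day', 'hour'])
print(len(deduped))  # -> 1

# Including the full date avoids the collision
df['date'] = df['SampleTime'].dt.date
safe = df.drop_duplicates(subset=['date', 'hour'])
print(len(safe))  # -> 2
```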

Method 2: Replace minutes and seconds with zero

import pandas as pd

excel_filename = '数据.xlsx'
df = pd.read_excel(excel_filename)
SampleTime_new = df['SampleTime'].map(lambda x: x.replace(minute=0, second=0))
# Keep only the first row for each hour-truncated timestamp
# (the original assigned the filtered frame to a separate variable
# but then printed and saved the unfiltered df)
df = df[~SampleTime_new.duplicated()]
print(df)
df.to_excel('数据筛选结果2.xlsx')

Method 3: Floor datetime to the hour

import pandas as pd

excel_filename = '数据.xlsx'
df = pd.read_excel(excel_filename)
SampleTime_new = df['SampleTime'].dt.floor(freq='h')  # lowercase 'h'; 'H' is deprecated in pandas 2.2+
df = df[~SampleTime_new.duplicated()]
print(df)
df.to_excel('数据筛选结果2.xlsx')

Method 4: Convert datetime to period hour

import pandas as pd

excel_filename = '数据.xlsx'
df = pd.read_excel(excel_filename)
SampleTime_new = df['SampleTime'].dt.to_period(freq='h')  # lowercase 'h'; 'H' is deprecated in pandas 2.2+
df = df[~SampleTime_new.duplicated()]
print(df)
df.to_excel('数据筛选结果2.xlsx')

Method 5: Reformat datetime string and drop duplicates

import pandas as pd

excel_filename = '数据.xlsx'
df = pd.read_excel(excel_filename)
df['new'] = df['SampleTime'].dt.strftime('%Y-%m-%d %H')
df = df.drop_duplicates(subset=['new'])
print(df)
df.to_excel('数据筛选结果2.xlsx')

These methods share a common pattern: normalize each timestamp to hour precision (by extracting components, zeroing the minutes and seconds, flooring, converting to an hourly period, or reformatting as a string), then keep only the first row per hour. Methods 2 through 5 all truncate downward to the containing hour; Method 1 uses only the day-of-month and hour, so identical day‑hour pairs from different months would collide. Flooring is also the safer rounding choice: dt.ceil would shift timestamps forward into the next hour, attributing rows to the wrong hour and potentially dropping the wrong ones.
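To see that flooring and period conversion produce the same grouping, here is a minimal sketch with an in-memory DataFrame standing in for the Excel file (hypothetical timestamps):

```python
import pandas as pd

# Hypothetical sample data standing in for the SampleTime column
df = pd.DataFrame({
    'SampleTime': pd.to_datetime([
        '2023-01-05 10:15:00',
        '2023-01-05 10:45:00',  # same hour as the row above -> duplicate
        '2023-01-05 11:05:00',
        '2023-02-05 10:20:00',  # same day-of-month and hour, different month
    ])
})

# Floor to the hour: Timestamp values with minutes/seconds zeroed
floored = df['SampleTime'].dt.floor('h')
by_floor = df[~floored.duplicated()]

# Convert to hourly periods: Period values, same grouping
periods = df['SampleTime'].dt.to_period('h')
by_period = df[~periods.duplicated()]

print(len(by_floor), len(by_period))  # both keep 3 rows
```

Note that Method 1's day‑hour subset would keep only two of these rows, since the January 5 and February 5 10:xx entries share the same day number and hour.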

Method 6: Using openpyxl

An alternative approach using openpyxl was attempted but only extracted 24 rows; the code is retained for future reference.

from openpyxl import load_workbook, Workbook
from datetime import datetime

# Open workbook
workbook = load_workbook('数据.xlsx')
# Active sheet
sheet = workbook.active
time_column = sheet['C']
row_lst = []
date_lst = []
hour_lst = []
# First pass: collect every distinct date in the column
for cell in time_column:
    if cell.value != "SampleTime" and cell.value is not None:
        if cell.value.date() not in date_lst:
            date_lst.append(cell.value.date())

# Second pass: for each date, record the row of the first cell seen
# for each distinct hour
for date in date_lst:
    for cell in time_column:
        if cell.value != "SampleTime" and cell.value is not None:
            if cell.value.date() == date:
                if cell.value.hour not in hour_lst:
                    hour_lst.append(cell.value.hour)
                    row_lst.append(cell.row)
    hour_lst = []  # reset the seen hours for the next date

# Create new workbook and copy header
new_workbook = Workbook()
new_sheet = new_workbook.active
header = sheet[1]
header_lst = [cell.value for cell in header]
new_sheet.append(header_lst)

# Copy selected rows
for row in row_lst:
    data_lst = [cell.value for cell in sheet[row]]
    new_sheet.append(data_lst)

new_workbook.save('新表.xlsx')
print("New workbook with the matching rows saved!")

This method makes two passes over the time column: it first collects every distinct date, then, for each date, keeps the row of the first cell seen in each distinct hour, and finally writes the selected rows to a new Excel file.
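The same row-selection logic can be sketched without openpyxl. This single-pass variant (a simplification of the article's two-pass loop, using a hypothetical list of datetimes in place of column C) keeps the index of the first entry for each (date, hour) pair:

```python
from datetime import datetime

# Hypothetical timestamps standing in for the SampleTime column
timestamps = [
    datetime(2023, 1, 5, 10, 15),
    datetime(2023, 1, 5, 10, 45),  # duplicate: same date and hour
    datetime(2023, 1, 5, 11, 5),
    datetime(2023, 1, 6, 10, 20),
]

# Keep the index of the first timestamp seen for each (date, hour) pair
seen = set()
kept_rows = []
for i, ts in enumerate(timestamps):
    key = (ts.date(), ts.hour)
    if key not in seen:
        seen.add(key)
        kept_rows.append(i)

print(kept_rows)  # -> [0, 2, 3]
```

Using a set keyed on (date, hour) avoids rescanning the column once per date, which is what makes the original version quadratic in the number of rows.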

3. Conclusion

The article demonstrates practical pandas (and optional openpyxl) techniques for filtering Excel data by unique date‑hour combinations and saving the cleaned results, offering multiple viable solutions for Python‑based data deduplication.

Python · Deduplication · datetime · pandas · data-cleaning
Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
