Data Cleaning with Python: Removing Duplicates, Blank Rows, and Formatting Dates and Numbers
This tutorial demonstrates how to use Python and pandas to clean Excel data by removing duplicate and empty rows, stripping spaces, reformatting date strings, and standardising numeric values with two‑decimal precision and currency symbols, providing complete code examples for each step.
Many developers say that since Python and pandas arrived, they rarely use Excel because handling and visualising tabular data becomes much faster.
Below are several practical examples.
1. Delete duplicate rows and empty rows
We convert the current data to a dictionary using dict.fromkeys , which automatically removes duplicate keys, then convert the result back to a list.
<code>In [135]:
for row in rows4:
print(row)
('name', 'address')
('tom li', 'beijing')
('tom li', 'beijing')
('', )
('mary wang', 'shandong')
('mary wang', 'shandong')
('', )
('de8ug', 'guangzhou')
In [148]:
dict.fromkeys(rows4)
Out[148]:{('name', 'address'): None,
('tom li', 'beijing'): None,
('', ): None,
('mary wang', 'shandong'): None,
('de8ug', 'guangzhou'): None}
In [137]:
list(dict.fromkeys(rows4))
Out[137]:[('name', 'address'),
('tom li', 'beijing'),
('', ),
('mary wang', 'shandong'),
('de8ug', 'guangzhou')]
</code>At this point the duplicate data is removed. Note that dict in Python 3 preserves insertion order; if you are still using Python 2 or a version earlier than 3.5, consider upgrading.
Next we handle empty rows. Since ('',) is a tuple with an empty string as its first element, we can filter it out with a comprehension that keeps only rows whose first element length is greater than 1.
<code>In [179]:
list(x for x in dict.fromkeys(rows4) if len(x[0])>1)
Out[179]:[('name', 'address'),
('tom li', 'beijing'),
('mary wang', 'shandong'),
('de8ug', 'guangzhou')]
</code>We then wrap the logic into a function that extracts all rows, clears the sheet, and writes the cleaned rows back.
<code>def handle_duplicate(wb, sheetname):
"""
Remove duplicate rows and empty rows.
First retrieve each row, clear the sheet, then write back the cleaned data.
"""
print(f'开始处理工作表:{sheetname}'.center(18, '-'))
sheet = wb[sheetname]
old_rows = [x for x in sheet.values]
print('修改前:', old_rows)
new_rows = list(x for x in dict.fromkeys(old_rows) if len(x[0])>1)
print('修改后-》》', new_rows)
# Delete all rows
sheet.delete_rows(1, sheet.max_row)
# Write new data
for row in new_rows:
sheet.append(row)
</code>Run the test and verify the result. Remember to debug any errors that appear.
<code>wb = load_data()
handle_duplicate(wb, '重复行')
save_as(wb)
</code>2. Delete spaces
To remove spaces inside a string we can use split followed by ''.join . This eliminates all whitespace without using strip , which only trims the ends.
<code>a = "a b c "
a.strip()
# 'a b c'
a.split()
# ['a', 'b', 'c']
''.join(a.split())
# 'abc'
</code>We encapsulate this logic in a function called handle_blank .
<code>def handle_blank(wb, sheetname):
"""
Iterate over columns and remove spaces from each cell.
"""
print(f'开始处理工作表:{sheetname}'.center(18, '-'))
sheet = wb[sheetname]
for col in sheet.iter_cols():
for cell in col:
print('修改前:', cell.value, end='')
cell.value = ''.join(cell.value.split())
print('修改后-》》', cell.value)
</code> <code>handle_blank(wb, '空格')
</code>3. Modify date and time formats
Sometimes we need to reformat date cells. Using Python's datetime module we can parse month/day strings and output them in the desired format such as %Y-%m-%d or %x .
<code>import datetime
d = datetime.datetime(2019, 1, 11)
print(d.strftime("%x")) # 01/11/19
print(d.strftime("%Y-%m-%d")) # 2019-01-11
</code>We wrap this into a handle_time function that processes the first column starting from the second row.
<code>def handle_time(wb, sheetname):
"""
Iterate over the first column and reformat date strings.
"""
print(f'开始处理工作表:{sheetname}'.center(18, '-'))
sheet = wb[sheetname]
for col in sheet.iter_cols(max_col=1, min_row=2):
for cell in col:
print('修改前:', cell.value, end='')
m, d = cell.value.split('/')
day = datetime.datetime(2019, int(m), int(d))
cell.value = day.strftime("%Y-%m-%d")
print('修改后-》》', cell.value)
</code> <code>wb = load_data()
handle_time(wb, '时间')
save_as(wb)
</code>4. Fix numbers and symbols
For monetary values we often need two‑decimal precision and a currency symbol. The Decimal class can enforce fixed precision, while round may drop trailing zeros.
<code>from decimal import Decimal
a = 3.1
b = Decimal(a).quantize(Decimal("0.00"))
print(b) # 3.10
print(round(a, 2)) # 3.1 (trailing zero omitted)
</code>We create a handle_num function that formats numbers in the third column, adding a Chinese yuan symbol.
<code>def handle_num(wb, sheetname):
"""
Iterate over the third column and format numbers with two decimals and a currency prefix.
"""
print(f'开始处理工作表:{sheetname}'.center(18, '-'))
sheet = wb[sheetname]
for col in sheet.iter_cols(min_col=3, max_col=3, min_row=2):
for cell in col:
print('修改前:', cell.value, end='')
cell.value = '¥' + str(Decimal(cell.value).quantize(Decimal("0.00")))
print('修改后-》》', cell.value)
</code> <code>wb = load_data()
handle_num(wb, '数字符号')
save_as(wb)
</code>All functions should be tested thoroughly to ensure correct data cleaning.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.