How to Remove Duplicate Rows in Pandas While Keeping the Highest Value
This article explains how to use pandas' sort_values() and drop_duplicates() methods to remove duplicate entries from a DataFrame while keeping the row with the maximum age (or another sort criterion). It includes several code examples and a summary of the main sort_values() parameters.
Introduction
During a Python community discussion, a user asked how to use pandas to delete duplicate rows while keeping the record with the highest age. The solution involves sorting the DataFrame and then dropping duplicates.
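To see why the sort is needed, here is a minimal sketch of what drop_duplicates() does on its own, using the same sample rows as the article's examples. The variable names df and first_seen are illustrative only. With the default keep='first', the first row encountered for each name survives, regardless of its age.

```python
import pandas as pd

# Same sample rows as the article's examples
df = pd.DataFrame([
    {'name': '小明', 'age': 18},
    {'name': '小张', 'age': 20},
    {'name': '小明', 'age': 20},
    {'name': '小明', 'age': 38},
])

# keep='first' is the default: the first row seen for each name is kept,
# so 小明 keeps age 18 rather than the largest age, 38
first_seen = df.drop_duplicates('name')
print(first_seen)
```

Because the kept row depends only on row order, putting the largest age first for each name is what makes the result correct.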
Initial Code
import pandas as pd
data = [
    {'name': '小明', 'age': 18},
    {'name': '小张', 'age': 20},
    {'name': '小明', 'age': 20},
    {'name': '小明', 'age': 38},
]
data = pd.DataFrame(data)
# delete duplicate names, keep the one with the largest age
# data = data.drop_duplicates('name', inplace=False)
print(data)
Improved Solution
The suggested approach first sorts the data by the age column in descending order and then drops duplicates based on the name column.
import pandas as pd
data = [
    {'name': '小明', 'age': 18},
    {'name': '小张', 'age': 20},
    {'name': '小明', 'age': 20},
    {'name': '小明', 'age': 38},
]
data = pd.DataFrame(data)
# sort by age descending, then drop duplicate names, keeping the first (largest age)
result = data.sort_values(by="age", ascending=False).drop_duplicates('name')
print(result)
sort_values() Function Overview
The sort_values() function works similarly to SQL's ORDER BY, allowing you to sort a DataFrame by one or more columns.
Key Parameters
by : column name(s) or index level(s) to sort by.
axis : 0 or 'index' (default) to sort rows; 1 or 'columns' to sort columns.
ascending : boolean or list of booleans, default True (ascending order); a list must match the length of by.
inplace : boolean, default False; if True, modifies the DataFrame in place and returns None.
na_position : {'first', 'last'}, default 'last'; where NaN values are placed.
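The parameters above can be sketched together in one call. This is a small illustration, not taken from the original discussion: the frame, its placeholder names 'a' to 'd', and the NaN value are made up to show how a per-column ascending list and na_position interact.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing age and one tie on age
df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'age': [20, np.nan, 38, 20],
    'high': [145, 160, 175, 155],
})

# age descending, then high ascending to break the tie at age 20;
# rows with a missing age are placed first
ordered = df.sort_values(
    by=['age', 'high'],
    ascending=[False, True],
    na_position='first',
)
print(ordered)

# inplace defaults to False, so df itself keeps its original row order
print(df)
```

Passing a list to ascending gives each sort column its own direction, which is useful when the tie-breaker should run the opposite way from the main key.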
Examples
Single‑condition duplicate removal (keep the highest age)
import pandas as pd
data = [
    {'name': '小明', 'age': 18, 'high': 155},
    {'name': '小张', 'age': 20, 'high': 145},
    {'name': '小明', 'age': 38, 'high': 175},
    {'name': '小明', 'age': 38, 'high': 195},
]
data = pd.DataFrame(data)
# two rows for 小明 tie at age 38; sorting by age alone does not say
# which of them should come first, so either could be the one kept
result = data.sort_values('age', ascending=False).drop_duplicates('name')
print(result)
Multi‑condition duplicate removal (age, then height)
import pandas as pd
data = [
    {'name': '小明', 'age': 18, 'high': 155},
    {'name': '小张', 'age': 20, 'high': 145},
    {'name': '小明', 'age': 38, 'high': 175},
    {'name': '小明', 'age': 38, 'high': 195},
]
data = pd.DataFrame(data)
# sort by age, then by height ('high'), both descending, so ties on age
# are resolved in favor of the taller row before duplicates are dropped
result = data.sort_values(['age', 'high'], ascending=False).drop_duplicates('name')
print(result)
Conclusion
The article demonstrates how to apply sort_values() together with drop_duplicates() to effectively clean data by removing duplicate rows while retaining the row with the maximum value of a chosen column.
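As a sanity check on the approach, the kept rows can be compared against the per-name maximum computed a different way. The sketch below uses groupby(...).max(), which is not part of the method discussed in this article and appears here only as an independent cross-check. The names df, kept_age and expected_age are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': '小明', 'age': 18, 'high': 155},
    {'name': '小张', 'age': 20, 'high': 145},
    {'name': '小明', 'age': 38, 'high': 175},
    {'name': '小明', 'age': 38, 'high': 195},
])

# The article's approach: sort descending, then keep the first row per name
result = df.sort_values(['age', 'high'], ascending=False).drop_duplicates('name')

# Cross-check: maximum age per name, computed independently with groupby
expected_age = df.groupby('name')['age'].max()
kept_age = result.set_index('name')['age']

# Exactly one row per name, and each kept age equals that name's maximum
assert result['name'].is_unique
assert kept_age.to_dict() == expected_age.to_dict()
print(result)
```

A check like this is cheap to run on real data and catches mistakes such as sorting in the wrong direction.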