How to Remove Duplicate Rows in Pandas While Keeping the Highest Value

This article explains how to combine pandas' sort_values and drop_duplicates methods to remove duplicate entries from a DataFrame while keeping the row with the maximum age (or another criterion), with several code examples and a summary of the key sort_values parameters.

Python Crawling & Data Mining

Oct 27, 2022

Introduction

During a Python community discussion, a user asked how to use pandas to delete duplicate rows while keeping the record with the highest age. The solution involves sorting the DataFrame and then dropping duplicates.

Initial Code

import pandas as pd

data = [
    {'name': '小明', 'age': 18},
    {'name': '小张', 'age': 20},
    {'name': '小明', 'age': 20},
    {'name': '小明', 'age': 38},
]

data = pd.DataFrame(data)
# Goal: delete duplicate names, keeping the row with the largest age.
# Attempt (left commented out): this keeps the first row per name, not the oldest.
# data = data.drop_duplicates('name', inplace=False)
print(data)
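
For context, here is a minimal sketch of what the commented-out call would do on its own. With its default keep='first', drop_duplicates('name') retains the first occurrence of each name in row order, which for 小明 is the age-18 row rather than the age-38 one. The variable names df and naive are illustrative and not part of the original question.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': '小明', 'age': 18},
    {'name': '小张', 'age': 20},
    {'name': '小明', 'age': 20},
    {'name': '小明', 'age': 38},
])

# Without sorting first, the default keep='first' retains the earliest row per name.
naive = df.drop_duplicates('name')
print(naive)
```

This is why the rows need to be ordered before duplicates are dropped.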

Improved Solution

The suggested approach first sorts the data by the age column in descending order and then drops duplicates based on the name column.

import pandas as pd

data = [
    {'name': '小明', 'age': 18},
    {'name': '小张', 'age': 20},
    {'name': '小明', 'age': 20},
    {'name': '小明', 'age': 38},
]

data = pd.DataFrame(data)
# Sort by age descending, then drop duplicate names; the default keep='first'
# retains the first row per name, which is now the one with the largest age.
result = data.sort_values(by="age", ascending=False).drop_duplicates('name')
print(result)
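
One side effect worth noting: the result comes back in sorted order, carrying the original index labels. The sketch below, which assumes the same sample data, shows two common follow-ups: sort_index() to restore the source order, or reset_index(drop=True) to renumber the rows. The names deduped, restored, and renumbered are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': '小明', 'age': 18},
    {'name': '小张', 'age': 20},
    {'name': '小明', 'age': 20},
    {'name': '小明', 'age': 38},
])

deduped = df.sort_values(by='age', ascending=False).drop_duplicates('name')

# The original index labels survive the sort, so sort_index() restores source order.
restored = deduped.sort_index()

# Alternatively, discard the old labels and renumber from 0.
renumbered = deduped.reset_index(drop=True)

print(restored)
print(renumbered)
```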

sort_values() Method Overview

The DataFrame.sort_values() method works similarly to SQL's ORDER BY, allowing you to sort a DataFrame by one or more columns.

Key Parameters

by : column name or list of column names (index level names also work) to sort by.

axis : 0 or 'index' (default) sorts rows; 1 or 'columns' sorts columns.

ascending : bool or list of bools, default True. Pass a list to set the direction separately for each entry in by.

inplace : bool, default False. If True, sorts the DataFrame in place and returns None.

na_position : {'first', 'last'}, default 'last'. Where NaN values are placed.
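
A short sketch of the parameters above in use. The sample frame, including its missing age, is made up for illustration and does not come from the original discussion.

```python
import pandas as pd

# Hypothetical data with one missing age (stored as NaN).
df = pd.DataFrame({
    'name': ['小明', '小张', '小明', '小张'],
    'age': [18, 20, None, 25],
})

# Descending by age, with the NaN row moved to the top instead of the bottom.
nan_first = df.sort_values(by='age', ascending=False, na_position='first')

# Two sort keys with a direction each: name ascending, age descending within a name.
mixed = df.sort_values(by=['name', 'age'], ascending=[True, False])

print(nan_first)
print(mixed)
```

Because inplace defaults to False, both calls return new DataFrames and leave df unchanged.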

Examples

Single‑condition duplicate removal (keep the highest age)

import pandas as pd

data = [
    {'name': '小明', 'age': 18, 'high': 155},
    {'name': '小张', 'age': 20, 'high': 145},
    {'name': '小明', 'age': 38, 'high': 175},
    {'name': '小明', 'age': 38, 'high': 195},
]

data = pd.DataFrame(data)
result = data.sort_values('age', ascending=False).drop_duplicates('name')
print(result)
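
The example relies on the default keep='first' of drop_duplicates. As a sketch of the other documented options for keep, the same highest-age result can be reached by sorting ascending and keeping the last row per name, while keep=False discards every name that appears more than once. The names via_last and unique_only are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': '小明', 'age': 18, 'high': 155},
    {'name': '小张', 'age': 20, 'high': 145},
    {'name': '小明', 'age': 38, 'high': 175},
    {'name': '小明', 'age': 38, 'high': 195},
])

# Ascending sort puts the highest age last, so keep='last' retains it.
via_last = df.sort_values('age').drop_duplicates('name', keep='last')

# keep=False drops every row whose name occurs more than once.
unique_only = df.drop_duplicates('name', keep=False)

print(via_last)
print(unique_only)
```

Note that two 小明 rows share age 38, so sorting by age alone does not say which of the two should win; the multi-condition variant addresses that.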

Multi‑condition duplicate removal (age first, then height in the high column to break ties)

import pandas as pd

data = [
    {'name': '小明', 'age': 18, 'high': 155},
    {'name': '小张', 'age': 20, 'high': 145},
    {'name': '小明', 'age': 38, 'high': 175},
    {'name': '小明', 'age': 38, 'high': 195},
]

data = pd.DataFrame(data)
result = data.sort_values(['age', 'high'], ascending=False).drop_duplicates('name')
print(result)
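
The tie-break direction can also be chosen per column. As a sketch on the same data, passing a list to ascending keeps the highest age but prefers the smaller height among rows tied on age. The names tallest and shortest are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': '小明', 'age': 18, 'high': 155},
    {'name': '小张', 'age': 20, 'high': 145},
    {'name': '小明', 'age': 38, 'high': 175},
    {'name': '小明', 'age': 38, 'high': 195},
])

# Both keys descending: among the age-38 rows, the taller one (195) comes first.
tallest = df.sort_values(['age', 'high'], ascending=False).drop_duplicates('name')

# Age descending, height ascending: among the age-38 rows, the shorter one (175) wins.
shortest = df.sort_values(['age', 'high'], ascending=[False, True]).drop_duplicates('name')

print(tallest)
print(shortest)
```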

Conclusion

Chaining sort_values() with drop_duplicates() removes duplicate rows while retaining the row with the maximum value of a chosen column: sort in descending order so the wanted row comes first for each name, then let drop_duplicates() keep the first occurrence.
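
If the pattern is needed in several places, it can be wrapped in a small helper. The function below is a hypothetical convenience wrapper, not something pandas provides; its name, its parameters, and the final sort_index() step are assumptions made for illustration.

```python
import pandas as pd


def keep_max_rows(df, key, value_cols):
    """Return one row per `key`, choosing the row with the largest `value_cols`.

    Hypothetical helper: sorts descending by `value_cols`, keeps the first row
    per `key`, then restores the original row order.
    """
    return (
        df.sort_values(value_cols, ascending=False)
          .drop_duplicates(key)
          .sort_index()
    )


df = pd.DataFrame([
    {'name': '小明', 'age': 18, 'high': 155},
    {'name': '小张', 'age': 20, 'high': 145},
    {'name': '小明', 'age': 38, 'high': 175},
    {'name': '小明', 'age': 38, 'high': 195},
])

cleaned = keep_max_rows(df, 'name', ['age', 'high'])
print(cleaned)
```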
