Remove Duplicates and Concatenate Strings in Pandas Grouped Data
This article demonstrates how to de‑duplicate rows and concatenate string values within each id‑type group in a pandas DataFrame using drop_duplicates, groupby, agg, set, unique, and str.join techniques.
Rescue pandas plan (17) – De‑duplicating and concatenating string columns per group
Many users avoid pandas for data manipulation, so this series aims to show why pandas is worth using.
Data requirement: In a dataset with duplicate entries, the fruits column should be grouped by id and type, then the fruit names concatenated with commas while removing duplicate names within each group.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    # 'A' six times, then 'B' six times, then 'C' six times
    'id': np.repeat(['A', 'B', 'C'], 6),
    # 1, 2, 3 cycled six times
    'type': np.tile([1, 2, 3], 6),
    'fruits': ['苹果', '香蕉', '梨', '苹果', '桃子', '西瓜', '香蕉', '梨', '苹果', '西红柿', '梨', '西瓜', '西瓜', '西瓜', '香蕉', '西瓜', '西瓜', '西瓜'],
})
Requirement breakdown
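The task is to merge the fruit records of each (id, type) group into a single comma-separated string, keeping each fruit name only once per group. For example, in group (A, 1) ['苹果', '苹果'] becomes '苹果', and in group (A, 2) ['香蕉', '桃子'] becomes '香蕉, 桃子'.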
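To pin down the expected result before turning to pandas, the per-group rule can be sketched in plain Python. Note that merge_names is a hypothetical helper introduced here for illustration only; it is not part of the original article, and it assumes that the first-seen order of the names should be kept.

```python
# Plain-Python reference for the per-group rule (no pandas involved).
# `merge_names` is a hypothetical helper, not something the article defines.
def merge_names(names, sep=', '):
    """Join names with `sep`, keeping only the first occurrence of each."""
    # dict.fromkeys keeps insertion order, so duplicates are dropped
    # without reshuffling the remaining names.
    return sep.join(dict.fromkeys(names))


print(merge_names(['苹果', '苹果']))   # 苹果
print(merge_names(['香蕉', '桃子']))   # 香蕉, 桃子
```

The pandas solutions that follow apply the same rule to every (id, type) group at once.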
Solution
drop_duplicates
Using drop_duplicates() before grouping removes duplicate rows:
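df.drop_duplicates().groupby(['id', 'type'], as_index=False).agg(lambda x: ', '.join(x))
This removes the repeated "苹果" (apple) row in the (A, 1) group. Because the DataFrame contains only the id, type, and fruits columns, dropping fully duplicated rows is the same as dropping duplicate fruit names within each group.
Within the lambda, x is a Series holding one group's fruits values; str.join accepts any iterable of strings, not only lists.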
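As a runnable sketch of this approach, the snippet below rebuilds the sample data and prints the row counts before and after de-duplication. It uses np.repeat for the id column, which yields the same values as the article's construction; the variable names deduped and result are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': np.repeat(['A', 'B', 'C'], 6),
    'type': np.tile([1, 2, 3], 6),
    'fruits': ['苹果', '香蕉', '梨', '苹果', '桃子', '西瓜',
               '香蕉', '梨', '苹果', '西红柿', '梨', '西瓜',
               '西瓜', '西瓜', '香蕉', '西瓜', '西瓜', '西瓜'],
})

# Step 1: drop rows that repeat an earlier row in every column.
deduped = df.drop_duplicates()
print(len(df), len(deduped))  # 18 14

# Step 2: join the remaining fruit names of each (id, type) group.
result = (
    deduped
    .groupby(['id', 'type'], as_index=False)
    .agg(lambda x: ', '.join(x))
)
print(result)
```

Within each group the names keep the order in which they appear in the original rows.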
set
Duplicates can also be removed after aggregation by converting the grouped values to a set:
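df.groupby(['id', 'type']).agg(set)['fruits'].map(lambda x: ', '.join(x)).reset_index()
Note that agg(set) returns a DataFrame, so the fruits column must be selected before calling map. A set is unordered, so the order of the names in each joined string is not guaranteed; wrap x in sorted() if a stable order matters.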
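The set-based approach can be sketched as follows. This is a variation rather than the article's exact line: it selects the fruits column before aggregating, and it adds sorted() (an assumption about what the reader wants) so that the output is the same on every run.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': np.repeat(['A', 'B', 'C'], 6),
    'type': np.tile([1, 2, 3], 6),
    'fruits': ['苹果', '香蕉', '梨', '苹果', '桃子', '西瓜',
               '香蕉', '梨', '苹果', '西红柿', '梨', '西瓜',
               '西瓜', '西瓜', '香蕉', '西瓜', '西瓜', '西瓜'],
})

# Collapse each group's names into a set, which discards duplicates.
name_sets = df.groupby(['id', 'type'])['fruits'].agg(set)

# Sets have no defined order, so sort before joining to get a
# reproducible string (sorting here is by Unicode code point).
result = (
    name_sets
    .map(lambda names: ', '.join(sorted(names)))
    .reset_index()
)
print(result)
```

Sorting changes the order relative to the source rows, so this variant suits cases where any consistent order is acceptable.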
unique
Another approach uses the unique() method on a Series and str.join():
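df.groupby(['id', 'type'])['fruits'].unique().str.join(', ').reset_index()
After unique(), each value in the resulting Series is an array of the distinct fruit names for one group, in order of first appearance. .str.join(', ') then joins the elements of each array, much like ', '.join applied to every group.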
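To make the two steps visible, the sketch below splits the chain and inspects the intermediate Series. The name uniques is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': np.repeat(['A', 'B', 'C'], 6),
    'type': np.tile([1, 2, 3], 6),
    'fruits': ['苹果', '香蕉', '梨', '苹果', '桃子', '西瓜',
               '香蕉', '梨', '苹果', '西红柿', '梨', '西瓜',
               '西瓜', '西瓜', '香蕉', '西瓜', '西瓜', '西瓜'],
})

# Step 1: one array of distinct names per (id, type) group.
uniques = df.groupby(['id', 'type'])['fruits'].unique()
print(uniques.loc[('A', 2)])  # the distinct names of group (A, 2)

# Step 2: .str.join joins the elements inside each array.
result = uniques.str.join(', ').reset_index()
print(result)
```

Unlike the set-based variant, unique() keeps the names in order of first appearance, so no extra sorting is needed for a stable result.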
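When using apply, what the function receives depends on the GroupBy object. On a SeriesGroupBy such as groupby(...)['fruits'] it receives a Series, so the function cannot look up other columns by name; on a DataFrameGroupBy it receives a DataFrame, where columns can be referenced by name.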
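A small experiment can show what apply passes to the function in each case. The four-row frame and the kind helper below are hypothetical stand-ins introduced for illustration; they are not from the original article.

```python
import pandas as pd

# Reduced stand-in for the article's data.
df = pd.DataFrame({
    'id': ['A', 'A', 'B', 'B'],
    'type': [1, 1, 2, 2],
    'fruits': ['苹果', '苹果', '梨', '西瓜'],
})


def kind(group):
    """Return the class name of whatever pandas passes in."""
    return type(group).__name__


grouped = df.groupby(['id', 'type'])

# Selecting one column with a string gives a SeriesGroupBy.
series_kinds = grouped['fruits'].apply(kind)
# Selecting with a list of columns gives a DataFrameGroupBy.
frame_kinds = grouped[['fruits']].apply(kind)
print(series_kinds.tolist())  # ['Series', 'Series']
print(frame_kinds.tolist())   # ['DataFrame', 'DataFrame']

# With a DataFrame per group, the column can be referenced by name.
joined = grouped[['fruits']].apply(lambda g: ', '.join(g['fruits'].unique()))
print(joined.tolist())        # ['苹果', '梨, 西瓜']
```

Selecting the columns explicitly before apply also keeps the grouping keys out of the data handed to the function.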
Summary
There are multiple ways to de-duplicate and concatenate grouped string data in pandas. This article showed three of them: drop_duplicates before grouping, agg(set) followed by a join, and unique().str.join(). Comparing them deepens understanding of groupby and the related functions for handling differently shaped data.
Asked how many fallen leaves lie before the door, one answers: two fish in the river.
May 27, 2022
Python Crawling & Data Mining
