
Remove Duplicates and Concatenate Strings in Pandas Grouped Data

This article demonstrates how to de‑duplicate rows and concatenate string values within each id‑type group in a pandas DataFrame using drop_duplicates, groupby, agg, set, unique, and str.join techniques.

Python Crawling & Data Mining

Rescue pandas plan (17) – De‑duplicating and concatenating string columns per group

Many users avoid pandas for data manipulation, so this series aims to show why pandas is worth using.

Data requirement: the dataset contains duplicate entries. Group the rows by id and type, then concatenate the fruit names in each group into one comma-separated string, keeping each name only once per group.

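import pandas as pd
import numpy as np

# 18 rows: ids A, B, C repeated six times each; types cycle through 1, 2, 3.
# Fruit names: 苹果 = apple, 香蕉 = banana, 梨 = pear, 桃子 = peach, 西瓜 = watermelon, 西红柿 = tomato.
df = pd.DataFrame({
    'id': np.tile((['A'], ['B'], ['C']), 6).flatten(),
    'type': np.tile((1, 2, 3), 6),
    'fruits': ['苹果', '香蕉', '梨', '苹果', '桃子', '西瓜', '香蕉', '梨', '苹果', '西红柿', '梨', '西瓜', '西瓜', '西瓜', '香蕉', '西瓜', '西瓜', '西瓜']
})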
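The two np.tile calls behave differently, which is easy to miss. The following sketch only inspects the id and type columns built above; the variable names ids, types and sizes are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Tiling a (3, 1) column of labels repeats each label 6 times along its row,
# so flatten() yields A x6, then B x6, then C x6.
ids = np.tile((['A'], ['B'], ['C']), 6).flatten()

# Tiling a flat tuple cycles through it: 1, 2, 3, 1, 2, 3, ...
types = np.tile((1, 2, 3), 6)

print(ids.tolist())
print(types.tolist())

# Each (id, type) pair therefore occurs exactly twice.
sizes = pd.DataFrame({'id': ids, 'type': types}).groupby(['id', 'type']).size()
print(sizes)
```

Every one of the nine groups holds two rows, so a group contains either two distinct fruit names or the same name twice.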

Requirement breakdown

The task is to merge the fruit records of each (id, type) group into a single comma-separated string that lists each name only once (e.g., ['香蕉', '桃子'] → '香蕉, 桃子', and ['苹果', '苹果'] → '苹果').
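Before turning to pandas, the per-group operation can be sketched in plain Python. The list below is a hypothetical group, not taken from the dataset, and dict.fromkeys is just one way to drop repeats while keeping first-appearance order.

```python
# Fruit names of one hypothetical (id, type) group, with a repeat.
names = ['苹果', '香蕉', '苹果']

# dict.fromkeys keeps the first occurrence of each key and preserves order.
joined = ', '.join(dict.fromkeys(names))
print(joined)  # 苹果, 香蕉
```

The pandas solutions below apply this same idea to every group at once.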

Solution

drop_duplicates

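Calling drop_duplicates() before grouping removes rows that are identical in every column. Because id and type are part of the comparison, only repeats within the same group are dropped: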

df.drop_duplicates().groupby(['id', 'type'], as_index=False).agg(lambda x: ', '.join(x))

This removes extra "苹果" entries in the (A, 1) group.

Within the lambda, x is a Series. str.join accepts any iterable of strings, not only lists, so the Series can be passed to it directly.
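As a small check of that point, the sketch below first joins a Series directly, then reruns the drop_duplicates approach with ', '.join passed in place of the lambda. Selecting the fruits column before aggregating is a choice made for this example, not part of the original one-liner.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': np.tile((['A'], ['B'], ['C']), 6).flatten(),
    'type': np.tile((1, 2, 3), 6),
    'fruits': ['苹果', '香蕉', '梨', '苹果', '桃子', '西瓜', '香蕉', '梨', '苹果',
               '西红柿', '梨', '西瓜', '西瓜', '西瓜', '香蕉', '西瓜', '西瓜', '西瓜'],
})

# str.join iterates over the Series values just as it would over a list.
s = pd.Series(['苹果', '香蕉'])
print(', '.join(s))  # 苹果, 香蕉

# Drop fully duplicated rows, then join the remaining names per group.
result = (
    df.drop_duplicates()
      .groupby(['id', 'type'])['fruits']
      .agg(', '.join)
      .reset_index()
)
print(result)
```

The (A, 1) group collapses to a single '苹果', while groups with two different names keep both.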

set

Duplicates can also be removed during aggregation by collecting each group's values into a set:

df.groupby(['id', 'type']).agg(set)['fruits'].map(lambda x: ', '.join(x)).reset_index()

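Note that agg(set) returns a DataFrame, so the fruits column must be selected before calling map, which is a Series method. Sets are unordered, so the order of names in each joined string is not guaranteed to match their order in the data.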
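If a reproducible order matters, one option is to sort each set before joining. The sketch below assumes plain code-point sorting is acceptable; the names grouped and result are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': np.tile((['A'], ['B'], ['C']), 6).flatten(),
    'type': np.tile((1, 2, 3), 6),
    'fruits': ['苹果', '香蕉', '梨', '苹果', '桃子', '西瓜', '香蕉', '梨', '苹果',
               '西红柿', '梨', '西瓜', '西瓜', '西瓜', '香蕉', '西瓜', '西瓜', '西瓜'],
})

# agg(set) yields a DataFrame whose fruits column holds one set per group.
grouped = df.groupby(['id', 'type']).agg(set)
print(type(grouped).__name__)

# sorted() turns each set into a list with a deterministic order before joining.
result = grouped['fruits'].map(lambda names: ', '.join(sorted(names))).reset_index()
print(result)
```

Sorting trades the original appearance order for stability across runs; when appearance order matters, the unique() approach is a better fit.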

unique

Another approach selects the fruits column of the grouped data, calls unique() on it, and joins the result with str.join():

df.groupby(['id', 'type'])['fruits'].unique().str.join(', ').reset_index()

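After unique(), each entry of the resulting Series is an array of that group's distinct names in order of first appearance. Series.str.join() then joins the elements of each entry with the separator, much like calling ', '.join on every entry.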
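The two steps can be looked at separately. This is a sketch on the article's dataset; the intermediate name uniques is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': np.tile((['A'], ['B'], ['C']), 6).flatten(),
    'type': np.tile((1, 2, 3), 6),
    'fruits': ['苹果', '香蕉', '梨', '苹果', '桃子', '西瓜', '香蕉', '梨', '苹果',
               '西红柿', '梨', '西瓜', '西瓜', '西瓜', '香蕉', '西瓜', '西瓜', '西瓜'],
})

# Step 1: one entry per group, each holding that group's distinct names.
uniques = df.groupby(['id', 'type'])['fruits'].unique()
print(uniques)

# Step 2: join the names inside each entry, then restore id and type as columns.
result = uniques.str.join(', ').reset_index()
print(result)
```

Unlike the set-based version, this keeps names in the order they first appear in the data.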

When apply is called on a GroupBy object after a single column has been selected, as in df.groupby(['id', 'type'])['fruits'], the function receives each group's values as a Series, so it cannot refer to other columns by name.
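The sketch below records what the function receives in that case. The helper join_unique and the list seen_types are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': np.tile((['A'], ['B'], ['C']), 6).flatten(),
    'type': np.tile((1, 2, 3), 6),
    'fruits': ['苹果', '香蕉', '梨', '苹果', '桃子', '西瓜', '香蕉', '梨', '苹果',
               '西红柿', '梨', '西瓜', '西瓜', '西瓜', '香蕉', '西瓜', '西瓜', '西瓜'],
})

seen_types = []

def join_unique(values):
    # values holds only the fruits of one group; id and type are not columns here.
    seen_types.append(type(values).__name__)
    return ', '.join(values.unique())

result = df.groupby(['id', 'type'])['fruits'].apply(join_unique).reset_index()
print(set(seen_types))
print(result)
```

Because the function sees only a Series, anything it needs from id or type has to come from the group keys rather than from inside the function.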

Summary

There are several ways to de‑duplicate and concatenate grouped string data in pandas. This article covered three of them: drop_duplicates before grouping, agg(set) followed by a join, and unique().str.join(). Comparing them gives a better feel for how groupby and its related functions handle differently shaped data.

Asked how many fallen leaves lie before the gate, the answer came: two fish in the river.

May 27, 2022

