Extract Highest‑Priority Tags per User with Pandas: Two Easy GroupBy Techniques
This article demonstrates how to use pandas groupby, list aggregation, sorting, deduplication, and explode (or a custom apply function) to retrieve each user's top‑priority tag while preserving other tag columns, comparing two practical implementations and their performance trade‑offs.
Rescue pandas plan (4) – DataFrame Grouping Condition Lookup
Data Requirement
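Each user has one or more judgment tags, ranked in priority order A, B, C, D, … (甲, 乙, 丙, 丁 in the code). For each user, we need the rows carrying that user's highest‑priority judgment tag, with the other‑tags column kept intact. In the code, the columns are 用户 (user), 判断标签 (judgment tag), and 其他标签 (other tags).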
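To make the requirement concrete, here is a minimal sketch of an input table and the rows we would expect to keep. The data values (u1, x1, and so on) are hypothetical and only for illustration; the column names and the priority mapping are taken from the article's code.

```python
import pandas as pd

# Hypothetical sample data; column names match the article's code:
# 用户 = user, 判断标签 = judgment tag, 其他标签 = other tag.
df = pd.DataFrame({
    '用户': ['u1', 'u1', 'u1', 'u2', 'u2', 'u3'],
    '判断标签': ['乙', '甲', '甲', '丙', '乙', '丁'],
    '其他标签': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
})

# Priority order used in the article's sort mapping: 甲 first, then 乙, 丙, 丁.
priority = {'甲': 1, '乙': 2, '丙': 3, '丁': 4}

# Rows we expect to keep: for each user, every row carrying that user's
# highest-priority judgment tag, together with its other-tag value.
expected = pd.DataFrame({
    '用户': ['u1', 'u1', 'u2', 'u3'],
    '判断标签': ['甲', '甲', '乙', '丁'],
    '其他标签': ['x2', 'x3', 'x5', 'x6'],
})

print(df)
print(expected)
```

Note that user u1 keeps two rows, because both carry the top tag 甲; the requirement is about the top tag per user, not a single row per user.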
Requirement Decomposition
Since we need to extract the highest‑priority tag per user, we can group by user and perform the lookup inside each group. Two implementation methods are provided.
Requirement Processing
Method 1
Wrap the "Other Tags" column into a list, then aggregate within each user‑tag group using sum to concatenate the lists.
df['其他标签'] = df['其他标签'].map(lambda x: [x])
Group by user and judgment tag, then sum to concatenate the lists:
df = df.groupby(['用户', '判断标签'], as_index=False)['其他标签'].sum()
Sort the judgment tags by priority, using the key parameter of sort_values (added in pandas 1.1.0) to apply a custom order mapping:
df.sort_values('判断标签', key=lambda x: x.map({'甲':1, '乙':2, '丙':3, '丁':4}), inplace=True)
Remove duplicate rows per user, keeping the first occurrence, which after sorting is the highest‑priority tag:
df.drop_duplicates('用户', inplace=True)
Finally, explode the list column to obtain one row per tag:
df.explode('其他标签')
Method 2
Define a helper function that returns the rows whose judgment tag matches the first (i.e., highest‑priority) tag in the group, then apply it after sorting.
def get_first_label(data):
    """Return the rows whose judgment tag equals the first (top-sorted) tag in the group."""
    return data[data['判断标签'] == data['判断标签'].iloc[0]]
# Sort by tag priority first, so the first row of each group carries the top tag
df.sort_values('判断标签', key=lambda x: x.map({'甲':1, '乙':2, '丙':3, '丁':4}), inplace=True)
# Apply the helper to each user's group
result = df.groupby(['用户']).apply(get_first_label).reset_index(drop=True)
Summary
Grouped lookup is a common data‑processing need. Method 1 uses built‑in pandas aggregation and is generally the faster of the two, although summing Python lists is itself not vectorized, so it is worth timing both methods on your own data. Method 2 is more concise but may be slower on large datasets because groupby.apply invokes a Python function for each group. drop_duplicates is a handy tool for keeping the first occurrence.
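As a quick sanity check, the two methods can be run side by side on a small table to confirm that they keep the same rows. This is a sketch under stated assumptions, not the author's exact script: the sample data and the helper names make_df, method_1, method_2, and as_rows are hypothetical. In method_2 the tag columns are selected before apply and the user key is restored from the index afterwards; that is an adjustment of mine, made because recent pandas versions warn when the applied function operates on the grouping column.

```python
import pandas as pd

# Priority mapping from the article: 甲 ranks highest, then 乙, 丙, 丁.
priority = {'甲': 1, '乙': 2, '丙': 3, '丁': 4}
cols = ['用户', '判断标签', '其他标签']


def make_df():
    """Build a small hypothetical input table."""
    return pd.DataFrame({
        '用户': ['u1', 'u1', 'u1', 'u2', 'u2', 'u3'],
        '判断标签': ['乙', '甲', '甲', '丙', '乙', '丁'],
        '其他标签': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
    })


def method_1(df):
    """Wrap in lists, sum per (user, tag), sort, deduplicate, explode."""
    df = df.copy()
    df['其他标签'] = df['其他标签'].map(lambda x: [x])
    df = df.groupby(['用户', '判断标签'], as_index=False)['其他标签'].sum()
    df = df.sort_values('判断标签', key=lambda s: s.map(priority))
    df = df.drop_duplicates('用户')
    return df.explode('其他标签')


def get_first_label(data):
    """Return the rows whose judgment tag equals the group's first tag."""
    return data[data['判断标签'] == data['判断标签'].iloc[0]]


def method_2(df):
    """Sort by priority, then filter each user's group with apply."""
    df = df.sort_values('判断标签', key=lambda s: s.map(priority))
    picked = df.groupby('用户')[['判断标签', '其他标签']].apply(get_first_label)
    # The user key sits in index level 0; move it back into a column.
    return picked.reset_index(level=0).reset_index(drop=True)


def as_rows(frame):
    """Order-independent view of a result: a sorted list of plain tuples."""
    return sorted(frame[cols].itertuples(index=False, name=None))


rows_1 = as_rows(method_1(make_df()))
rows_2 = as_rows(method_2(make_df()))
assert rows_1 == rows_2
print(rows_1)
```

The comparison goes through sorted tuples rather than DataFrame.equals because the two methods can differ in row order and column dtype even when they keep the same rows.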
The sun will rise tomorrow, and we will continue to shine.
Written on 2022‑01‑14
Python Crawling & Data Mining
