Compute Pairwise Column Similarity in Pandas with Custom Methods
This article demonstrates how to calculate similarity between rows in a pandas DataFrame using both a Cartesian‑product approach and vectorized broadcasting, providing complete Python code, performance notes, and step‑by‑step explanations.
Rescue pandas plan (19) – Compute similarity of two columns with a custom method
Many users avoid pandas for data manipulation, so this series aims to showcase pandas' power and help readers fall in love with it.
Series name (article number) – specific problem solved in this article
Platform:
Windows 10
Python 3.8
pandas >=1.2.4
/ Data Requirement
The task is to compute the absolute difference sum between two sample rows, e.g., a sample: [10.0, 9.0, 7.5, 8.6] and b sample: [11.2, 8.7, 6.4, 5.5], resulting in |10.0-11.2|+|9.0-8.7|+|7.5-6.4|+|8.6-5.5|. For each sample we need the five most similar other samples.
Random data generation:
import numpy as np
import pandas as pd
np.random.seed(2022)
data = np.clip(np.random.normal(loc=70, scale=15, size=600).round(2), 0, 100).reshape(60, 10)
df = pd.DataFrame(data, columns=list('abcdefghij'), index=[f'test_{i}' for i in range(1, 61)])Data size: 60 rows × 10 columns, floating‑point values.
/ Solution 1: Cartesian product
Method 1: Cartesian product
First create a cross‑join of the DataFrame, then group and select the top 5 smallest similarity values for each sample.
This mirrors a solution posted by another contributor.
df = df.reset_index().rename(columns={'index': 'Name'})
# add helper column
df['one'] = 1
# merge with itself on the helper column
df_merge = pd.merge(left=df, right=df, left_on='one', right_on='one')The original 60 × 10 data expands to 3,600 rows, increasing memory usage; large datasets would see a noticeable slowdown.
Proceeding without considering data‑size impact, we compute similarity:
columns = list(df.columns)
columns.remove('Name')
columns.remove('one')
def sim_fun(row):
sim_value = 0.0
for col in columns:
sim_value += abs(round(row[col+'_x'] - row[col+'_y'], 2))
return round(sim_value, 2)
# calculate similarity for each pair
df_merge['sim'] = df_merge.apply(sim_fun, axis=1)
# drop self‑comparisons
df_merge = df_merge[df_merge['Name_x'] != df_merge['Name_y']].copy()
def get_top_sims(df_sub):
df_sort = df_sub.sort_values('sim').head(5)
names = ','.join(df_sort['Name_y'])
sims = ','.join(df_sort['sim'].astype(str))
return pd.Series({'names': names, 'sims': sims})
df_result = df_merge.groupby('Name_x').apply(get_top_sims)The result shows the five most similar samples for each of the 60 rows, with acceptable execution time.
/ Solution 2: Vectorized broadcasting
Method 2: Direct row‑wise subtraction using pandas broadcasting
Instead of expanding the data, we can exploit NumPy broadcasting to compute differences more efficiently.
def get_sims(s):
# df is the global DataFrame
df_sim = (df - s).abs().sum(axis=1).round(2)
df_sim.pop(s.name)
df_sim = df_sim.sort_values(ascending=False).head(5).astype(str)
return pd.Series({'names': ','.join(df_sim.index), 'sims': ','.join(df_sim.values)})
# apply row‑wise
df_result = df.apply(get_sims, axis=1)Using broadcasting, the custom get_sims function dramatically improves performance compared with the Cartesian‑product method.
/ Summary
This article examined the problem, reviewed existing solutions, and applied pandas techniques—both a Cartesian‑product approach and a vectorized broadcasting method—to compute row similarity while reducing redundant calculations. Readers are encouraged to discuss any unclear points.
Sharing knowledge benefits everyone.
June 21, 2022
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
