Fundamentals 7 min read

Compute Pairwise Column Similarity in Pandas with Custom Methods

This article demonstrates how to calculate similarity between rows in a pandas DataFrame using both a Cartesian‑product approach and vectorized broadcasting, providing complete Python code, performance notes, and step‑by‑step explanations.

Python Crawling & Data Mining

Jun 24, 2022

Compute Pairwise Column Similarity in Pandas with Custom Methods

Rescue pandas plan (19) – Compute similarity of two columns with a custom method

Many users avoid pandas for data manipulation, so this series aims to showcase pandas' power and help readers fall in love with it.

Series name (article number) – specific problem solved in this article

Platform:

Windows 10

Python 3.8

pandas >=1.2.4

/ Data Requirement

The task is to compute the absolute difference sum between two sample rows, e.g., a sample: [10.0, 9.0, 7.5, 8.6] and b sample: [11.2, 8.7, 6.4, 5.5], resulting in |10.0-11.2|+|9.0-8.7|+|7.5-6.4|+|8.6-5.5|. For each sample we need the five most similar other samples.

Random data generation:

import numpy as np
import pandas as pd

np.random.seed(2022)
data = np.clip(np.random.normal(loc=70, scale=15, size=600).round(2), 0, 100).reshape(60, 10)
df = pd.DataFrame(data, columns=list('abcdefghij'), index=[f'test_{i}' for i in range(1, 61)])

Data size: 60 rows × 10 columns, floating‑point values.

/ Solution 1: Cartesian product

Method 1: Cartesian product

First create a cross‑join of the DataFrame, then group and select the top 5 smallest similarity values for each sample.

This mirrors a solution posted by another contributor.

df = df.reset_index().rename(columns={'index': 'Name'})
# add helper column
df['one'] = 1
# merge with itself on the helper column
df_merge = pd.merge(left=df, right=df, left_on='one', right_on='one')

The original 60 × 10 data expands to 3,600 rows, increasing memory usage; large datasets would see a noticeable slowdown.

Proceeding without considering data‑size impact, we compute similarity:

columns = list(df.columns)
columns.remove('Name')
columns.remove('one')

def sim_fun(row):
    sim_value = 0.0
    for col in columns:
        sim_value += abs(round(row[col+'_x'] - row[col+'_y'], 2))
    return round(sim_value, 2)

# calculate similarity for each pair
df_merge['sim'] = df_merge.apply(sim_fun, axis=1)
# drop self‑comparisons
df_merge = df_merge[df_merge['Name_x'] != df_merge['Name_y']].copy()

def get_top_sims(df_sub):
    df_sort = df_sub.sort_values('sim').head(5)
    names = ','.join(df_sort['Name_y'])
    sims = ','.join(df_sort['sim'].astype(str))
    return pd.Series({'names': names, 'sims': sims})

df_result = df_merge.groupby('Name_x').apply(get_top_sims)

The result shows the five most similar samples for each of the 60 rows, with acceptable execution time.

/ Solution 2: Vectorized broadcasting

Method 2: Direct row‑wise subtraction using pandas broadcasting

Instead of expanding the data, we can exploit NumPy broadcasting to compute differences more efficiently.

def get_sims(s):
    # df is the global DataFrame
    df_sim = (df - s).abs().sum(axis=1).round(2)
    df_sim.pop(s.name)
    df_sim = df_sim.sort_values(ascending=False).head(5).astype(str)
    return pd.Series({'names': ','.join(df_sim.index), 'sims': ','.join(df_sim.values)})

# apply row‑wise
df_result = df.apply(get_sims, axis=1)

Using broadcasting, the custom get_sims function dramatically improves performance compared with the Cartesian‑product method.

/ Summary

This article examined the problem, reviewed existing solutions, and applied pandas techniques—both a Cartesian‑product approach and a vectorized broadcasting method—to compute row similarity while reducing redundant calculations. Readers are encouraged to discuss any unclear points.

Sharing knowledge benefits everyone.

June 21, 2022

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Python data analysis Pandas cartesian product similarity broadcasting

Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.