Fundamentals 6 min read

Using pandas merge() and join() for DataFrame concatenation and performance comparison

This article demonstrates how to use pandas' merge() and join() functions to combine DataFrames, shows examples of merging, joining with single and multi‑index DataFrames, illustrates horizontal concatenation with concat(), and compares the execution speed of merge versus join on large datasets.

Python Programming Learning Circle

Jun 7, 2022

Using pandas merge() and join() for DataFrame concatenation and performance comparison

pandas provides the .merge() and .join() methods to combine DataFrames based on common columns, similar to SQL joins.

First, two dictionaries are created and converted into DataFrames df1 and df2, whose contents are displayed.

import pandas as pd

data1 = {'identification': ['a', 'b', 'c', 'd'],
         'Customer_Name': ['King', 'West', 'Adams', 'Mercy'],
         'Category': ['furniture', 'Office Supplies', 'Technology', 'R_materials']}

data2 = {'identification': ['a', 'b', 'c', 'd'],
         'Class': ['First_Class', 'Second_Class', 'Same_day', 'Standard Class'],
         'Age': [60, 30, 40, 50]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

Using pd.merge(df1, df2, on='identification') merges the two DataFrames into a new one that contains columns from both sources. new_data = pd.merge(df1, df2, on='identification') The article also shows how to join a singly‑indexed DataFrame with a multi‑indexed DataFrame using .join() and the on parameter.

import pandas as pd

# single‑index DataFrame
data1 = {'Customer_Name': ['King', 'West', 'Adams'],
         'Category': ['furniture', 'Office Supplies', 'Technology']}
Ndata = pd.DataFrame(data1, index=pd.Index(['a', 'b', 'c'], name='identification'))

# multi‑index DataFrame
data2 = {'Class': ['First_Class', 'Second_Class', 'Same_day', 'Standard Class'],
         'Age': [60, 30, 40, 50]}
index = pd.MultiIndex.from_tuples([('a', 'x0'), ('b', 'x1'), ('c', 'x2'), ('c', 'x3')],
                                  names=['identification', 'x'])
Ndata2 = pd.DataFrame(data2, index=index)

# join operation
result = Ndata.join(Ndata2, how='inner')

For horizontal concatenation, pd.concat([df1, df2], axis=1) combines the two DataFrames side‑by‑side. new_data = pd.concat([df1, df2], axis=1) A performance experiment compares the execution time of merge and join on DataFrames ranging from 1 million to 10 million rows, revealing that join can be up to five times faster as the data size grows, making it preferable for large‑scale data processing.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

performance merge JOIN dataframe pandas

Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.