
Understanding Pearson, Spearman, and Kendall Correlation Coefficients with Pandas

Learn how Pearson, Spearman, and Kendall correlation coefficients measure linear and monotonic relationships between variables, explore their mathematical properties, interpret their value ranges, and see practical Python examples using Pandas to compute each coefficient on generated data.

Model Perspective

Pearson Correlation Coefficient

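Consider a dataset with two features x and y, each having n values, forming n pairs (x_i, y_i). The Pearson correlation coefficient, usually denoted by r, measures the strength of the linear relationship between the two features. It is the covariance of x and y divided by the product of their standard deviations:

$$r = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{n} (x_i - \mu_x)^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \mu_y)^2}}$$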

Here, μ_x and μ_y denote the means of x and y. The formula shows that if larger x values tend to correspond to larger y values, r is positive; if larger x values tend to correspond to smaller y values, r is negative. Important facts about Pearson's r:

The coefficient can take any real value in the range [-1, 1]. The maximum value +1 corresponds to a perfect positive linear relationship.

A value of 0 indicates no linear correlation.

A value of -1 indicates a perfect negative linear relationship.

In short, the larger the absolute value of r, the stronger the linear correlation; the closer to zero, the weaker.
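As a quick check of the definition, the following minimal sketch computes r directly as covariance over the product of standard deviations and compares it with NumPy's `np.corrcoef`. The helper name `pearson_r` and the small sample arrays are made up for illustration and are not part of the original article.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r from the definition (hypothetical helper for illustration)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    # The 1/n factors of the covariance and the standard deviations cancel.
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [3.0, 5.0, 7.0, 9.0, 11.0]    # y = 2x + 1: perfectly linear, increasing
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # y = 12 - 2x: perfectly linear, decreasing

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_numpy = float(np.corrcoef(x, y_up)[0, 1])  # off-diagonal entry of the 2x2 matrix

print(r_up, r_down, r_numpy)
```

The two perfectly linear samples should give values at the ends of the range, +1 and -1, up to floating-point rounding.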

Spearman Correlation Coefficient

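The Spearman correlation coefficient, denoted by ρ (rho), is the Pearson correlation applied to the ranks of the two features rather than to their raw values. For n pairs (x_i, y_i), replace each x_i and each y_i by its rank within its own feature, then apply the Pearson formula to the ranks. Because only the ordering of the values matters, ρ measures how well a monotonic relationship, not necessarily a linear one, describes the data.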

It ranges between -1 and 1.

The maximum value +1 corresponds to a perfectly monotonic increasing relationship: y always increases when x increases.

The minimum value -1 corresponds to a perfectly monotonic decreasing relationship: y always decreases when x increases.
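The rank-then-Pearson recipe can be sketched in a few lines of Pandas. The example data (an exponential curve) and the variable names are assumptions chosen for illustration: the relationship is strictly increasing but far from linear, so the rank-based coefficient should be 1 up to rounding while Pearson's r on the raw values is noticeably lower.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 11, dtype=float))
y = np.exp(x)  # strictly increasing, strongly nonlinear

# Spearman by hand: rank each series, then take the Pearson correlation of the ranks.
rho_manual = x.rank().corr(y.rank(), method='pearson')

# Pearson on the raw values, for comparison.
r_raw = x.corr(y, method='pearson')

print(rho_manual, r_raw)
```

Calling `x.corr(y, method='spearman')` should give the same value as `rho_manual`, since it performs the same ranking step internally.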

Kendall Correlation Coefficient

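Consider two observations (x_i, y_i) and (x_j, y_j) with i < j. The pair is concordant if x and y are ordered the same way (x_i < x_j and y_i < y_j, or x_i > x_j and y_i > y_j), discordant if they are ordered in opposite ways, and tied if x_i = x_j or y_i = y_j. The Kendall correlation coefficient, denoted by τ (tau), is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs, n(n-1)/2.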

It can take any real value in the range [-1, 1].

The maximum value +1 occurs when all pairs are concordant.

The minimum value -1 occurs when all pairs are discordant.
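The pair counting can be made concrete with a small sketch. The function name `kendall_tau` and the sample lists are hypothetical, and the version below is the simple form that divides by the total number of pairs; it makes no adjustment for ties, so it is only meant for tie-free data.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau by direct pair counting (illustrative helper, no tie correction)."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:      # x and y ordered the same way
            concordant += 1
        elif s < 0:    # x and y ordered in opposite ways
            discordant += 1
        # s == 0 means a tie in x or y; it counts toward neither group
    n = len(x)
    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]          # two adjacent swaps: 8 concordant, 2 discordant pairs
tau = kendall_tau(x, y)      # (8 - 2) / 10
tau_reversed = kendall_tau(x, [5, 4, 3, 2, 1])  # every pair discordant

print(tau, tau_reversed)
```

The double loop over pairs costs O(n²), which is fine for a demonstration but slow for large n.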

Calculating Correlations with Pandas

Generate data and plot

<code>import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

xarray = np.linspace(0, 10, 100)  # generate 100 numbers from 0 to 10
yarray = xarray**3 + np.random.normal(0, 100, 100)  # y = x^3 + normal noise

plt.scatter(xarray, yarray)
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
</code>

Convert to Pandas Series

<code>xseries = pd.Series(xarray)  # convert to Series
yseries = pd.Series(yarray)
</code>

Compute correlations

Pearson

<code>xseries.corr(yseries, method='pearson')  # Pearson correlation</code>

Result: 0.840850116329609 (the noise is random and unseeded, so the exact value differs from run to run)

Spearman

<code>xseries.corr(yseries, method='spearman')  # Spearman correlation</code>

Result: 0.8455325532553255

Kendall

<code>xseries.corr(yseries, method='kendall')  # Kendall correlation</code>

Result: 0.6755555555555557
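Because the noise is drawn without a fixed seed, rerunning the code gives slightly different numbers each time. A minimal sketch of a reproducible variant follows; the seed value 42 and the names `rng` and `results` are arbitrary choices for illustration, not part of the original code. It also assumes SciPy is installed, which `Series.corr` appears to rely on for the `'spearman'` and `'kendall'` methods.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, used only to make the run repeatable

x = pd.Series(np.linspace(0, 10, 100))
y = x**3 + rng.normal(0, 100, 100)  # same model as above: y = x^3 + normal noise

# Compute all three coefficients on the same data.
results = {m: x.corr(y, method=m) for m in ('pearson', 'spearman', 'kendall')}

for method, value in results.items():
    print(f"{method:>8}: {value:.4f}")
```

With a fixed seed the three values stay the same across runs, which makes it easier to compare the coefficients when changing the noise level or the shape of the relationship.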

Reference: Data STUDIO https://mp.weixin.qq.com/s/3XR2_0Mca50-rZO9ZRAzuA

Written by

Model Perspective

Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".
