Understanding Pearson, Spearman, and Kendall Correlation Coefficients with Pandas
Learn how Pearson, Spearman, and Kendall correlation coefficients measure linear and monotonic relationships between variables, explore their mathematical properties, interpret their value ranges, and see practical Python examples using Pandas to compute each coefficient on generated data.
Pearson Correlation Coefficient
Consider a dataset with two features x and y, each having n values, forming n pairs (x_i, y_i). The Pearson correlation coefficient, usually denoted by r, measures the strength of the linear relationship between the two features. It is the covariance of x and y divided by the product of their standard deviations:
r = Σ_i (x_i - μ_x)(y_i - μ_y) / sqrt( Σ_i (x_i - μ_x)^2 · Σ_i (y_i - μ_y)^2 )
Here, μ_x and μ_y denote the means of x and y, and each sum runs over i = 1, ..., n. The formula shows that if larger x values tend to correspond to larger y values, r is positive; if larger x values tend to correspond to smaller y values, r is negative. Important facts about Pearson's r:
The coefficient can take any real value in the range [-1, 1]. The maximum value +1 corresponds to a perfect positive linear relationship.
A value of 0 indicates no linear correlation.
A value of -1 indicates a perfect negative linear relationship.
In short, the larger the absolute value of r, the stronger the linear correlation; the closer to zero, the weaker.
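The definition above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name pearson_r and the small sample arrays are illustrative and not part of the original article.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r from the definition: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0     # perfect positive linear relationship
y_down = -3.0 * x + 7.0  # perfect negative linear relationship

r_up = pearson_r(x, y_up)             # equals +1 up to floating-point rounding
r_down = pearson_r(x, y_down)         # equals -1 up to floating-point rounding
r_numpy = np.corrcoef(x, y_up)[0, 1]  # NumPy's built-in result for comparison
```

The hand-written function and np.corrcoef should agree to within rounding error, since the common 1/n factors in the covariance and the standard deviations cancel.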
Spearman Correlation Coefficient
The Spearman correlation coefficient is the Pearson correlation applied to the ranks of the two features rather than to their raw values. It is denoted by ρ (rho). For n pairs (x_i, y_i), rank the x values and the y values separately, then apply the Pearson formula to the two sets of ranks. Because it depends only on ranks, it measures how monotonic a relationship is, not how linear.
It ranges between -1 and 1.
The maximum value +1 corresponds to a perfectly monotonic increasing relationship: whenever x increases, y increases as well.
The minimum value -1 corresponds to a perfectly monotonic decreasing relationship: whenever x increases, y decreases.
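To illustrate the "Pearson on ranks" description, here is a small sketch, assuming pandas is installed. The five-point sample is made up for illustration. Series.rank, Series.corr, and DataFrame.corr are existing pandas methods; everything else is a hypothetical name.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3  # strictly increasing in x, but not linear

# Pearson on the raw values: below 1 because the relationship is curved.
pearson = x.corr(y, method='pearson')

# Spearman computed by hand: rank each feature, then apply Pearson to the ranks.
spearman_by_ranks = x.rank().corr(y.rank(), method='pearson')

# pandas' built-in Spearman, taken from the correlation matrix of a two-column frame.
frame = pd.DataFrame({'x': x, 'y': y})
spearman_builtin = frame.corr(method='spearman').loc['x', 'y']
```

On this sample the two Spearman values coincide and reach the maximum, while Pearson stays below 1, which is the practical difference between "monotonic" and "linear".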
Kendall Correlation Coefficient
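Consider two observations (x_i, y_i) and (x_j, y_j) with i < j. The pair is concordant if x and y move in the same direction (x_i < x_j and y_i < y_j, or x_i > x_j and y_i > y_j), discordant if they move in opposite directions, and tied if x_i = x_j or y_i = y_j. The Kendall tau coefficient, denoted by τ, is the difference between the number of concordant and discordant pairs relative to the total number of pairs, n(n - 1)/2.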
It can take any real value in the range [-1, 1].
The maximum value +1 occurs when all pairs are concordant.
The minimum value -1 occurs when all pairs are discordant.
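The pair counting can be sketched directly in plain Python. This is a minimal illustration of the simplest variant (often called tau-a), which divides by the total number of pairs and applies no correction for ties; the function name kendall_tau_a and the sample data are assumptions made for this example.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant - discordant) / total pairs, with no tie correction."""
    concordant = 0
    discordant = 0
    for (x_i, y_i), (x_j, y_j) in combinations(zip(x, y), 2):
        direction = (x_i - x_j) * (y_i - y_j)
        if direction > 0:
            concordant += 1  # x and y move in the same direction
        elif direction < 0:
            discordant += 1  # x and y move in opposite directions
        # direction == 0 means the pair is tied in x or y and counts for neither
    n = len(x)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]  # two neighbouring swaps: 8 concordant and 2 discordant pairs

tau = kendall_tau_a(x, y)                  # (8 - 2) / 10 = 0.6
tau_all_concordant = kendall_tau_a(x, x)   # every pair concordant: 1.0
tau_all_discordant = kendall_tau_a(x, x[::-1])  # every pair discordant: -1.0
```

When the data contain no ties, this should match a library implementation; with ties, libraries commonly apply a tie-corrected variant (tau-b), so their result may differ from this sketch.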
Calculating Correlations with Pandas
Generate data and plot
<code>import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
xarray = np.linspace(0, 10, 100) # generate 100 numbers from 0 to 10
yarray = xarray**3 + np.random.normal(0, 100, 100) # y = x^3 + normal noise
plt.scatter(xarray, yarray)
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
</code>
Convert to Pandas Series
<code>xseries = pd.Series(xarray) # convert to Series
yseries = pd.Series(yarray)
</code>
Compute correlations
Pearson
<code>xseries.corr(yseries, method='pearson') # Pearson correlation</code>
Result: 0.840850116329609 (the exact value changes from run to run because the noise is random and no seed is set)
Spearman
<code>xseries.corr(yseries, method='spearman') # Spearman correlation</code>
Result: 0.8455325532553255
Kendall
<code>xseries.corr(yseries, method='kendall') # Kendall correlation</code>
Result: 0.6755555555555557
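Since the noise is drawn at random, rerunning the example produces slightly different coefficients each time. One way to make a run repeatable is to use a seeded generator. The sketch below assumes NumPy's np.random.default_rng; the helper make_series and the seed value 0 are illustrative choices, and the resulting numbers will not match the results quoted above, which came from an unseeded run.

```python
import numpy as np
import pandas as pd

def make_series(seed):
    """Rebuild the x and y = x^3 + noise data with a seeded random generator."""
    rng = np.random.default_rng(seed)     # same seed -> same noise
    x = np.linspace(0, 10, 100)           # 100 evenly spaced values from 0 to 10
    y = x ** 3 + rng.normal(0, 100, 100)  # cubic trend plus normal noise (sd 100)
    return pd.Series(x), pd.Series(y)

xs, ys = make_series(seed=0)

pearson = xs.corr(ys, method='pearson')
# Spearman computed as Pearson on the ranks, as described earlier in the article.
spearman = xs.rank().corr(ys.rank(), method='pearson')

# Building the data again with the same seed reproduces it exactly.
repeatable = ys.equals(make_series(seed=0)[1])
```

Calling make_series with a different seed gives a different noise sample, which is a quick way to see how much the coefficients fluctuate at this noise level.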
Reference: Data STUDIO https://mp.weixin.qq.com/s/3XR2_0Mca50-rZO9ZRAzuA
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".