10 Essential Plots for Linear Regression with Python Code Examples
This tutorial explains ten crucial visualizations for linear regression—scatter plot, trend line, residual plot, normal probability plot, learning curve, bias‑variance tradeoff, residuals vs fitted, partial regression, leverage, and Cook's distance—each illustrated with clear Python code using scikit‑learn, matplotlib, seaborn, and statsmodels.
The article introduces ten important chart types that are indispensable when learning or applying linear regression, showing how each plot helps diagnose model performance, data distribution, and assumptions.
Scatter Plot
A scatter plot visualizes the relationship between two variables, helping to assess whether a linear model is appropriate.
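Before drawing anything, the strength of a linear association can be quantified with the Pearson correlation coefficient, which summarizes numerically what the scatter plot shows visually. A minimal sketch on synthetic data (the variables here are illustrative, not the tutorial's diabetes example):

```python
import numpy as np
from scipy import stats

# Synthetic data with a known linear relationship (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

# Pearson r near +1 or -1 suggests a straight line is a reasonable model
r, p_value = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f} (p = {p_value:.2g})")
```

An `r` close to 1, as here, is the numeric counterpart of the tight linear cloud you would see in the plot below.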
from sklearn.datasets import load_diabetes
# Load dataset
diabetes = load_diabetes()
X = diabetes.data[:, 2] # third feature as independent variable
y = diabetes.target
def simple_linear_regression(X, y):
    # Closed-form least-squares estimates for a single predictor
    X_mean = sum(X) / len(X)
    y_mean = sum(y) / len(y)
    numerator = sum((X - X_mean) * (y - y_mean))
    denominator = sum((X - X_mean) ** 2)
    slope = numerator / denominator
    intercept = y_mean - slope * X_mean
    return slope, intercept

slope, intercept = simple_linear_regression(X, y)

Plot the scatter and the fitted regression line:
import matplotlib.pyplot as plt
plt.scatter(X, y, color='blue', label='Data points')
plt.plot(X, slope*X + intercept, color='red', label='Regression line')
plt.xlabel('X label')
plt.ylabel('y label')
plt.title('Scatter plot with regression line')
plt.legend()
plt.show()

Linear Trend Line Plot
This plot adds a trend line to a scatter plot, making the overall linear relationship clearer, especially in noisy data.
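The trend line seaborn overlays is an ordinary least-squares fit, and its coefficients can be recovered directly with `np.polyfit` when you need the numbers rather than the picture. A small sketch on synthetic data (names and values are illustrative):

```python
import numpy as np

# Synthetic data: known slope 3.0 and intercept 1.0 (illustrative only)
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 3.0 * x + 1.0 + rng.normal(scale=0.3, size=100)

# A degree-1 polynomial fit is exactly the straight trend line
slope, intercept = np.polyfit(x, y, deg=1)
```
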
import seaborn as sns
import matplotlib.pyplot as plt
sns.regplot(x=X, y=y, color='red', scatter_kws={'color':'blue','s':10})
plt.xlabel('X label')
plt.ylabel('y label')
plt.title('Linear trend line plot')
plt.show()

Residual Plot
A residual plot shows the differences between observed and predicted values, helping to detect non‑random patterns that indicate model issues.
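One pattern the eye can miss in a residual plot is serial correlation; the Durbin-Watson statistic condenses it into a single number. A sketch on synthetic residuals (`durbin_watson` is real statsmodels API; the data here is illustrative):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Synthetic, uncorrelated residuals (illustrative only)
rng = np.random.default_rng(2)
residuals = rng.normal(size=200)

# Values near 2 indicate little autocorrelation;
# values near 0 or 4 indicate strong positive or negative autocorrelation
dw = durbin_watson(residuals)
print(f"Durbin-Watson: {dw:.2f}")
```

The residual plot below gives the visual counterpart of this check.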
y_pred = slope * X + intercept
residuals = y - y_pred
plt.scatter(X, residuals, color='blue')
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('X label')
plt.ylabel('Residuals')
plt.title('Residual plot')
plt.show()

Normal Probability Plot
This plot checks whether residuals follow a normal distribution, a key assumption for linear regression inference.
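The visual check can be paired with a formal normality test such as Shapiro-Wilk. A sketch on synthetic residuals (illustrative data; `scipy.stats.shapiro` is the real API):

```python
import numpy as np
from scipy import stats

# Synthetic residuals drawn from a normal distribution (illustrative only)
rng = np.random.default_rng(3)
residuals = rng.normal(size=200)

# A W statistic close to 1 and a large p-value are consistent with normality
w_stat, p_value = stats.shapiro(residuals)
```
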
import scipy.stats as stats
import matplotlib.pyplot as plt
stats.probplot(residuals, dist='norm', plot=plt)
plt.xlabel('Theoretical quantiles')
plt.ylabel('Ordered residuals')
plt.title('Normal probability plot')
plt.show()

Learning Curve
The learning curve displays training and cross‑validation scores as the number of training examples grows, revealing over‑ or under‑fitting.
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
model = LinearRegression()
train_sizes, train_scores, valid_scores = learning_curve(
model, X[:, np.newaxis], y, train_sizes=[50,100,200,300], cv=5)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
valid_mean = np.mean(valid_scores, axis=1)
valid_std = np.std(valid_scores, axis=1)
plt.fill_between(train_sizes, train_mean-train_std, train_mean+train_std, alpha=0.1, color='r')
plt.fill_between(train_sizes, valid_mean-valid_std, valid_mean+valid_std, alpha=0.1, color='g')
plt.plot(train_sizes, train_mean, 'o-', color='r', label='Training score')
plt.plot(train_sizes, valid_mean, 'o-', color='g', label='Cross-validation score')
plt.xlabel('Training examples')
plt.ylabel('Score')
plt.title('Learning curve')
plt.legend(loc='best')
plt.show()

Bias-Variance Tradeoff Plot
This plot visualizes how model complexity affects bias and variance, guiding the choice of an appropriate model.
from sklearn.model_selection import validation_curve
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
# Vary polynomial degree as the complexity parameter
param_range = np.arange(1, 10)
model = make_pipeline(PolynomialFeatures(), LinearRegression())
train_scores, valid_scores = validation_curve(
    model, X[:, np.newaxis], y, param_name='polynomialfeatures__degree',
    param_range=param_range, cv=5, scoring='neg_mean_squared_error')
train_mean = -np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
valid_mean = -np.mean(valid_scores, axis=1)
valid_std = np.std(valid_scores, axis=1)
plt.fill_between(param_range, train_mean-train_std, train_mean+train_std, alpha=0.1, color='r')
plt.fill_between(param_range, valid_mean-valid_std, valid_mean+valid_std, alpha=0.1, color='g')
plt.plot(param_range, train_mean, 'o-', color='r', label='Training score')
plt.plot(param_range, valid_mean, 'o-', color='g', label='Cross-validation score')
plt.xlabel('Polynomial degree (model complexity)')
plt.ylabel('Mean Squared Error')
plt.title('Bias-variance tradeoff plot')
plt.legend(loc='best')
plt.show()

Residuals vs Fitted Plot
This plot checks whether residuals are randomly distributed around zero across the range of fitted values.
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
model = LinearRegression()
model.fit(X.reshape(-1,1), y)
y_pred = model.predict(X.reshape(-1,1))
residuals = y - y_pred
plt.scatter(y_pred, residuals, color='blue')
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted plot')
plt.show()

Partial Regression Plot
This plot isolates the effect of a single predictor while controlling for others, revealing its independent contribution.
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Fit on all ten diabetes features so there are covariates to control for
X_full = sm.add_constant(diabetes.data)
model = sm.OLS(y, X_full).fit()
# One partial regression panel per predictor, other predictors held fixed
fig = sm.graphics.plot_partregress_grid(model)
fig.tight_layout()
plt.show()

Leverage Plot
Leverage plots identify points that have a disproportionate influence on the fitted regression coefficients.
import statsmodels.api as sm
import statsmodels.graphics.regressionplots as rp
import matplotlib.pyplot as plt
X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
# Axes are labeled by statsmodels: normalized residuals squared (x) vs. leverage (y)
fig = rp.plot_leverage_resid2(model)
plt.show()

Cook's Distance Plot
Cook's distance quantifies the influence of each observation on the overall regression fit, flagging potentially problematic points.
import statsmodels.api as sm
import matplotlib.pyplot as plt
X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
influence = model.get_influence()
cook_dist = influence.cooks_distance[0]
plt.stem(cook_dist, markerfmt=',')
plt.xlabel('Data points')
plt.ylabel("Cook's distance")
plt.title("Cook's Distance Plot")
plt.show()

Each of these visualizations provides a different perspective on model fit, assumptions, and data quality, enabling a comprehensive evaluation of linear regression models.