Artificial Intelligence · 14 min read

Scikit-learn Tutorial: Supervised Learning with Linear Regression

This article is a comprehensive guide to supervised learning with Python's scikit-learn library, focusing on linear regression. It covers the theoretical background, environment setup, data preprocessing, model training, evaluation with mean squared error, cross-validation, and detailed code examples.

Qunar Tech Salon

This tutorial introduces the fundamentals of machine learning with scikit-learn, emphasizing supervised learning through linear regression. It begins with a brief motivation for machine learning, explaining how algorithms can replace costly expert systems by learning from user behavior.

Environment preparation: Install Python 2.7.15 (or a newer version) on Windows, add it to the system PATH, and verify the installation with pip2 --version. Then install the essential scientific packages (NumPy, SciPy, pandas) and scikit-learn from the provided wheel links.
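The installation steps above can be sketched as shell commands. This is a minimal outline assuming a network-reachable package index; the specific wheel URLs mentioned in the article are not reproduced here:

```shell
# Verify the Python installation is on the PATH
pip2 --version

# Install the scientific stack and scikit-learn
pip2 install numpy scipy pandas scikit-learn
```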

IDE setup: Configure IntelliJ IDEA with the Python plugin, create a new Python project, and add a .py file for the code.

Linear regression basics: The goal is to find a line f(x) = kx + b (or a hyperplane for multivariate data) that minimizes the sum of squared vertical distances (residuals) between the samples and the line.
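This idea can be illustrated on synthetic 1-D data before moving to scikit-learn. The sketch below (not from the article; the data is generated here for illustration) fits f(x) = kx + b by least squares with np.polyfit:

```python
import numpy as np

# Synthetic 1-D data roughly following y = 2x + 1, with Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

# Least-squares fit of f(x) = kx + b: polyfit with deg=1 minimizes
# the sum of squared vertical distances between samples and the line
k, b = np.polyfit(x, y, deg=1)
print(k, b)  # close to the true values 2 and 1
```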

Key Python code snippets:

import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv(r'D:\python\20180606\Folds5x2_pp.csv')  # raw string so backslashes are not treated as escapes

# Prepare features and target
X = data[['AT', 'V', 'AP', 'RH']]
Y = data[['PE']]

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=1)

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, Y_train)

print(linreg.intercept_)  # e.g., [460.05727267]
print(linreg.coef_)      # e.g., [[-1.96865472 -0.2392946  0.0568509 -0.15861467]]

Y_pred = linreg.predict(X_test)

from sklearn import metrics
print('MSE:', metrics.mean_squared_error(Y_test, Y_pred))

# Cross‑validation
from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(linreg, X, Y, cv=10)
print('MSE (CV):', metrics.mean_squared_error(Y, predicted))

The code demonstrates loading a CSV dataset, selecting four sensor features (AT, V, AP, RH) to predict power output (PE), splitting the data into training and test sets, fitting a LinearRegression model, and evaluating performance using mean squared error (MSE).

Model evaluation: MSE is used as the primary metric for regression tasks. The tutorial shows how to compute MSE on the test set and how to assess each feature's contribution by removing it and re-computing the MSE.
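The feature-removal comparison can be sketched as follows. Since the article's CSV lives on a local disk, this example uses a synthetic stand-in with the same four column names (AT, V, AP, RH); the helper function mse_for is not from the article:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CCPP data: four features, one of which
# ('RH') contributes only weakly to the target
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=['AT', 'V', 'AP', 'RH'])
y = 3 * X['AT'] - 2 * X['V'] + X['AP'] + 0.01 * X['RH'] \
    + rng.normal(scale=0.1, size=500)

def mse_for(features):
    # Train on a subset of columns and report test-set MSE
    X_train, X_test, y_train, y_test = train_test_split(
        X[features], y, random_state=1)
    model = LinearRegression().fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))

mse_all = mse_for(['AT', 'V', 'AP', 'RH'])
mse_drop = mse_for(['AT', 'V', 'AP'])   # drop 'RH' and re-compute
print(mse_all, mse_drop)
```

A near-unchanged MSE after dropping a feature suggests it adds little predictive value.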

Cross-validation: Using cross_val_predict with cv=10 provides predictions for each sample based on models trained on the remaining folds, allowing a more robust estimate of model performance.

Linear regression theory: The article explains the mathematical formulation of linear regression, the least-squares objective, the derivation of the optimal weights w and bias b, and the closed-form matrix solution w = (X^T X)^{-1} X^T Y. It also discusses the case where the feature matrix is not full rank and the role of regularization.
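The closed-form solution can be checked numerically against scikit-learn. This sketch (on synthetic data, not the article's dataset) absorbs the bias b into w by appending a column of ones to X:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known weights [1.5, -2.0, 0.5] and bias 4.0
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

# Normal equation w = (X^T X)^{-1} X^T y, with a ones column so the
# last entry of w is the bias b. If X^T X were not full rank,
# np.linalg.pinv (or regularization) would be needed instead of inv.
Xb = np.hstack([X, np.ones((100, 1))])
w = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y

# Compare with scikit-learn's fitted coefficients and intercept
model = LinearRegression().fit(X, y)
print(w[:3], w[3])
print(model.coef_, model.intercept_)  # should match closely
```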

Overall, the tutorial combines conceptual explanations, practical environment setup, and end‑to‑end Python code to enable readers to build, evaluate, and understand linear regression models with scikit-learn.

Python · model evaluation · linear regression · supervised learning · scikit-learn · cross-validation
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
