
A Beginner’s Guide to Data Preprocessing for Machine Learning in Python

This tutorial walks beginners through the essential steps of data preprocessing for any machine learning model, covering library imports, dataset loading, handling missing values, encoding categorical features, splitting into train‑test sets, and applying feature scaling using Python’s scikit‑learn.

Python Programming Learning Circle

Data preprocessing is the essential first step in building any machine learning model, as it directly influences the model’s performance.

The article begins by introducing the three fundamental Python libraries—NumPy, Matplotlib, and Pandas—and shows how to import them:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

It then demonstrates loading a CSV file with Pandas and separating the feature matrix X and target vector y.

dataset = pd.read_csv('my_data.csv')
X = dataset.iloc[:, :-1].values   # every column except the last is a feature
y = dataset.iloc[:, -1].values    # the last column is the target

Missing values are handled with scikit-learn's SimpleImputer (the older Imputer class was removed in scikit-learn 0.22) using a mean‑filling strategy, followed by fitting on and transforming the relevant columns.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])            # learn the column means
X[:, 1:3] = imputer.transform(X[:, 1:3])    # replace NaNs with those means
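As a sanity check, the mean‑filling behavior can be verified on a small toy matrix (hypothetical values, not the article's dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix: column 0 is missing a value in row 1.
toy = np.array([[1.0, 2.0],
                [np.nan, 4.0],
                [3.0, 6.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(toy)
# Column 0's mean over the observed values is (1 + 3) / 2 = 2.0,
# so the NaN is replaced by 2.0.
print(filled)  # [[1. 2.] [2. 4.] [3. 6.]]
```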

Categorical feature columns are one‑hot encoded into dummy variables. The older pattern of LabelEncoder followed by OneHotEncoder(categorical_features=[0]) no longer works, because the categorical_features parameter was removed; in recent scikit-learn versions, OneHotEncoder is applied to the chosen columns through a ColumnTransformer.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
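To see what the resulting dummy variables look like, here is a minimal sketch on hypothetical toy data, assuming one categorical column followed by one numeric column:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

toy = np.array([['France', 44.0],
                ['Spain', 27.0],
                ['Germany', 30.0]], dtype=object)

# One-hot encode column 0; pass the numeric column through unchanged.
ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')
encoded = ct.fit_transform(toy)
# Categories are ordered alphabetically:
# France -> [1,0,0], Germany -> [0,1,0], Spain -> [0,0,1]
print(encoded)
```

Each category becomes its own 0/1 column, so the single string column expands into three dummy columns plus the untouched numeric column.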

The target vector y is also label‑encoded if it is categorical.

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
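LabelEncoder assigns integer codes to the sorted class labels; a minimal illustration with a hypothetical binary target:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_toy = ['no', 'yes', 'no', 'yes']        # hypothetical binary target
y_enc = le.fit_transform(y_toy)
# Classes are sorted alphabetically, so 'no' -> 0 and 'yes' -> 1.
print(list(le.classes_), y_enc.tolist())  # ['no', 'yes'] [0, 1, 0, 1]
```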

Next, the dataset is split into training and test sets using an 80/20 split.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
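The split proportions are easy to confirm on hypothetical data: with test_size=0.2, ten samples yield eight training rows and two test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
# An 80/20 split of 10 samples gives 8 training and 2 test rows.
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Fixing random_state makes the shuffle reproducible, so the same rows land in the same split on every run.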

Feature scaling is applied with StandardScaler so that all features are on a comparable scale: the scaler is fitted on the training set only, then used to transform both the training and test sets, since fitting on the test set would leak information into the model.

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
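What standardization achieves can be checked directly: after fit_transform, each training column has (approximately) zero mean and unit variance. A quick sketch on hypothetical random features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=10.0, scale=5.0, size=(100, 3))  # hypothetical features

sc = StandardScaler()
X_tr_scaled = sc.fit_transform(X_tr)
# Each column is shifted by its mean and divided by its standard deviation.
print(X_tr_scaled.mean(axis=0).round(6))  # ~[0. 0. 0.]
print(X_tr_scaled.std(axis=0).round(6))   # ~[1. 1. 1.]
```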

The article concludes by reminding readers that data cleaning, handling missing values, encoding categorical variables, and scaling are all crucial steps before training a model.
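The steps above can also be composed into a single preprocessing object. The following is a minimal sketch, not the article's code, using a hypothetical DataFrame that mirrors the article's layout (one categorical column, two numeric columns with gaps, one target):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data resembling the tutorial's dataset.
df = pd.DataFrame({
    'country':   ['France', 'Spain', 'Germany', 'France', 'Spain'],
    'age':       [44.0, 27.0, np.nan, 38.0, 30.0],
    'salary':    [72000.0, 48000.0, 54000.0, np.nan, 61000.0],
    'purchased': ['no', 'yes', 'no', 'no', 'yes'],
})
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Numeric columns: impute missing values, then standardize.
numeric = Pipeline([('impute', SimpleImputer(strategy='mean')),
                    ('scale', StandardScaler())])
# Categorical column: one-hot encode; ignore categories unseen in training.
preprocess = ColumnTransformer([
    ('num', numeric, ['age', 'salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['country']),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train_p = preprocess.fit_transform(X_train)  # fit on training data only
X_test_p = preprocess.transform(X_test)        # reuse training statistics
print(X_train_p.shape)
```

Bundling the steps this way guarantees that imputation means, scaling statistics, and one-hot categories are all learned from the training split alone and then reapplied unchanged to the test split.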

Tags: machine learning, Python, data preprocessing, feature scaling, scikit-learn, missing values, one-hot encoding
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
