A Beginner’s Guide to Data Preprocessing for Machine Learning in Python
This tutorial walks beginners through the essential steps of data preprocessing for any machine learning model, covering library imports, dataset loading, handling missing values, encoding categorical features, splitting into train‑test sets, and applying feature scaling using Python’s scikit‑learn.
Data preprocessing comes before any model training, and the quality of the prepared data directly influences the model’s performance.
The article begins by introducing the three fundamental Python libraries—NumPy, Matplotlib, and Pandas—and shows how to import them:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
It then demonstrates loading a CSV file with Pandas and separating the feature matrix X and target vector y.
dataset = pd.read_csv('my_data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
Missing values are handled using scikit-learn’s SimpleImputer (which replaced the older Imputer class, removed in scikit-learn 0.22) with a mean-filling strategy, followed by fitting and transforming the relevant columns.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
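X[:, 1:3] = imputer.transform(X[:, 1:3])
Categorical attributes are converted to dummy variables with OneHotEncoder, applied to the first column through ColumnTransformer. Recent scikit-learn versions encode string categories directly, so a separate LabelEncoder pass over X is no longer needed, and the categorical_features argument was removed in version 0.22.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# One-hot encode column 0 and pass the remaining columns through unchanged
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X).astype(float)
The target vector y is also label-encoded if it is categorical.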
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
Next, the dataset is split into training and test sets using an 80/20 split.
from sklearn.model_selection import train_test_split
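X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Feature scaling is applied with StandardScaler so that all features are on a comparable scale (zero mean, unit variance). The scaler is fit on the training set only and then used to transform both the training and test sets, so that no information from the test set leaks into the scaling statistics.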
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
The article concludes by reminding readers that data cleaning, handling missing values, encoding categorical variables, and scaling are all crucial steps before training a model.
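The steps above can also be combined into a single scikit-learn pipeline. The following is a minimal sketch rather than the article’s own code: the toy data, the column names (country, age, salary, purchased), and the variable names are made up for illustration, and it assumes a recent scikit-learn release. Unlike the step-by-step listing, it learns the imputation means from the training rows only and leaves the one-hot columns unscaled.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Made-up toy data standing in for 'my_data.csv' (all column names are assumptions).
dataset = pd.DataFrame({
    "country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

X = dataset.drop(columns="purchased")
# LabelEncoder sorts the class names, so "No" becomes 0 and "Yes" becomes 1.
y = LabelEncoder().fit_transform(dataset["purchased"])

# Numeric columns: fill missing values with the column mean, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical column: one-hot encode; categories unseen during fit become all zeros.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, ["age", "salary"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Means, standard deviations, and categories are learned from the training rows only.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

Keeping every step inside one ColumnTransformer means the same fitted object can later be applied to new data with a single transform call, which makes it harder to accidentally refit a scaler or imputer on test rows.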
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.