Model Performance Lagging? Master Feature Engineering with a Complete Step‑by‑Step Guide

This article walks through the entire feature‑engineering pipeline—data cleaning, missing‑value imputation, encoding, outlier handling, scaling, feature construction, and selection—using Pandas and Scikit‑learn, and shows how to wrap the steps into a reproducible Scikit‑learn Pipeline.

Data Party THU

What Is Feature Engineering?

Feature engineering transforms raw data into meaningful input variables that improve machine‑learning model performance. Poor or noisy features limit even the best algorithms.

Step 1 – Exploratory Data Analysis (EDA)

Before creating features, understand the data.

import pandas as pd
import numpy as np
np.random.seed(42)

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 35, 120, 28],
    'Salary': [50000, 60000, 55000, 80000, np.nan, 1000000, 62000],
    'Gender': ['Male', 'Female', 'Female', np.nan, 'Male', 'Male', 'Female'],
    'City': ['NY', 'LA', 'NY', 'SF', np.nan, 'LA', 'SF'],
    'Experience': [1, 3, 2, 10, 7, 25, 4],
    'Date': pd.date_range(start='2024-01-01', periods=7),
    'Target': [0, 1, 0, 1, 0, 1, 0]
})
print(df.head())
df.info()  # info() prints its report itself and returns None
print(df.describe())

Check missing values and visualize distributions:

print(df.isnull().sum())
import matplotlib.pyplot as plt
df.hist(figsize=(12,10))
plt.show()
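
The missing-value check above can be condensed into a single per-column summary table. The following is a minimal sketch on a hypothetical toy frame (`toy` and the `summary` column names are illustrative assumptions, not part of the original walkthrough):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mirroring a few columns of the example data.
toy = pd.DataFrame({
    'Age': [25, 30, np.nan, 40],
    'Salary': [50000, np.nan, 55000, 80000],
    'City': ['NY', 'LA', None, 'SF'],
})

# One row per column: dtype, number of missing values, share of missing values.
summary = pd.DataFrame({
    'dtype': toy.dtypes.astype(str),
    'n_missing': toy.isnull().sum(),
    'pct_missing': toy.isnull().mean().mul(100).round(1),
})
print(summary)
```

A table like this makes it easier to decide, column by column, whether to impute, drop, or flag missing values.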

Step 2 – Missing‑Value Imputation

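Missing values degrade accuracy, and many scikit-learn estimators raise an error on NaN input. Three strategies follow: median imputation for numeric columns, most-frequent imputation for categorical columns, and KNN imputation.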

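from sklearn.impute import SimpleImputer, KNNImputer

# Strategy 1: median imputation for a numeric column (robust to outliers)
imputer = SimpleImputer(strategy='median')
df['Age'] = imputer.fit_transform(df[['Age']]).ravel()

# Strategy 2: most-frequent imputation for categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['Gender', 'City']] = cat_imputer.fit_transform(df[['Gender', 'City']])

# Strategy 3: KNN imputation for the remaining numeric gaps (here, Salary).
# Target is excluded so the label never leaks into the features.
numeric_cols = df.select_dtypes(include='number').columns.drop('Target')
knn_imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])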
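
In a real project the imputer is usually fit on the training split only and then reused on new data, so that statistics from the test set never influence the fill values. A minimal sketch of that pattern, assuming hypothetical `train` and `test` frames:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test frames; only the training rows define the median.
train = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({'Age': [np.nan, 100.0]})

imputer = SimpleImputer(strategy='median')
train_filled = imputer.fit_transform(train)  # learns the median from train
test_filled = imputer.transform(test)        # reuses the train median

print(imputer.statistics_)  # the learned fill value per column
print(test_filled)
```

The extreme value in `test` has no effect on the fill value, because `transform` only applies what `fit` learned.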

Step 3 – Categorical Encoding

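Most scikit-learn estimators expect numeric input, so categorical columns must be encoded first. Label encoding suits binary or ordinal columns; one-hot encoding suits nominal columns with no natural order.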

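from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# LabelEncoder rejects NaN mixed with strings, so fill any remaining gaps first
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
# Binary column -> 0/1 (LabelEncoder targets labels; OrdinalEncoder is the feature-side equivalent)
encoder = LabelEncoder()
df['Gender'] = encoder.fit_transform(df['Gender'])
# Nominal column -> one-hot columns; drop_first avoids a redundant column
df = pd.get_dummies(df, columns=['City'], drop_first=True)
# Inside a Pipeline, OneHotEncoder does the same job and tolerates unseen categories
ohe = OneHotEncoder(handle_unknown='ignore')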

Step 4 – Outlier Detection & Treatment

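Outliers can distort learning, especially for models that rely on means or distances. The IQR rule below flags them; the flagged values can then be dropped or capped (Winsorization).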

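from scipy.stats.mstats import winsorize

# IQR rule: flag values more than 1.5 * IQR outside the quartiles
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['Salary'] < lower) | (df['Salary'] > upper)]
print(outliers)

# Option A: drop the flagged rows (.copy() avoids SettingWithCopyWarning later)
df = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)].copy()

# Option B: cap instead of dropping. limits are the fractions clipped at each tail;
# with only a handful of rows, 5% rounds down to zero values, so this is a no-op here.
df['Salary'] = np.asarray(winsorize(df['Salary'].to_numpy(), limits=[0.05, 0.05]))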
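
When rows are too valuable to drop, the IQR bounds themselves can serve as caps. This is a sketch of that variant, not the method used above; the helper `iqr_bounds` and the sample `salary` series are hypothetical:

```python
import pandas as pd

def iqr_bounds(s, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical salaries with one extreme value.
salary = pd.Series([50000, 60000, 55000, 80000, 62000, 1000000], name='Salary')

lower, upper = iqr_bounds(salary)
# clip() pulls values outside the fences back to the nearest fence.
capped = salary.clip(lower=lower, upper=upper)
print(capped)
```

All six rows survive, and only the extreme value is changed. Whether capping or dropping is preferable depends on whether the outlier is a data error or a genuine but rare observation.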

Step 5 – Feature Scaling & Normalization

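Features on very different scales can dominate distance-based and gradient-based models such as k-NN, SVMs, and regularized linear models. Tree-based models are largely insensitive to scaling.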

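from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# The three scalers are alternatives: pick one rather than chaining them.
# Results go to separate variables so df keeps the original units for Step 6.
num = df[['Age', 'Salary']]

# Zero mean, unit variance; a common default for linear models and SVMs
scaler = StandardScaler()
scaled_standard = scaler.fit_transform(num)

# Rescales each column to [0, 1]; sensitive to extreme values
minmax = MinMaxScaler()
scaled_minmax = minmax.fit_transform(num)

# Centers on the median and divides by the IQR; less affected by outliers
robust = RobustScaler()
scaled_robust = robust.fit_transform(num)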
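
To see how the three scalers differ in practice, it can help to run them on the same column with one extreme value. A small sketch under that assumption (the array `x` is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical single feature with one extreme value (1000).
x = np.array([[50.0], [55.0], [60.0], [62.0], [1000.0]])

scaled = {
    'standard': StandardScaler().fit_transform(x).ravel(),
    'minmax': MinMaxScaler().fit_transform(x).ravel(),
    'robust': RobustScaler().fit_transform(x).ravel(),
}
for name, values in scaled.items():
    print(name, np.round(values, 2))
```

With min-max scaling the four ordinary values are squeezed close to 0 because the extreme value defines the top of the range, while the robust scaler keeps them spread around the median.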

Step 6 – Feature Construction & Transformation

Creating informative features often yields the biggest accuracy gains.

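from sklearn.preprocessing import PolynomialFeatures

# Date-time features
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# Polynomial features: squares and the Age x Experience interaction
# (include_bias=False leaves out the constant column of ones)
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['Age', 'Experience']])
print(poly.get_feature_names_out())

# Binning a numeric variable. The bin edges are in years, so Age must still be
# in its original units here (not scaled values).
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 18, 35, 60, 100],
    labels=['Teen', 'Young', 'Adult', 'Senior'])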

Step 7 – Feature Selection

Selecting truly important features reduces over‑fitting and speeds up training.

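import seaborn as sns
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

# Correlation heatmap of the numeric columns
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True)

# Both selectors need a purely numeric matrix, so keep numeric/boolean columns
# and drop constant ones (Year and Month never vary in this sample).
X = df.drop('Target', axis=1).select_dtypes(include=['number', 'bool'])
X = X.loc[:, X.nunique() > 1]
y = df['Target']
k = min(5, X.shape[1])

# Filter method: rank features by ANOVA F-score and keep the top k
selector = SelectKBest(score_func=f_classif, k=k)
X_new = selector.fit_transform(X, y)

# Wrapper method: recursive feature elimination around a linear model
model = LogisticRegression(max_iter=1000)
rfe = RFE(model, n_features_to_select=k)
X_rfe = rfe.fit_transform(X, y)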
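
Both selectors return plain arrays, so the names of the surviving features have to be recovered from the fitted object. A minimal sketch using synthetic data from `make_classification` (the data and the column names `f0` to `f5` are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data standing in for a real feature matrix.
X_arr, y = make_classification(n_samples=200, n_features=6, n_informative=2,
                               n_redundant=0, n_clusters_per_class=1,
                               random_state=0)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(6)])

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# get_support() is a boolean mask over the input columns.
selected = X.columns[selector.get_support()].tolist()
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(selected)
print(scores)
```

`RFE` exposes the same `get_support()` method, so the lookup works identically for the wrapper approach.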

Full Pipeline Example

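Combining preprocessing steps into a Scikit-learn Pipeline helps prevent data leakage: every imputer, scaler, and encoder is fit on the training data only, and the same fitted transformations are reapplied at prediction time.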

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

numeric_features = ['Age','Salary','Experience']
categorical_features = ['Gender','City']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

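# Steps 2-6 modified df in place (City was one-hot encoded, Gender label-encoded),
# so the pipeline is fit on a fresh copy of the raw Step 1 data instead.
raw_df = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 35, 120, 28],
    'Salary': [50000, 60000, 55000, 80000, np.nan, 1000000, 62000],
    'Gender': ['Male', 'Female', 'Female', np.nan, 'Male', 'Male', 'Female'],
    'City': ['NY', 'LA', 'NY', 'SF', np.nan, 'LA', 'SF'],
    'Experience': [1, 3, 2, 10, 7, 25, 4],
    'Target': [0, 1, 0, 1, 0, 1, 0]
})

X = raw_df.drop('Target', axis=1)
y = raw_df['Target']
# stratify keeps both classes in the training split of this tiny sample
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
model_pipeline.fit(X_train, y_train)
print("Model trained successfully")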
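
Because the whole workflow is a single estimator, it can be passed straight to cross-validation, which refits the preprocessing inside every fold. The sketch below assumes a synthetic frame `demo` and a made-up target derived from `Salary`; neither comes from the original example:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: large enough for 5-fold CV, with some missing ages.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'Age': rng.integers(20, 60, size=n).astype(float),
    'Salary': rng.normal(60000, 15000, size=n),
    'City': rng.choice(['NY', 'LA', 'SF'], size=n),
})
demo.loc[rng.choice(n, size=20, replace=False), 'Age'] = np.nan
y = (demo['Salary'] > 60000).astype(int)  # hypothetical target

numeric = Pipeline([('imputer', SimpleImputer(strategy='median')),
                    ('scaler', StandardScaler())])
preprocessor = ColumnTransformer([
    ('num', numeric, ['Age', 'Salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['City']),
])
pipe = Pipeline([('preprocessor', preprocessor),
                 ('classifier', LogisticRegression(max_iter=1000))])

# Each fold fits the imputer, scaler, and encoder on its own training part.
scores = cross_val_score(pipe, demo, y, cv=5)
print(scores.mean())
```

Scaling or imputing the full dataset before cross-validation would let information from the validation folds leak into training; wrapping the steps in the pipeline avoids that.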

Conclusion

Feature engineering is the foundation of most successful machine-learning projects. Clean, transformed, and informative features often matter more than the choice of a sophisticated algorithm. Following the steps above, and validating each change on held-out data, often improves model accuracy and robustness.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
