Model Performance Lagging? Master Feature Engineering with a Complete Step‑by‑Step Guide
This article walks through the entire feature‑engineering pipeline—data cleaning, missing‑value imputation, encoding, outlier handling, scaling, feature construction, and selection—using Pandas and Scikit‑learn, and shows how to wrap the steps into a reproducible Scikit‑learn Pipeline.
What Is Feature Engineering?
Feature engineering transforms raw data into meaningful input variables that improve machine‑learning model performance. Poor or noisy features limit even the best algorithms.
Step 1 – Exploratory Data Analysis (EDA)
Before creating features, understand the data.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({
'Age': [25, 30, np.nan, 40, 35, 120, 28],
'Salary': [50000, 60000, 55000, 80000, np.nan, 1000000, 62000],
'Gender': ['Male', 'Female', 'Female', np.nan, 'Male', 'Male', 'Female'],
'City': ['NY', 'LA', 'NY', 'SF', np.nan, 'LA', 'SF'],
'Experience': [1, 3, 2, 10, 7, 25, 4],
'Date': pd.date_range(start='2024-01-01', periods=7),
'Target': [0, 1, 0, 1, 0, 1, 0]
})
print(df.head())
df.info()  # info() prints its report directly and returns None
print(df.describe())
Check missing values and visualize distributions:
print(df.isnull().sum())
import matplotlib.pyplot as plt
df.hist(figsize=(12,10))
plt.show()
Step 2 – Missing‑Value Imputation
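Missing data degrades accuracy, and many estimators reject NaN values outright. Three common strategies are median imputation for numeric columns, most‑frequent imputation for categorical columns, and KNN imputation, which fills a gap from the most similar rows.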
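As a minimal sketch of how an imputer is meant to be used, the example below fits the median on one set of rows and reuses it on another, which keeps test rows from influencing the fill value. The `train` and `test` frames are hypothetical toy data, not part of the article's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, split by hand into "train" and "test" rows.
train = pd.DataFrame({"Age": [25.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 35.0]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # learns the median from train only
test_filled = imputer.transform(test)        # reuses that median, no refit

print(imputer.statistics_)   # learned fill value per column
print(test_filled.ravel())   # the test gap is filled with the train median
```

Here the learned median is 30 (from 25, 30, and 40), so the missing test value becomes 30 rather than a statistic computed from the test rows.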
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
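df['Age'] = imputer.fit_transform(df[['Age']]).ravel()  # flatten the (n, 1) result into a column
# Categorical column: fill with the most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
df['City'] = cat_imputer.fit_transform(df[['City']]).ravel()
# KNN imputation for the remaining numeric gaps; Target is excluded so the label does not leak into the features
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=5)
numeric_cols = df.select_dtypes(include='number').columns.drop('Target')
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])
Step 3 – Categorical Encoding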
Most Scikit‑learn estimators require numeric inputs, so categorical columns must be encoded first.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
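df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])  # Gender was not imputed in Step 2
df['Gender'] = encoder.fit_transform(df['Gender'])  # LabelEncoder is designed for targets; OrdinalEncoder is the feature-oriented equivalent
# One-hot encoding with pandas
df = pd.get_dummies(df, columns=['City'], drop_first=True)
# One-hot encoding with Scikit-learn, which can ignore categories unseen during fit
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
Step 4 – Outlier Detection & Treatment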
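Outliers can distort learning. Two common treatments are the IQR rule, which flags values more than 1.5 × IQR beyond the quartiles, and Winsorization, which caps extreme values at chosen percentiles.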
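The IQR rule can be worked through on the six non-missing salaries from the sample data. The helper name `iqr_bounds` is hypothetical, and the final step caps values at the IQR fences with `clip`, which is a simple alternative to dropping rows rather than percentile-based Winsorization.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences of the IQR rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# The non-missing Salary values from the article's sample frame.
salary = pd.Series([50000, 60000, 55000, 80000, 1000000, 62000], dtype=float)

lower, upper = iqr_bounds(salary)
is_outlier = (salary < lower) | (salary > upper)

# Cap values at the fences instead of removing the rows.
capped = salary.clip(lower=lower, upper=upper)

print(lower, upper)
print(salary[is_outlier].tolist())
print(capped.tolist())
```

With Q1 = 56,250 and Q3 = 75,500, the fences are 27,375 and 104,375, so only the 1,000,000 salary is flagged.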
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['Salary'] < lower) | (df['Salary'] > upper)]
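print(outliers)
# Option 1: drop rows outside the IQR fences (.copy() avoids chained-assignment warnings later)
df = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)].copy()
# Option 2: Winsorization caps extreme values instead of dropping rows
from scipy.stats.mstats import winsorize
df['Salary'] = np.asarray(winsorize(df['Salary'].to_numpy(), limits=[0.05, 0.05]))
Step 5 – Feature Scaling & Normalization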
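Features on very different scales can dominate distance‑based and gradient‑based models, so numeric columns are usually rescaled before training.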
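The three scalers behave quite differently when a column contains an outlier. The sketch below applies each one to the same hypothetical five-value column so the results can be compared side by side.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one extreme value.
x = np.array([[50.0], [55.0], [60.0], [62.0], [1000.0]])

standard = StandardScaler().fit_transform(x)  # zero mean, unit variance
minmax = MinMaxScaler().fit_transform(x)      # rescaled to [0, 1]
robust = RobustScaler().fit_transform(x)      # (x - median) / IQR

print(np.round(standard.ravel(), 3))
print(np.round(minmax.ravel(), 3))
print(np.round(robust.ravel(), 3))
```

Under min-max scaling the four typical values are squeezed close to 0 because the outlier defines the top of the range, while robust scaling keeps them spread out around the median.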
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
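df[['Age','Salary']] = scaler.fit_transform(df[['Age','Salary']])  # zero mean, unit variance
# Alternatives to StandardScaler: choose one scaler rather than applying all three in sequence
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()  # rescales each feature to [0, 1]
age_salary_minmax = minmax.fit_transform(df[['Age','Salary']])
from sklearn.preprocessing import RobustScaler
robust = RobustScaler()  # centers on the median and scales by the IQR, so outliers matter less
age_salary_robust = robust.fit_transform(df[['Age','Salary']])
Step 6 – Feature Construction & Transformation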
Creating informative features often yields the biggest accuracy gains.
# Date‑time features
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
# Polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(df[['Age','Experience']])
# Binning a numeric variable (the bin edges are in years, so bin Age before scaling it)
df['Age_Group'] = pd.cut(df['Age'], bins=[0,18,35,60,100],
                         labels=['Teen','Young','Adult','Senior'])
Step 7 – Feature Selection
Selecting truly important features reduces over‑fitting and speeds up training.
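As a rough illustration of univariate selection, the sketch below builds a synthetic frame in which only one column carries information about the target, then checks which column `SelectKBest` keeps. The column names and the data are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = np.array([0, 1] * 20)

# Synthetic features: "signal" tracks the target, the others are pure noise.
X = pd.DataFrame({
    "signal": y + rng.normal(scale=0.1, size=40),
    "noise_a": rng.normal(size=40),
    "noise_b": rng.normal(size=40),
})

selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)

# get_support() returns a boolean mask over the input columns.
selected = X.columns[selector.get_support()].tolist()

print(dict(zip(X.columns, np.round(selector.scores_, 2))))
print(selected)
```

Mapping the mask from `get_support()` back to column names is what makes the selection interpretable, since `fit_transform` alone returns an unlabeled array.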
import seaborn as sns
corr = df.corr(numeric_only=True)
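sns.heatmap(corr, annot=True)
# Univariate selection: keep the k features with the highest ANOVA F-scores
from sklearn.feature_selection import SelectKBest, f_classif
# Keep numeric and boolean columns only; Date and Age_Group cannot be scored directly
X = df.drop('Target', axis=1).select_dtypes(include=['number', 'bool'])
y = df['Target']
selector = SelectKBest(score_func=f_classif, k=min(5, X.shape[1]))
X_new = selector.fit_transform(X, y)
# Recursive feature elimination with a linear model
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=min(5, X.shape[1]))
X_rfe = rfe.fit_transform(X, y)
Full Pipeline Example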
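Combining preprocessing steps into a Scikit‑learn Pipeline helps prevent data leakage, because every transformer is fit on the training split only and then applied unchanged to the test split.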
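One way to see the leakage argument in practice is cross-validation: when the whole pipeline is passed to `cross_val_score`, the scaler is refit inside each fold on that fold's training rows. The sketch below uses a synthetic dataset from `make_classification` as a stand-in, since the article's seven-row frame is too small to cross-validate meaningfully.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Each fold fits the scaler and the classifier on its own training rows only.
scores = cross_val_score(pipe, X, y, cv=5)

print(scores)
print(scores.mean())
```

Scaling the full dataset first and cross-validating only the classifier would let validation rows influence the scaler's mean and variance, which is the kind of leakage the pipeline avoids.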
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
numeric_features = ['Age','Salary','Experience']
categorical_features = ['Gender','City']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])
# df must be the raw DataFrame from Step 1, with the original Gender and City columns;
# the pipeline handles imputation, encoding, and scaling itself
X = df.drop('Target', axis=1)
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model_pipeline.fit(X_train, y_train)
print("Model trained successfully")
Conclusion
Feature engineering is the foundation of any successful machine‑learning project. Clean, transformed, and informative features often matter more than the choice of a sophisticated algorithm. Following the steps above gives a systematic way to improve model accuracy and robustness, though the actual gain depends on the data and should be confirmed on held‑out data.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.