Hands‑On Feature Engineering with Pandas and Scikit‑Learn: Complete Code Walkthrough

This article walks through a full feature‑engineering pipeline using Pandas and Scikit‑Learn, covering data inspection, missing‑value imputation, categorical encoding, outlier handling, scaling, feature construction, selection, and a final Pipeline that prepares clean, predictive features for a logistic‑regression model.


What Is Feature Engineering?

Feature engineering transforms raw data into meaningful input variables that improve model performance. Clean, well‑engineered features often matter more than the choice of algorithm.

Step 1 – Exploratory Data Analysis (EDA)

Before engineering features, the data is inspected to understand its shape, missing values, and distributions.

import pandas as pd
import numpy as np

np.random.seed(42)

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 35, 120, 28],
    'Salary': [50000, 60000, 55000, 80000, np.nan, 1000000, 62000],
    'Gender': ['Male', 'Female', 'Female', np.nan, 'Male', 'Male', 'Female'],
    'City': ['NY', 'LA', 'NY', 'SF', np.nan, 'LA', 'SF'],
    'Experience': [1, 3, 2, 10, 7, 25, 4],
    'Date': pd.date_range(start='2024-01-01', periods=7),
    'Target': [0, 1, 0, 1, 0, 1, 0]
})

print(df.head())       # first rows
df.info()              # dtypes and non-null counts (info() prints directly, so no print() wrapper)
print(df.describe())   # summary statistics of the numeric columns

Missing‑value counts are checked with df.isnull().sum(), and histograms of the numeric columns are plotted with Matplotlib.
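
For reference, the missing‑value check mentioned above is a one‑liner:

print(df.isnull().sum())   # number of missing values per column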

import matplotlib.pyplot as plt
df.hist(figsize=(12, 10))
plt.show()

Step 2 – Missing‑Value Imputation

Missing values degrade model accuracy, so three imputation strategies are demonstrated: median for numeric columns, most‑frequent for categorical columns, and KNN‑based imputation.

from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer

# Strategy 1: fill numeric gaps with the column median
num_imputer = SimpleImputer(strategy='median')
df['Age'] = num_imputer.fit_transform(df[['Age']])

# Strategy 2: fill categorical gaps with the most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['Gender', 'City']] = cat_imputer.fit_transform(df[['Gender', 'City']])

# Strategy 3: KNN imputation for the remaining numeric columns
knn_imputer = KNNImputer(n_neighbors=5)
numeric_cols = df.select_dtypes(include=['int64', 'float64'])
df[numeric_cols.columns] = knn_imputer.fit_transform(numeric_cols)
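
One small extension worth knowing about (not part of the original walkthrough): SimpleImputer can also emit a binary "was missing" flag via its add_indicator option, because the fact that a value was missing sometimes carries signal of its own. A minimal, self‑contained sketch on a toy column:

# Toy column with one gap, imputed with the median plus a 0/1 missing-value indicator
raw_age = pd.DataFrame({'Age': [25, 30, np.nan, 40]})
flagged = SimpleImputer(strategy='median', add_indicator=True)
print(flagged.fit_transform(raw_age))
# two output columns: the imputed Age and the missing-value flag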

Step 3 – Categorical Encoding

Most scikit‑learn models accept only numeric inputs, so the categorical columns are converted to numbers.

from sklearn.preprocessing import LabelEncoder

# Binary category -> 0/1 (LabelEncoder is intended for targets; OrdinalEncoder or
# one-hot encoding is usually preferred for features, but it is fine for a binary column)
encoder = LabelEncoder()
df['Gender'] = encoder.fit_transform(df['Gender'])

One‑Hot encoding is shown both with Pandas and Scikit‑Learn.

df = pd.get_dummies(df, columns=['City'], drop_first=True)

from sklearn.preprocessing import OneHotEncoder
# The scikit-learn equivalent; the encoder is only instantiated here and is fit
# inside the ColumnTransformer of the final pipeline in Step 8
ohe = OneHotEncoder(handle_unknown='ignore')
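
To see what handle_unknown='ignore' buys you, here is a small illustrative sketch (the toy city values are made up for the example): a category never seen during fit is encoded as an all‑zero row instead of raising an error.

# Fit on three known cities, then transform a frame containing an unseen one
demo = pd.DataFrame({'City': ['NY', 'LA', 'SF']})
demo_ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # sparse_output needs scikit-learn >= 1.2
demo_ohe.fit(demo)
print(demo_ohe.transform(pd.DataFrame({'City': ['LA', 'Boston']})))
# the 'Boston' row comes out as all zeros rather than raising an error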

Step 4 – Outlier Detection and Handling

Outliers can distort learning.

Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['Salary'] < lower) | (df['Salary'] > upper)]
print(outliers)

Detected outliers can either be dropped or capped: row removal is shown first, with Winsorization (capping the extreme 5% on each tail) as an alternative that keeps every row.

# Option A: drop rows outside the IQR fences
df = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)]

# Option B (alternative to dropping): cap the extreme 5% on each tail
from scipy.stats.mstats import winsorize
df['Salary'] = winsorize(df['Salary'], limits=[0.05, 0.05])
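
A lighter‑weight capping option (not in the original code) is pandas' own clip, which simply pins values to the IQR fences computed above:

# Cap Salary at the IQR fences instead of dropping rows or winsorizing
df['Salary'] = df['Salary'].clip(lower=lower, upper=upper)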

Step 5 – Feature Scaling and Normalization

Features with very different magnitudes can dominate distance‑based and gradient‑based models, so three scalers are demonstrated.

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# In practice you pick one scaler; all three are applied in sequence here purely
# to show the API, so each call overwrites the previous result.

# Standardization: zero mean, unit variance
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])

# Min-max scaling: squeeze values into [0, 1]
minmax = MinMaxScaler()
df[['Age', 'Salary']] = minmax.fit_transform(df[['Age', 'Salary']])

# Robust scaling: center on the median, scale by the IQR (less sensitive to outliers)
robust = RobustScaler()
df[['Age', 'Salary']] = robust.fit_transform(df[['Age', 'Salary']])
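
One caveat worth making explicit (the full pipeline in Step 8 handles this automatically): scalers should be fit on the training split only and then applied to the test split, otherwise test‑set statistics leak into preprocessing. A minimal sketch, assuming X_train and X_test already exist:

# Fit on the training data only, then reuse the learned statistics on the test data
train_only_scaler = StandardScaler()
X_train_scaled = train_only_scaler.fit_transform(X_train)
X_test_scaled = train_only_scaler.transform(X_test)   # no fit here: avoids data leakage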

Step 6 – Feature Construction and Transformation

New, potentially informative features are constructed from the existing columns: date components, polynomial terms, and binned age groups.

df['Date'] = pd.to_datetime(df['Date'])

# Date decomposition
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

# Polynomial and interaction terms (Age^2, Age * Experience, ...)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(df[['Age', 'Experience']])

# Binning a continuous variable into groups; the bin edges assume Age is still
# expressed in years, so in a real workflow binning belongs before scaling
df['Age_Group'] = pd.cut(
    df['Age'],
    bins=[0, 18, 35, 60, 100],
    labels=['Teen', 'Young', 'Adult', 'Senior']
)
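
Two small, optional additions that fit naturally here (they are this walkthrough's suggestions, not the original code): inspecting what PolynomialFeatures actually generated, and deriving the day of week from the date.

# Names of the generated polynomial terms: 1, Age, Experience, Age^2, Age*Experience, Experience^2
print(poly.get_feature_names_out(['Age', 'Experience']))

# Another common date-derived feature: day of week (0 = Monday)
df['DayOfWeek'] = df['Date'].dt.dayofweek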

Step 7 – Feature Selection

Selecting truly important features reduces over‑fitting and speeds up training.

import seaborn as sns

# Correlation heatmap of the numeric columns
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True)
plt.show()

# Univariate selection: keep the k features with the highest ANOVA F-scores.
# f_classif needs purely numeric inputs, so the raw Date and categorical Age_Group are dropped.
from sklearn.feature_selection import SelectKBest, f_classif
X = df.drop(['Target', 'Date', 'Age_Group'], axis=1)
y = df['Target']
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)

# Recursive feature elimination with a logistic-regression estimator
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)
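
Both selectors can report which columns they kept, which is usually more useful than the transformed arrays themselves; after fitting, the standard attributes look like this:

# Boolean masks / rankings aligned with X.columns
print(X.columns[selector.get_support()])    # columns kept by SelectKBest
print(X.columns[rfe.support_])              # columns kept by RFE
print(dict(zip(X.columns, rfe.ranking_)))   # 1 = selected; larger = eliminated earlier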

Step 8 – End‑to‑End Pipeline

A Scikit‑Learn Pipeline combines preprocessing and model training in a single object, so every transformation is learned from the training split only, mitigating data‑leakage risk. The pipeline works on the original raw columns (Age, Salary, Experience, Gender, City) and performs imputation, scaling, and encoding internally.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# The pipeline expects the raw, unprocessed columns; if you have been following along
# cell by cell, re-create df from the Step 1 snippet before running this block.
X = df.drop('Target', axis=1)
y = df['Target']

numeric_features = ['Age', 'Salary', 'Experience']
categorical_features = ['Gender', 'City']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model_pipeline.fit(X_train, y_train)
print("Model trained successfully")

Conclusion

Feature engineering is the foundation of successful machine‑learning projects. By systematically cleaning, transforming, constructing, and selecting features, the resulting dataset yields higher accuracy and robustness than merely switching to a more complex algorithm.
