Hands‑On Feature Engineering with Pandas and Scikit‑Learn: Complete Code Walkthrough
This article walks through a full feature‑engineering pipeline using Pandas and Scikit‑Learn, covering data inspection, missing‑value imputation, categorical encoding, outlier handling, scaling, feature construction, selection, and a final Pipeline that prepares clean, predictive features for a logistic‑regression model.
What Is Feature Engineering?
Feature engineering transforms raw data into meaningful input variables that improve model performance. Clean, well‑engineered features often matter more than the choice of algorithm.
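As a quick sketch of the idea (toy numbers, invented for illustration), a constructed ratio feature can expose a pattern that neither raw column shows on its own:

```python
import pandas as pd

# Toy data: salary and years of experience
df = pd.DataFrame({'Salary': [50000, 60000, 80000, 62000],
                   'Experience': [1, 3, 10, 4]})

# Salary earned per year of experience -- a constructed feature a linear
# model could not derive from the raw columns on its own
df['Salary_per_Year'] = df['Salary'] / df['Experience']
print(df['Salary_per_Year'].tolist())  # [50000.0, 20000.0, 8000.0, 15500.0]
```

The rest of the article builds up such transformations step by step.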
Step 1 – Exploratory Data Analysis (EDA)
Before engineering features, the data is inspected to understand its shape, missing values, and distributions.
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 35, 120, 28],
    'Salary': [50000, 60000, 55000, 80000, np.nan, 1000000, 62000],
    'Gender': ['Male', 'Female', 'Female', np.nan, 'Male', 'Male', 'Female'],
    'City': ['NY', 'LA', 'NY', 'SF', np.nan, 'LA', 'SF'],
    'Experience': [1, 3, 2, 10, 7, 25, 4],
    'Date': pd.date_range(start='2024-01-01', periods=7),
    'Target': [0, 1, 0, 1, 0, 1, 0]
})
print(df)
print(df.head())
print(df.info())
print(df.describe())

Missing‑value counts are displayed with df.isnull().sum(), and a histogram of each numeric column is plotted with Matplotlib.

print(df.isnull().sum())
import matplotlib.pyplot as plt
df.hist(figsize=(12, 10))
plt.show()

Step 2 – Missing‑Value Imputation
Missing values degrade model accuracy, so three imputation strategies are demonstrated: median for numeric columns, most‑frequent for categoricals, and KNN.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
df['Age'] = imputer.fit_transform(df[['Age']])

cat_imputer = SimpleImputer(strategy='most_frequent')
df['City'] = cat_imputer.fit_transform(df[['City']])
df['Gender'] = cat_imputer.fit_transform(df[['Gender']])  # Gender also has a missing value

from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=5)
numeric_cols = df.select_dtypes(include=['int64', 'float64'])
df[numeric_cols.columns] = knn_imputer.fit_transform(numeric_cols)

Step 3 – Categorical Encoding
Most models accept only numeric inputs, so categorical columns must be converted to numbers.
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# LabelEncoder is designed for target labels; OrdinalEncoder is the usual
# choice for input features, but it works here for a binary column
df['Gender'] = encoder.fit_transform(df['Gender'])

One‑Hot encoding is shown both with Pandas and Scikit‑Learn.
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore')
city_encoded = ohe.fit_transform(df[['City']])  # Scikit‑Learn version; tolerates unseen categories

df = pd.get_dummies(df, columns=['City'], drop_first=True)  # Pandas version

Step 4 – Outlier Detection and Handling
Outliers can distort learning.
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['Salary'] < lower) | (df['Salary'] > upper)]
print(outliers)

Detected outliers are removed, and Winsorization is applied as an alternative.

df = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)]

# Alternative: cap extreme values instead of dropping rows
from scipy.stats.mstats import winsorize
df['Salary'] = winsorize(df['Salary'], limits=[0.05, 0.05])

Step 5 – Feature Scaling and Normalization
Features on very different scales can dominate distance‑based and gradient‑based models, so scaling is applied. Three common scalers are demonstrated.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Each scaler is fitted on a copy so the three can be compared; in practice pick one
df_standard = df.copy()
df_standard[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])

from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
df_minmax = df.copy()
df_minmax[['Age', 'Salary']] = minmax.fit_transform(df[['Age', 'Salary']])

from sklearn.preprocessing import RobustScaler
robust = RobustScaler()
df_robust = df.copy()
df_robust[['Age', 'Salary']] = robust.fit_transform(df[['Age', 'Salary']])

Step 6 – Feature Construction and Transformation
New informative features are created.
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(df[['Age', 'Experience']])

df['Age_Group'] = pd.cut(
    df['Age'],
    bins=[0, 18, 35, 60, 100],
    labels=['Teen', 'Young', 'Adult', 'Senior']
)

Step 7 – Feature Selection
Selecting truly important features reduces over‑fitting and speeds up training.
import seaborn as sns
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True)
plt.show()

from sklearn.feature_selection import SelectKBest, f_classif

# f_classif requires numeric inputs, so keep only the numeric columns
X = df.drop('Target', axis=1).select_dtypes(include='number')
y = df['Target']
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])  # names of the selected features

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)

Step 8 – End‑to‑End Pipeline
A Scikit‑Learn Pipeline combines preprocessing and model training, mitigating data‑leakage risk.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# The pipeline expects the raw columns, so it should be fitted on the
# original dataframe (before the manual encoding/scaling steps above)
X = df.drop('Target', axis=1)
y = df['Target']
numeric_features = ['Age', 'Salary', 'Experience']
categorical_features = ['Gender', 'City']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model_pipeline.fit(X_train, y_train)
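The fitted pipeline can also be evaluated end to end. The snippet below is a self‑contained sketch on invented toy data (the column values are illustrative, not from the article's dataset), using cross_val_score so that every preprocessing step is re‑fit inside each fold:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data mirroring the article's raw columns (values invented)
df = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 35, 45, 28, 33],
    'Salary': [50000, 60000, 55000, 80000, np.nan, 90000, 62000, 58000],
    'Experience': [1, 3, 2, 10, 7, 15, 4, 5],
    'Gender': ['Male', 'Female', 'Female', np.nan, 'Male', 'Male', 'Female', 'Male'],
    'City': ['NY', 'LA', 'NY', 'SF', np.nan, 'LA', 'SF', 'NY'],
    'Target': [0, 1, 0, 1, 0, 1, 0, 1],
})
X, y = df.drop('Target', axis=1), df['Target']

numeric = Pipeline([('imputer', SimpleImputer(strategy='median')),
                    ('scaler', StandardScaler())])
categorical = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
pre = ColumnTransformer([('num', numeric, ['Age', 'Salary', 'Experience']),
                         ('cat', categorical, ['Gender', 'City'])])
pipe = Pipeline([('preprocessor', pre),
                 ('classifier', LogisticRegression(max_iter=1000))])

# Cross-validation scores the whole pipeline: imputer and scaler statistics
# are recomputed inside each training fold, so validation rows never leak
scores = cross_val_score(pipe, X, y, cv=4)
print(scores.mean())
```

Scoring the whole pipeline, rather than pre‑transformed arrays, is exactly what keeps the imputation and scaling statistics from leaking validation rows into training.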
print("Model trained successfully")

Conclusion
Feature engineering is the foundation of successful machine‑learning projects. By systematically cleaning, transforming, constructing, and selecting features, the resulting dataset yields higher accuracy and robustness than merely switching to a more complex algorithm.
