Master Variable Feature Engineering: IV, PSI, WOE & Dimensionality Reduction
Learn how to transform raw data into model-ready features through comprehensive preprocessing, variable derivation, dimensionality reduction, and the application of IV, PSI, and WOE metrics, with practical guidelines, thresholds, and visual examples to enhance risk model performance.
Overview
Feature engineering converts raw data into features a model can understand and use. It is the most critical part of the model development workflow and directly determines a model's performance ceiling.
1. Data Preprocessing – Preparing Clean Ingredients
Data preprocessing removes impurities from raw data. It includes three main steps:
Missing value handling: Delete variables with a high missing rate (e.g., >90%), or fill missing values using statistical measures (mean, median, mode), model‑based predictions, or a special placeholder such as -9999. In practice, special‑value filling is common.
Outlier handling: Cap outliers (e.g., replace values beyond the 1% and 99% quantiles with the respective quantile values) or bin them. Tree‑based models often tolerate outliers, while logistic regression benefits from binning.
Standardization: Align feature scales using Z‑score (mean = 0, std = 1) or Min‑Max scaling to the [0, 1] range.
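The three preprocessing steps above can be sketched with pandas. This is a minimal illustration on made-up data; the column names and the -9999 placeholder follow the conventions described above, not any specific dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: "income" has missing values, "amount" has an outlier.
df = pd.DataFrame({
    "income": [3000.0, 5000.0, np.nan, 4000.0, np.nan],
    "amount": [100.0, 120.0, 90.0, 110.0, 10000.0],
})

# 1) Missing values: fill with a special placeholder (common in practice).
df["income_filled"] = df["income"].fillna(-9999)

# 2) Outliers: cap at the 1% / 99% quantiles (winsorization).
lo, hi = df["amount"].quantile([0.01, 0.99])
df["amount_capped"] = df["amount"].clip(lower=lo, upper=hi)

# 3) Standardization: Z-score the capped amount (mean 0, std 1).
df["amount_z"] = (df["amount_capped"] - df["amount_capped"].mean()) / df["amount_capped"].std()
```

With a larger sample, the quantile caps would trim the extreme `amount` value much more aggressively; here the tiny sample is only meant to show the mechanics.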
2. Variable Feature Derivation – Creating Super Ingredients
Deriving new features from raw fields adds business insight. Common derivations include sum/average, per‑transaction metrics, ratios, and concentration measures across dimensions such as amount, count, time, date, and type. Examples:
Multi‑period counts (e.g., queries in the last 3 months, last 6 months).
Amount per transaction (e.g., loan amount / number of loans).
Debt‑to‑income ratio.
Behavioral changes (e.g., spending increase over recent months).
Cross features (e.g., age × occupation to create “recent graduate” or “near‑retirement civil servant”).
While exhaustive combinatorial generation is possible, it can cause a dimensionality explosion; therefore, domain knowledge and iterative practice are essential.
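A few of the derivations listed above can be expressed directly as column arithmetic. The field names below are illustrative assumptions, not a real schema.

```python
import pandas as pd

# Hypothetical raw fields per applicant (column names are illustrative).
raw = pd.DataFrame({
    "loan_amount_total": [12000, 30000, 4500],
    "loan_count": [4, 6, 3],
    "monthly_debt": [800, 2500, 300],
    "monthly_income": [4000, 5000, 3000],
    "queries_3m": [2, 7, 1],
    "queries_6m": [3, 12, 1],
})

derived = pd.DataFrame({
    # Amount per transaction: total loan amount / number of loans.
    "amount_per_loan": raw["loan_amount_total"] / raw["loan_count"],
    # Debt-to-income ratio.
    "dti": raw["monthly_debt"] / raw["monthly_income"],
    # Behavioral change: share of 6-month queries made in the last 3 months.
    "query_recency_ratio": raw["queries_3m"] / raw["queries_6m"],
})
```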
3. Dimensionality Reduction – Extracting the Essence
When thousands of derived features exist, reduction techniques keep only the most predictive and independent ones.
Data‑quality filtering: Remove variables with excessive missing rates (commonly >90% or >95%, with the exact cutoff set after business‑specific analysis).
IV (Information Value) screening: IV measures a feature’s predictive power; higher IV indicates stronger discrimination. Compute IV by binning a variable, calculating the proportion of good and bad customers in each bin, then summing WOE × (good% − bad%) across bins. Variables with IV < 0.02 are typically discarded; the threshold may be adjusted based on the overall IV distribution.
PSI (Population Stability Index) screening: PSI compares the distribution of a variable between the training sample and an out‑of‑time (OOT) sample. PSI < 0.1 is generally acceptable for stability; higher values suggest distribution shift and warrant removing the variable.
Correlation removal: For the remaining variables (often fewer than 300, sometimes fewer than 100), compute pairwise correlations and, within each highly correlated group, retain only the variable with the highest IV to avoid multicollinearity.
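The correlation-removal rule (keep the highest-IV variable in each correlated group) can be sketched as a greedy pass over variables sorted by IV. The function and threshold below are an illustrative assumption, not a fixed standard.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, iv, threshold=0.8):
    """Greedy selection: visit variables in descending IV order and keep one
    only if its |correlation| with every already-kept variable is below the
    threshold. `iv` maps column name -> IV value."""
    cols = sorted(df.columns, key=lambda c: iv[c], reverse=True)
    corr = df[cols].corr().abs()
    kept = []
    for c in cols:
        if all(corr.loc[c, k] < threshold for k in kept):
            kept.append(c)
    return kept

# Demo: "a" and "b" are nearly duplicates; "a" has the higher IV, so "b" is dropped.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame({"a": x,
                     "b": x + rng.normal(scale=0.05, size=500),
                     "c": rng.normal(size=500)})
kept = drop_correlated(demo, iv={"a": 0.30, "b": 0.25, "c": 0.10})
```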
4. IV Calculation Example
Consider a variable “age” binned into three groups. After counting good and bad customers per bin and overall, compute WOE for each bin as ln(good% / bad%) and then IV as the sum of (good% – bad%) × WOE. In the example, the total IV for age is 0.5468.
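The same calculation can be reproduced in a few lines of NumPy. The bin counts below are made up for illustration (so the resulting IV differs from the article's 0.5468); only the formula matches the steps described above.

```python
import numpy as np

# Illustrative counts per age bin (hypothetical, not the article's data).
good = np.array([100, 300, 100])   # good customers per bin
bad  = np.array([ 50,  30,  20])   # bad customers per bin

good_pct = good / good.sum()                     # share of all goods in each bin
bad_pct = bad / bad.sum()                        # share of all bads in each bin
woe = np.log(good_pct / bad_pct)                 # WOE per bin
iv = float(np.sum((good_pct - bad_pct) * woe))   # total IV for the variable
```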
5. PSI Calculation Example
PSI uses a similar formula to IV but compares two samples (e.g., training vs. OOT). Example calculations show PSI = 0.009 (stable) and PSI = 0.2234 (unstable).
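A minimal PSI function, assuming both samples have already been binned into matching buckets and converted to proportions (the example distributions are hypothetical):

```python
import numpy as np

def psi(expected_pct, actual_pct):
    """PSI between two binned distributions (each a list of proportions
    summing to 1): sum over bins of (actual - expected) * ln(actual / expected)."""
    e = np.asarray(expected_pct, dtype=float)
    a = np.asarray(actual_pct, dtype=float)
    return float(np.sum((a - e) * np.log(a / e)))

stable = psi([0.2, 0.3, 0.5], [0.21, 0.29, 0.50])   # nearly identical samples
shifted = psi([0.2, 0.3, 0.5], [0.40, 0.30, 0.30])  # visible distribution shift
```

`stable` falls well under the 0.1 threshold, while `shifted` exceeds it; in practice, bins with zero counts would need a small-count adjustment before taking the logarithm.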
6. WOE Binning – The Standard Language of Risk Models
WOE (Weight of Evidence) transforms a raw feature into a continuous value reflecting risk concentration. Positive WOE indicates higher good‑customer proportion; negative WOE indicates higher bad‑customer concentration. WOE offers four key benefits:
Linearizes non‑linear relationships for logistic regression.
Provides a standardized scale for comparing variable importance.
Reduces sensitivity to outliers by using bin‑level statistics.
Handles missing values by treating them as a separate bin.
WOE binning follows five principles:
Low‑cardinality categorical variables need no binning.
Bins must be ordered.
Each bin should contain at least 5% of the data (3% in special cases).
The bad rate should be monotonic across bins.
No bin may be purely good or purely bad.
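A simple WOE binning sketch using equal-frequency bins, on simulated data where the bad rate rises with the raw score (so the monotonicity principle should hold). This is an assumption-laden toy, not a production binning routine, which would also merge bins to enforce the minimum-share and no-pure-bin rules.

```python
import numpy as np
import pandas as pd

# Simulated data: continuous score, binary label (1 = bad),
# with bad rate increasing in the score by construction.
rng = np.random.default_rng(42)
score = rng.uniform(0, 100, size=1000)
label = (rng.uniform(size=1000) < score / 200).astype(int)

df = pd.DataFrame({"score": score, "bad": label})
# Equal-frequency bins: 4 bins of 25% each, well above the 5% minimum share.
df["bin"] = pd.qcut(df["score"], q=4)

grp = df.groupby("bin", observed=True)["bad"].agg(["count", "sum"])
grp["good"] = grp["count"] - grp["sum"]
good_pct = grp["good"] / grp["good"].sum()
bad_pct = grp["sum"] / grp["sum"].sum()
grp["woe"] = np.log(good_pct / bad_pct)  # positive = good-heavy bin
```

Because the bad rate rises across the ordered bins, WOE decreases monotonically from the first bin to the last, which is exactly the pattern the monotonicity principle asks for.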
7. Final Variable Selection
After the above steps, the candidate pool typically shrinks to fewer than 100 variables, which become the final set for model training. Further selection can be performed using model‑based feature importance (e.g., XGBoost or Random Forest) by discarding variables with zero importance.
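The model-based step can be sketched with scikit-learn's RandomForestClassifier (assuming scikit-learn is available; the two-feature dataset is simulated, with one informative feature and one pure-noise feature):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Simulated candidates: "signal" drives the label, "noise" does not.
rng = np.random.default_rng(7)
n = 500
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = (X["signal"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importance = pd.Series(rf.feature_importances_, index=X.columns)
kept_features = importance[importance > 0].index.tolist()  # drop zero-importance
```

Random forests tend to assign small nonzero importance even to noise features, so in practice a low percentile cutoff (or XGBoost's zero-gain features) is often used instead of an exact zero.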
Conclusion
Feature engineering, IV/PSI/WOE calculations, and systematic dimensionality reduction together form the backbone of robust financial risk scoring models. Mastering these techniques bridges business logic and statistical modeling, delivering interpretable and high‑performing models.