
Master Variable Feature Engineering: IV, PSI, WOE & Dimensionality Reduction

Learn how to transform raw data into model-ready features through comprehensive preprocessing, variable derivation, dimensionality reduction, and the application of IV, PSI, and WOE metrics, with practical guidelines, thresholds, and visual examples to enhance risk model performance.

Instant Consumer Technology Team

Overview

Feature engineering converts raw data into features a model can understand and use. It is the most critical part of the model development workflow and directly determines a model's performance ceiling.

1. Data Preprocessing – Preparing Clean Ingredients

Data preprocessing removes impurities from raw data. It includes three main steps:

Missing value handling: Delete variables with a very high missing rate (e.g., >90%), or fill missing values using statistical measures (mean, median, mode), model‑based predictions, or a special placeholder such as -9999 when the missingness itself carries meaning. In practice, special‑value filling is the most common approach.

Outlier handling: Outliers can be capped (e.g., replace values beyond the 1% and 99% quantiles with the respective quantile values) or binned. Tree‑based models often tolerate outliers, while logistic regression benefits from binning.

Standardization: Align feature scales using Z‑score (mean = 0, std = 1) or Min‑Max scaling to the [0, 1] range.
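To make these steps concrete, here is a minimal pandas sketch, assuming a generic numeric DataFrame. The thresholds (a 90% missing‑rate cutoff, 1%/99% capping, a -9999 placeholder) follow the guidelines above; the placeholder is filled after standardization so it stays clearly separated from the scaled values.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, missing_threshold: float = 0.9) -> pd.DataFrame:
    """Sketch of missing-value handling, outlier capping, and standardization."""
    df = df.copy()

    # 1. Missing values: drop variables whose missing rate exceeds the cutoff.
    missing_rate = df.isna().mean()
    df = df.drop(columns=missing_rate[missing_rate > missing_threshold].index)

    num_cols = df.select_dtypes(include=np.number).columns

    # 2. Outliers: cap numeric columns at the 1% and 99% quantiles
    #    (clip leaves NaN untouched).
    for col in num_cols:
        lo, hi = df[col].quantile([0.01, 0.99])
        df[col] = df[col].clip(lower=lo, upper=hi)

    # 3. Standardization: Z-score; pandas skips NaN in mean() and std().
    df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

    # Fill what remains with a special placeholder, a common risk-modeling choice.
    return df.fillna(-9999)
```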

2. Variable Feature Derivation – Creating Super Ingredients

Deriving new features from raw fields adds business insight. Common derivations include sum/average, per‑transaction metrics, ratios, and concentration measures across dimensions such as amount, count, time, date, and type. Examples:

Multi‑period counts (e.g., queries in the last 3 months, last 6 months).

Amount per transaction (e.g., loan amount / number of loans).

Debt‑to‑income ratio.

Behavioral changes (e.g., spending increase over recent months).

Cross features (e.g., age × occupation to create “recent graduate” or “near‑retirement civil servant”).

While exhaustive combinatorial generation is possible, it can cause a dimensionality explosion; therefore, domain knowledge and iterative practice are essential.
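As an illustration of these derivations, the pandas sketch below builds a few of them from hypothetical raw fields; every column name and value here is an assumption chosen to mirror the examples above, not a real schema.

```python
import pandas as pd

# Hypothetical raw fields (illustrative names and values only).
loans = pd.DataFrame({
    "total_loan_amount": [50_000, 12_000, 80_000],
    "loan_count":        [5, 3, 8],
    "monthly_debt":      [1_500, 400, 2_600],
    "monthly_income":    [6_000, 3_000, 7_000],
    "queries_3m":        [2, 0, 6],     # credit queries, last 3 months
    "queries_6m":        [3, 1, 11],    # credit queries, last 6 months
    "age":               [24, 41, 58],
    "occupation":        ["student", "engineer", "civil_servant"],
})

derived = pd.DataFrame({
    # Per-transaction metric: average amount per loan.
    "amount_per_loan": loans["total_loan_amount"] / loans["loan_count"],
    # Ratio: debt-to-income.
    "dti": loans["monthly_debt"] / loans["monthly_income"],
    # Behavioral change: share of 6-month queries falling in the last 3 months.
    "query_recency": loans["queries_3m"]
                     / loans["queries_6m"].where(loans["queries_6m"] > 0),
    # Cross feature: age band x occupation.
    "age_occupation": pd.cut(loans["age"], bins=[0, 30, 50, 100],
                             labels=["young", "mid", "senior"]).astype(str)
                      + "_" + loans["occupation"],
})
print(derived)
```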

3. Dimensionality Reduction – Extracting the Essence

When thousands of derived features exist, reduction techniques keep only the most predictive and independent ones.

Data‑quality filtering: Remove variables with excessive missing rates; common cutoffs are >90% or >95%, with the exact threshold set after business‑specific analysis.

IV (Information Value) screening: IV measures a feature’s predictive power. Higher IV indicates stronger discrimination. Compute IV by binning a variable, calculating the proportion of good and bad customers in each bin, then summing WOE × (good% – bad%). Variables with IV < 0.02 are typically discarded; thresholds may be adjusted based on the overall IV distribution.

PSI (Population Stability Index) screening: PSI compares the distribution of a variable between training and out‑of‑time (OOT) samples. PSI < 0.1 is generally acceptable for score stability; higher values suggest distribution shift and warrant variable removal.

Correlation removal: For the remaining variables (often < 300, sometimes < 100), compute pairwise correlations and, within each highly correlated group, retain only the variable with the highest IV to avoid multicollinearity, as sketched below.
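One common implementation of that last step is a greedy filter: rank variables by IV, then keep each one only if its absolute correlation with every variable already kept stays below a cutoff. A minimal sketch, assuming a Pearson cutoff of 0.7 (the exact value is a modeling choice) and an IV series indexed by the same column names as the DataFrame:

```python
import pandas as pd

def drop_correlated(X: pd.DataFrame, iv: pd.Series, threshold: float = 0.7) -> list:
    """Greedy correlation filter: visit variables from highest to lowest IV and
    keep one only if it is not highly correlated with any already-kept variable."""
    corr = X.corr().abs()
    kept = []
    for col in iv.sort_values(ascending=False).index:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return kept
```

Any variable dropped here loses out to a more predictive, strongly correlated peer, which is exactly the retain-the-highest-IV rule described above.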

4. IV Calculation Example

Consider a variable “age” binned into three groups. After counting good and bad customers per bin and overall, compute WOE for each bin as ln(good% / bad%) and then IV as the sum of (good% – bad%) × WOE. In the example, the total IV for age is 0.5468.

Figure: IV formula illustration
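The same calculation in code, using hypothetical bin counts rather than the article's exact figures. The convention matches the text: WOE = ln(good% / bad%), so a positive WOE marks a good‑heavy bin.

```python
import numpy as np
import pandas as pd

# Hypothetical good/bad counts for three "age" bins (illustrative only).
bins = pd.DataFrame({
    "bin":  ["<30", "30-45", ">45"],
    "good": [800, 1_500, 700],
    "bad":  [120, 60, 20],
})

good_pct = bins["good"] / bins["good"].sum()  # share of all goods per bin
bad_pct  = bins["bad"]  / bins["bad"].sum()   # share of all bads per bin

bins["woe"] = np.log(good_pct / bad_pct)           # WOE per bin
bins["iv"]  = (good_pct - bad_pct) * bins["woe"]   # IV contribution per bin

print(bins)
print("Total IV:", round(bins["iv"].sum(), 4))
```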

5. PSI Calculation Example

PSI uses a similar formula to IV but compares two samples (e.g., training vs. OOT). Example calculations show PSI = 0.009 (stable) and PSI = 0.2234 (unstable).

Figure: PSI examples
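A minimal implementation along the same lines; the two example distributions below are illustrative, not the article's actual data, but they land on the stable and unstable sides of the usual thresholds.

```python
import numpy as np

def psi(expected_pct: np.ndarray, actual_pct: np.ndarray) -> float:
    """PSI between two binned distributions whose percentages each sum to 1.
    expected = training sample, actual = OOT sample."""
    expected_pct = np.clip(expected_pct, 1e-6, None)  # guard against log(0)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct)
                        * np.log(actual_pct / expected_pct)))

train = np.array([0.30, 0.40, 0.30])
print(psi(train, np.array([0.31, 0.39, 0.30])))  # tiny shift: PSI well below 0.1
print(psi(train, np.array([0.10, 0.45, 0.45])))  # large shift: PSI above 0.2
```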

6. WOE Binning – The Standard Language of Risk Models

WOE (Weight of Evidence) transforms a raw feature into a continuous value reflecting risk concentration. Positive WOE indicates higher good‑customer proportion; negative WOE indicates higher bad‑customer concentration. WOE offers four key benefits:

Linearizes non‑linear relationships for logistic regression.

Provides a standardized scale for comparing variable importance.

Reduces sensitivity to outliers by using bin‑level statistics.

Handles missing values by treating them as a separate bin.

WOE binning principles include: (1) No binning for low‑cardinality categorical variables; (2) Bins must be ordered; (3) Each bin should contain at least 5% (or 3% in special cases) of the data; (4) Monotonicity of bad‑rate across bins; (5) No bin should be pure good or pure bad.

Figure: WOE binning illustration
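A simplified sketch of WOE encoding under these principles, assuming a numeric feature, a 0/1 label with 1 = bad, and equal‑frequency bins. It gives missing values their own bin, but unlike a production binning routine it does not enforce bad‑rate monotonicity or guard against pure‑good/pure‑bad bins (which would produce infinite WOE).

```python
import numpy as np
import pandas as pd

def woe_transform(x: pd.Series, y: pd.Series, n_bins: int = 5) -> pd.Series:
    """Map a numeric feature to the WOE of its equal-frequency bin.
    Assumes y holds 1 = bad customer, 0 = good customer."""
    # Quantile binning; missing values become their own bin.
    binned = pd.qcut(x, q=n_bins, duplicates="drop")
    binned = binned.cat.add_categories(["missing"]).fillna("missing")

    counts = pd.crosstab(binned, y)         # rows: bins; columns: 0 (good), 1 (bad)
    good_pct = counts[0] / counts[0].sum()  # share of all goods in each bin
    bad_pct = counts[1] / counts[1].sum()   # share of all bads in each bin
    woe = np.log(good_pct / bad_pct)        # positive WOE = good-heavy bin

    # Replace each observation's bin with that bin's WOE value.
    return binned.map(woe).astype(float)
```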

7. Final Variable Selection

After the above steps, the candidate pool typically shrinks to fewer than 100 variables, which become the final set for model training. Further selection can be performed using model‑based feature importance (e.g., XGBoost or Random Forest) by discarding variables with zero importance.
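A brief sketch of that model‑based screening with scikit-learn's RandomForestClassifier; the synthetic matrix below stands in for a real WOE‑encoded design matrix, and the zero‑importance cutoff follows the rule described above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a WOE-encoded design matrix X and a 0/1 label y.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1_000, 20)),
                 columns=[f"var_{i}" for i in range(20)])
y = (X["var_0"] + 0.5 * X["var_1"] + rng.normal(size=1_000) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

importance = pd.Series(model.feature_importances_, index=X.columns)
# Keep variables with non-zero importance as the final training set.
final_vars = importance[importance > 0].sort_values(ascending=False)
print(final_vars.head(10))
```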

Conclusion

Feature engineering, IV/PSI/WOE calculations, and systematic dimensionality reduction together form the backbone of robust financial risk scoring models. Mastering these techniques bridges business logic and statistical modeling, delivering interpretable and high‑performing models.

Tags: feature engineering, PSI, WOE, dimensionality reduction, IV, risk modeling