Scientific Data Definition, Application, Evaluation, and Explanation in Financial Risk Modeling
This presentation explores how to scientifically define, apply, evaluate, and interpret data in financial risk management, covering data alignment with business goals, feature selection, model metrics like KS and PSI, handling pandemic impacts, and methods for model explanation and improvement.
Data is often called the new energy and new productivity of the information age. Leveraging massive, complex data, especially in finance, requires scientific approaches to align data with business objectives, select appropriate methods, evaluate model performance, and interpret results.
01 Scientific Definition of Data
1. Financial risk management: The credit business transforms savings into investment; like e‑commerce recommendation or ad targeting, it aims to match fund providers and borrowers precisely by risk.
2. Scientific definition of data: Annualized risk is defined as annualized bad amount divided by annualized balance; predicting annualized risk directly is difficult, so predicting the distribution of overdue users at MOB12 (twelve months on book) is more practical.
3. Relating model predictions to annualized risk: The ratio of annualized risk to overdue rate (MOB12) near 1 indicates balanced credit limits; deviations suggest over‑ or under‑allocation.
4. Defining overdue and good users: Overdue status varies over time; a 30‑day overdue threshold (N=30) is commonly used to label bad users, while longer observation windows affect sample size and relevance.
5. Determining observation windows: Use vintage curves to find the point where the slope approaches zero; MOB = 12 is typically chosen for medium‑term risk observation.
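The two ideas above can be sketched in a few lines of Python. All numbers here are invented for illustration: the vintage curve, the slope threshold, and the bad amount/balance figures are assumptions, not figures from the talk.

```python
def annualized_risk(bad_amount: float, balance: float) -> float:
    """Annualized bad amount divided by annualized balance."""
    return bad_amount / balance

def pick_observation_window(vintage_curve, slope_threshold=0.0008):
    """Return the first MOB at which the vintage curve's incremental
    bad rate (its month-over-month slope) falls below the threshold,
    i.e., the point where the curve flattens out."""
    for mob in range(1, len(vintage_curve)):
        slope = vintage_curve[mob] - vintage_curve[mob - 1]
        if slope < slope_threshold:
            return mob
    return len(vintage_curve) - 1

# Hypothetical vintage curve: cumulative bad rate by month on book.
curve = [0.0, 0.005, 0.010, 0.014, 0.017, 0.019, 0.021,
         0.022, 0.023, 0.0236, 0.0240, 0.0243, 0.0245]

window = pick_observation_window(curve)
risk = annualized_risk(bad_amount=2.7, balance=100.0)
overdue_mob12 = curve[12]
ratio = risk / overdue_mob12   # near 1 suggests balanced credit limits
```

With this toy curve the slope flattens before MOB 12; in practice the threshold and window are business decisions, and the ratio check mirrors point 3 above: a ratio well above or below 1 suggests over‑ or under‑allocated limits.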
02 Scientific Application of Data
Data types usable in financial risk models include:
Credit reports: Historical credit records.
Internet data: Various online user data.
Third‑party fintech compliance data.
Behavior data from the product itself.
Users can be described from several perspectives:
Basic attribute portrait: Age, gender, occupation, interests, etc., derived via ML/NLP.
Behavior sequence: Time‑ordered actions, modeled with RNNs.
Social relationships: Peer income/consumption, modeled with GNNs.
Simple model and feature examples (not covered in detail):
Text data: Attention networks extract key information.
Sequential data: RNNs predict future risk from repayment behavior.
Relational data: Clustering and graph convolutional networks leverage neighbor information.
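To make the attention idea concrete, here is a toy dot‑product attention pool in pure Python. The token embeddings and query vector are made up; a real text model would learn both, but the mechanism (score tokens, softmax the scores, take the weighted average) is the same.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(token_vectors, query):
    """Score each token by its dot product with the query, softmax
    the scores into weights, and return the weighted average of the
    token vectors plus the weights themselves."""
    scores = [sum(t * q for t, q in zip(vec, query)) for vec in token_vectors]
    weights = softmax(scores)
    dim = len(token_vectors[0])
    pooled = [sum(w * vec[i] for w, vec in zip(weights, token_vectors))
              for i in range(dim)]
    return pooled, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]   # toy token embeddings
query = [1.0, 0.0]                               # direction we care about
pooled, weights = attention_pool(tokens, query)
```

Tokens aligned with the query receive higher weights, which is how attention "extracts key information" from text.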
03 Scientific Evaluation of Data
Key model evaluation metrics:
KS (Kolmogorov‑Smirnov) statistic: Measures ranking ability of good vs. bad users; offline KS may be high but can decay online due to differing user sets.
PSI (Population Stability Index): Assesses distribution stability of predicted scores over time.
Swap‑in & swap‑out analysis: Compares overall overdue rates and approval rates between old and new models under equal volume conditions.
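KS and PSI are both straightforward to compute. Below is a minimal pure‑Python sketch of each; the scores are invented, and the equal‑width binning for PSI is one common choice (equal‑population bins are also typical).

```python
import math

def ks_statistic(scores_good, scores_bad):
    """KS = max gap between the cumulative score distributions of
    good and bad users; higher means better ranking separation."""
    thresholds = sorted(set(scores_good) | set(scores_bad))
    ks = 0.0
    for t in thresholds:
        cdf_good = sum(s <= t for s in scores_good) / len(scores_good)
        cdf_bad = sum(s <= t for s in scores_bad) / len(scores_bad)
        ks = max(ks, abs(cdf_good - cdf_bad))
    return ks

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a
    later sample, using equal-width bins over the combined range."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0
    def frac(sample, b):
        n = sum(lo + b * width <= s < lo + (b + 1) * width
                or (b == bins - 1 and s == hi) for s in sample)
        return max(n / len(sample), 1e-6)   # avoid log(0)
    return sum((frac(actual, b) - frac(expected, b))
               * math.log(frac(actual, b) / frac(expected, b))
               for b in range(bins))

good_scores = [0.1, 0.2, 0.3, 0.4]
bad_scores = [0.55, 0.7, 0.8, 0.9]
ks = ks_statistic(good_scores, bad_scores)       # perfectly separated
stability = psi(good_scores, good_scores)        # identical distributions
```

A common rule of thumb (not from the talk) is PSI below 0.1 for a stable population and above 0.25 for a significant shift.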
Model stability is crucial; stable score‑to‑risk mapping (e.g., 600‑650 score ≈ 1% overdue) should hold across months.
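A stability check of this kind reduces to tabulating observed overdue rates per score band and comparing them month over month. A minimal sketch, with invented records (1 overdue user out of 100 in the 600–650 band, matching the ≈1% example):

```python
def band_overdue_rates(records, bands):
    """Observed overdue rate per score band. `records` is a list of
    (score, is_overdue) pairs; `bands` is a list of [lo, hi) ranges."""
    rates = {}
    for lo, hi in bands:
        in_band = [bad for score, bad in records if lo <= score < hi]
        rates[(lo, hi)] = sum(in_band) / len(in_band) if in_band else None
    return rates

# Invented month of data: 1 overdue and 99 good users scored 600-650.
records_jan = [(620, 1)] + [(625, 0)] * 99
rates = band_overdue_rates(records_jan, [(600, 650)])
```

Running the same tabulation on each month's cohort and comparing the per‑band rates is what "stable score‑to‑risk mapping" means operationally.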
Reject inference: Assign scores to rejected users (e.g., replicate samples with weighted labels) to enrich training data and improve model applicability.
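One standard way to implement the weighted‑replica idea is fuzzy augmentation: each rejected applicant becomes two pseudo‑samples, one bad and one good, weighted by the model's predicted bad probability. The `toy_model` below is a placeholder, not the talk's scorecard.

```python
def fuzzy_augment(rejected, score_model):
    """Fuzzy reject inference: replicate each rejected applicant as
    two weighted pseudo-samples -- labelled bad with weight p(bad)
    and good with weight 1 - p(bad) -- so rejected users contribute
    to retraining in proportion to their predicted risk."""
    augmented = []
    for features in rejected:
        p_bad = score_model(features)
        augmented.append((features, 1, p_bad))        # (x, label, weight)
        augmented.append((features, 0, 1.0 - p_bad))
    return augmented

# Hypothetical scoring function standing in for the current model.
toy_model = lambda features: min(0.99, 0.1 + 0.05 * features["inquiries"])
samples = fuzzy_augment([{"inquiries": 4}], toy_model)
```

The augmented samples are then pooled with the approved population (whose labels are observed with weight 1) before retraining.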
Customer segmentation: Hierarchical grouping by loan purpose, activity level, and industry/behavior to build specialized models when distinct differences exist.
04 Scientific Explanation of Data
Model explanation approaches:
V1 – Logistic Regression: Highly interpretable but limited feature capacity.
V2 – Decision Tree: Handles many features and non‑linearities but harder to interpret.
V3 – Two‑layer model: Sub‑models built from thousands of variables feed into a top‑level LR or shallow XGB, offering good top‑level interpretability while leveraging complex features.
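The top layer of the V3 architecture can be sketched as a small logistic regression over sub‑model scores. Everything below is illustrative: the hand‑rolled SGD trainer, the two hypothetical sub‑model scores per user, and the toy labels; the point is that interpretability lives in the learned weight on each sub‑model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_top_lr(sub_scores, labels, lr=0.5, epochs=500):
    """Fit the top-level logistic regression by plain SGD. Each row
    of sub_scores is the vector of scores the sub-models emit for
    one user; each learned weight says how much that sub-model
    contributes to the final decision."""
    w = [0.0] * len(sub_scores[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(sub_scores, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Hypothetical sub-model outputs: (credit-report score, behavior score).
X = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.3]]
y = [1, 1, 0, 0]          # 1 = bad user in this toy setup
w, b = train_top_lr(X, y)
```

In production the top layer would be a regularized LR or shallow XGB as the talk describes; the structure is the same, with each sub‑model condensing thousands of raw variables into one interpretable input.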
The session concludes with thanks to the audience.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.