Applying AI Techniques to Credit Reporting and Risk Modeling: Model Structure, Pre‑training, Ranking and Interpretability
This article presents a comprehensive overview of how AI technologies are applied to credit reporting and loan risk modeling, detailing data characteristics, end‑to‑end model architectures, pre‑training strategies, risk‑ranking methods, and interpretability techniques for financial risk assessment.
Background
Credit data in China is provided by the People's Bank of China. A credit report aggregates six major aspects, of which the four core blocks are personal basic information, loan transaction details, non‑loan credit information (e.g., housing‑fund contributions), and query records.
Credit Model Scenario
Credit models rely heavily on credit data. Traditional scoring‑card models, built on expert‑engineered features, offer good interpretability but lower performance than more complex models. Complex approaches fall into three camps: extensive feature engineering, end‑to‑end models that directly ingest raw data, and hybrid methods. This talk focuses on end‑to‑end models, which have shown the best performance of the three.
Main Content
The presentation covers four parts: (1) model structure optimization for credit data, (2) pre‑training methods to boost performance, (3) application of risk‑ranking models, and (4) interpretability of complex models.
1. Model Structure Optimization
Four progressively refined model structures for credit data (Model1–Model4) are introduced.
Model1
Model1 addresses the semi‑structured nature of credit reports by integrating numerical and categorical basic features, applying self‑attention across loan and credit‑card sequences, and using multi‑head attention for textual fields. Shallow transformer layers performed better than deeper ones due to sparse text signals.
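The self‑attention step over loan and credit‑card record sequences can be illustrated with a minimal sketch. This is not the speaker's implementation; it assumes each record has already been embedded into a fixed‑size vector, and shows only plain scaled dot‑product self‑attention in NumPy:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of record embeddings."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                  # pairwise record affinities
    scores -= scores.max(axis=1, keepdims=True)    # numeric stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ X                             # contextualized record vectors

rng = np.random.default_rng(0)
loan_seq = rng.normal(size=(5, 8))   # e.g. 5 loan records, 8-dim embeddings each
context = self_attention(loan_seq)
```

In practice this would be a multi‑head attention layer with learned query/key/value projections; the sketch keeps a single head to show how each loan record attends to every other record in the report.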
Model2
Model2 adds temporal trend modeling by encoding loan and credit‑card histories as separate sequences, then concatenating them into a unified sequence and applying a session‑level sequential model to capture time‑dependent patterns.
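The merge of the two histories into one unified sequence can be sketched as follows. The `(month, features)` record format and the source tags are assumptions for illustration, not the talk's actual schema:

```python
def merge_histories(loans, cards):
    """Tag each record with its source, then merge into one time-ordered session."""
    tagged = [(t, feats, "loan") for t, feats in loans] + \
             [(t, feats, "card") for t, feats in cards]
    return sorted(tagged, key=lambda rec: rec[0])   # unified chronological sequence

loans = [(202001, [5000]), (202006, [3000])]   # (month, feature-vector) pairs
cards = [(202003, [1200])]
session = merge_histories(loans, cards)
```

The resulting session preserves the interleaved timing of loan and card events, which is what a downstream sequential model needs to capture time‑dependent patterns.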
Model3
Model3 improves the representation of repayment sequences by nesting monthly repayment status under each loan, then cross‑integrating these with basic information to better capture repayment trends.
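A minimal sketch of turning a loan's monthly repayment‑status string into summary features is shown below. The status coding assumed here (`N` = normal, digits = months overdue, letters like `C` = other states, echoing the `N123C` codes mentioned in the Q&A) is illustrative; the talk nests the raw monthly sequence under each loan rather than flattening it:

```python
def encode_repayment(status):
    """Summarize a per-month repayment-status string into simple risk features."""
    overdue = [int(c) for c in status if c.isdigit() and c != "0"]
    return {
        "months": len(status),
        "overdue_count": len(overdue),
        "max_overdue": max(overdue, default=0),
        "recent_overdue": any(c.isdigit() and c != "0" for c in status[-6:]),
    }
```

In the end‑to‑end setting these summaries would be replaced by learned encodings of the monthly sequence, cross‑integrated with the borrower's basic information.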
Model4
Model4 leverages graph neural networks to enrich sparse textual fields (e.g., addresses, company names) by constructing an association network that links entities to related users and external knowledge, enabling richer risk signals from otherwise rare tokens.
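One round of message passing over such an association graph can be sketched with a GCN‑style mean aggregation. The toy graph (two users linked through a shared company node) and the single dense layer are assumptions for illustration only:

```python
import numpy as np

def gnn_layer(A, H, W):
    """One round of mean-aggregation message passing (GCN-style)."""
    A = A + np.eye(A.shape[0])          # self-loops: a node keeps its own signal
    deg = A.sum(axis=1, keepdims=True)
    H_agg = (A @ H) / deg               # average each node's neighborhood features
    return np.maximum(H_agg @ W, 0.0)   # linear transform + ReLU

# toy graph: nodes 0,1 are users, node 2 is a company they both list
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [1, 1, 0]], dtype=float)
H = np.random.default_rng(0).normal(size=(3, 4))   # initial node features
W = np.random.default_rng(1).normal(size=(4, 4))   # learned weights (random here)
H1 = gnn_layer(A, H, W)
```

After aggregation, a rare address or company token carries information propagated from the other users and external knowledge linked to it, rather than standing alone as a sparse feature.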
2. Pre‑training Optimization
Inspired by BERT, a masked‑language‑model style pre‑training is applied to credit reports. Because of strong intra‑feature correlations, naive masking yields poor results. The solution is to discretize and jointly encode correlated fields, then predict masked groups using a hierarchical softmax that clusters similar targets, leading to significant gains over non‑pre‑trained baselines.
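The key idea, masking correlated fields as a group rather than one at a time, can be sketched as below. The field names and group boundaries are hypothetical; the hierarchical‑softmax prediction head is omitted:

```python
import random

def mask_groups(record, groups, mask_rate=0.3, token="[MASK]"):
    """Mask correlated field groups jointly; return corrupted record + targets."""
    record = dict(record)
    targets = {}
    for group in groups:
        if random.random() < mask_rate:   # mask the whole group or none of it
            for field in group:
                targets[field] = record[field]
                record[field] = token
    return record, targets
```

Masking `amount` while leaving `term` visible would let the model cheat via their correlation; masking them jointly forces it to reconstruct the group from the rest of the report, which is what makes the pre‑training signal useful.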
3. Risk‑Ranking Model Application
Instead of optimizing classification accuracy, the model is trained to improve the ranking of risky users, as measured by AUC/KS. By treating overdue users as a sorted list (earlier overdue = higher risk), the approach better distinguishes short‑term defaults, and it can additionally incorporate distillation from pandemic‑specific models.
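A standard way to train directly for ranking is a pairwise logistic loss, an AUC surrogate. This is a generic sketch of that idea, not the speaker's exact objective; `labels` encode relative risk (higher = riskier, e.g. earlier overdue):

```python
import math

def pairwise_rank_loss(scores, labels):
    """Pairwise logistic ranking loss: push riskier users above safer ones."""
    loss, pairs = 0.0, 0
    for si, yi in zip(scores, labels):
        for sj, yj in zip(scores, labels):
            if yi > yj:                               # i should outrank j
                loss += math.log1p(math.exp(-(si - sj)))
                pairs += 1
    return loss / max(pairs, 1)
```

Minimizing this loss rewards any swap that moves a riskier user above a safer one, which is exactly what AUC/KS measure, whereas a pointwise classification loss does not directly optimize the ordering.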
4. Interpretability of Complex Models
Two main interpretability methods are discussed: Integrated Gradients (IG) and SHAP. IG accumulates gradients along a straight‑line path from a baseline input to the sample, while SHAP approximates Shapley values, with exact and efficient variants available for tree models (TreeSHAP). Both attribute predictions to individual features despite the black‑box nature of deep models.
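The IG computation can be sketched in a few lines. For illustration the gradient is supplied analytically (here for f(x) = Σxᵢ², so ∇f = 2x); a real deep model would obtain it via autograd:

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=50):
    """Attribute f(x) - f(baseline) to features via gradients on a straight path."""
    alphas = np.linspace(0.0, 1.0, steps)
    path = baseline + alphas[:, None] * (x - baseline)   # interpolate baseline -> x
    avg_grad = np.mean([grad_f(p) for p in path], axis=0)
    return (x - baseline) * avg_grad                     # per-feature attributions

x = np.array([1.0, 2.0])
attr = integrated_gradients(lambda p: 2 * p, x, np.zeros(2))
```

For this quadratic example the attributions come out to x², and they sum to f(x) − f(baseline), illustrating IG's completeness property.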
Q&A
Questions covered encoding of textual fields, handling categorical codes like N123C, sample definition (one report per user), and building neural architectures for mixed state and behavior data.
Thank you for attending.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.