Predicting Enterprise Exit Risk with 88 Features: A DataScience Club Solution
This article details a data‑science competition solution that predicts enterprise exit risk using 88 engineered features, LightGBM and XGBoost models with 5‑fold bagging, achieving a top‑25 ranking on the private leaderboard.
Background
Traditional enterprise credit assessment relies on financial statements and loan records, which are unavailable for many small and micro enterprises. The DataFountain competition provides anonymized behavioral data for millions of firms, requiring participants to predict the probability of future operational risk.
Data and Task
The dataset contains multiple behavioral footprints (event timestamps, frequencies, etc.) for a large sample of enterprises. The target is a binary label indicating whether the enterprise will experience poor operation, and participants must output a risk probability.
Solution Overview
The solution uses 88 raw features, which after one‑hot encoding result in about 60 effective columns. Feature engineering is performed with pandas, and two gradient‑boosting models (LightGBM and XGBoost) are trained. Predictions from five‑fold bagging of each model are averaged to form the final score.
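The one‑hot step mentioned above can be sketched with pd.get_dummies. The column names (industry, reg_capital) and the values are hypothetical, chosen only to illustrate the call; the competition's real schema is not shown in this article.

```python
import pandas as pd

# Hypothetical enterprise table; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "ent_id": [1, 2, 3],
    "industry": ["retail", "tech", "retail"],
    "reg_capital": [50.0, 200.0, 80.0],
})

# Replace the categorical column with one indicator column per category.
encoded = pd.get_dummies(df, columns=["industry"], prefix="industry", dtype=int)
print(encoded)
```

Numeric columns pass through unchanged, while each distinct category value becomes its own 0/1 column.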
Feature Engineering
Time‑based features: first and last occurrence timestamps for each event type, elapsed time between events.
Statistical features: counts, frequencies, rates, and other aggregations derived via groupby.
Composite features: arithmetic combinations (addition, subtraction, multiplication, division) of selected raw variables to capture non‑linear interactions.
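The three feature families above might look roughly like this in pandas. This is a minimal sketch on a made‑up event log: the column names (ent_id, event_type, event_date, amount) and the specific features are assumptions for illustration, not the 88 features used in the actual solution.

```python
import pandas as pd

# Hypothetical event log, one row per (enterprise, event). All names are assumptions.
events = pd.DataFrame({
    "ent_id": [1, 1, 1, 2, 2],
    "event_type": ["change", "change", "lawsuit", "change", "lawsuit"],
    "event_date": pd.to_datetime(
        ["2015-01-10", "2016-03-01", "2016-06-15", "2014-05-20", "2014-05-20"]
    ),
    "amount": [10.0, 20.0, 5.0, 8.0, 0.0],
})


def build_features(events: pd.DataFrame) -> pd.DataFrame:
    """Aggregate the event log into one feature row per enterprise."""
    feats = events.groupby("ent_id").agg(
        first_event=("event_date", "min"),       # time-based
        last_event=("event_date", "max"),        # time-based
        n_events=("event_type", "size"),         # statistical
        n_event_types=("event_type", "nunique"), # statistical
        amount_sum=("amount", "sum"),            # statistical
    )
    # Time-based: days elapsed between the first and last recorded event.
    feats["active_days"] = (feats["last_event"] - feats["first_event"]).dt.days
    # Statistical: event rate; the +1 avoids division by zero for single-day histories.
    feats["events_per_day"] = feats["n_events"] / (feats["active_days"] + 1)
    # Composite: ratio of two aggregated variables.
    feats["amount_per_event"] = feats["amount_sum"] / feats["n_events"]
    # Statistical: number of events of each type, one column per type.
    type_counts = pd.crosstab(events["ent_id"], events["event_type"]).add_prefix("cnt_")
    return feats.join(type_counts)


features = build_features(events)
print(features)
```

Everything here is computed with groupby aggregations and column arithmetic, so it stays vectorized. The raw timestamp columns would need to be converted to numbers or dropped before training.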
Modeling
Two models are trained independently:
LightGBM with parameters:
objective='binary', metric='auc', learning_rate=0.05, num_leaves=31, feature_fraction=0.9, bagging_fraction=0.8, bagging_freq=5
XGBoost with parameters:
objective='binary:logistic', eval_metric='auc', eta=0.05, max_depth=6, subsample=0.8, colsample_bytree=0.9
Each model is trained on five stratified folds; the out‑of‑fold predictions are stored, and the final prediction is the average of the five‑fold predictions from both models (simple ensemble).
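The fold‑level bagging and averaging can be sketched as follows. To keep the example runnable with NumPy alone, it uses a toy nearest‑centroid classifier (CentroidModel, a stand‑in that is not part of the original solution) and synthetic data. The helpers stratified_folds and bagged_predict are hypothetical names. In the real pipeline the model factory would presumably return a LightGBM or XGBoost classifier configured with the parameters listed above.

```python
import numpy as np


def stratified_folds(y, n_splits=5, seed=42):
    """Assign each sample a fold id so every fold keeps roughly the class ratio."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        fold_of[idx] = np.arange(len(idx)) % n_splits
    return fold_of


def bagged_predict(make_model, X, y, X_test, n_splits=5, seed=42):
    """Train one model per fold; return out-of-fold and averaged test predictions."""
    fold_of = stratified_folds(y, n_splits, seed)
    oof = np.zeros(len(y))
    test_pred = np.zeros(len(X_test))
    for k in range(n_splits):
        train, valid = fold_of != k, fold_of == k
        model = make_model().fit(X[train], y[train])
        oof[valid] = model.predict_proba(X[valid])[:, 1]
        test_pred += model.predict_proba(X_test)[:, 1] / n_splits
    return oof, test_pred


class CentroidModel:
    """Toy stand-in with a fit/predict_proba interface (an assumption, not the article's model)."""

    def fit(self, X, y):
        self.c0 = X[y == 0].mean(axis=0)
        self.c1 = X[y == 1].mean(axis=0)
        return self

    def predict_proba(self, X):
        d0 = np.linalg.norm(X - self.c0, axis=1)
        d1 = np.linalg.norm(X - self.c1, axis=1)
        p1 = d0 / (d0 + d1 + 1e-12)  # closer to the class-1 centroid -> higher score
        return np.column_stack([1 - p1, p1])


# Synthetic data standing in for the engineered feature matrix.
rng = np.random.default_rng(0)
y = np.array([0] * 60 + [1] * 40)
X = rng.normal(size=(100, 4)) + y[:, None] * 2.0
X_test = rng.normal(size=(10, 4))

# Two "models" (here the same toy class with different fold seeds), then a simple average.
oof_a, test_a = bagged_predict(CentroidModel, X, y, X_test, seed=42)
oof_b, test_b = bagged_predict(CentroidModel, X, y, X_test, seed=7)
final_pred = (test_a + test_b) / 2
print(final_pred.round(3))
```

Because bagged_predict only relies on fit and predict_proba, any classifier exposing that interface can be passed as make_model. The stored out‑of‑fold predictions give an estimate of validation AUC without touching the test set.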
Implementation Details
All preprocessing and feature construction are implemented with pandas, using vectorized operations where possible and lambda functions for custom per‑group or per‑row logic. One‑hot encoding is performed with pd.get_dummies. The codebase is modular: data_preprocess.py handles loading and feature creation, train.py runs the cross‑validation, and predict.py generates the submission file. The repository can be cloned with:
git clone https://github.com/YourOrg/enterprise-risk-prediction.git
Results
The described pipeline ranked within the top 25 on the private leaderboard, which suggests that the engineered features and the bagged ensemble contribute to predictive performance.
Resources
The full source code, including the feature‑engineering notebook and training scripts, is publicly available at the GitHub repository linked above.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.