Predicting Enterprise Exit Risk with 88 Features: A DataScience Club Solution

This article details a data‑science competition solution that predicts enterprise exit risk using 88 engineered features, LightGBM and XGBoost models with 5‑fold bagging, achieving a top‑25 ranking on the private leaderboard.

Baobao Algorithm Notes

Dec 20, 2017

Background

Traditional enterprise credit assessment relies on financial statements and loan records, which are unavailable for many small and micro enterprises. The DataFountain competition provides anonymized behavioral data for millions of firms and asks participants to predict the probability of future operational risk.

Data and Task

The dataset contains multiple behavioral footprints (event timestamps, frequencies, etc.) for a large sample of enterprises. The target is a binary label indicating whether the enterprise will experience poor operation, and participants must output a risk probability.

Solution Overview

The solution uses 88 features after one‑hot encoding (about 60 columns before encoding). Feature engineering is performed with pandas, and two gradient‑boosting models (LightGBM and XGBoost) are trained. Each model is bagged over five folds, and the resulting predictions are averaged to form the final score.

Feature Engineering

Time‑based features: first and last occurrence timestamps for each event type, and the elapsed time between events.

Statistical features: counts, frequencies, rates, and other aggregations derived via groupby.

Composite features: arithmetic combinations (addition, subtraction, multiplication, division) of selected raw variables to capture non‑linear interactions.
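The three feature families above can be illustrated with a minimal pandas sketch. The tables and column names (ent_id, event_type, event_date, capital, branches) are hypothetical and are not taken from the competition data; the sketch only shows the kind of groupby and arithmetic operations described.

```python
import pandas as pd

# Hypothetical event log: one row per enterprise event (column names are assumptions).
events = pd.DataFrame({
    "ent_id": [1, 1, 1, 2, 2],
    "event_type": ["change", "change", "lawsuit", "change", "lawsuit"],
    "event_date": pd.to_datetime(
        ["2014-01-01", "2014-01-11", "2014-01-31", "2015-03-01", "2015-03-08"]
    ),
})
# Hypothetical static attributes, one row per enterprise.
base = pd.DataFrame({"ent_id": [1, 2], "capital": [100.0, 50.0], "branches": [4, 0]})

# Time-based: first and last event per enterprise, and the days between them.
time_feats = events.groupby("ent_id")["event_date"].agg(
    first_event="min", last_event="max"
)
time_feats["span_days"] = (time_feats["last_event"] - time_feats["first_event"]).dt.days

# Statistical: event counts per type, a total count, and a simple rate.
counts = (
    events.groupby(["ent_id", "event_type"]).size().unstack(fill_value=0).add_prefix("n_")
)
counts["n_total"] = counts.sum(axis=1)
counts["lawsuit_rate"] = counts["n_lawsuit"] / counts["n_total"]

# Composite: arithmetic combinations of existing columns.
features = base.set_index("ent_id").join(time_feats).join(counts)
features["capital_per_event"] = features["capital"] / features["n_total"]
features["capital_x_branches"] = features["capital"] * features["branches"]

print(features)
```

Tree models can split on each raw column separately, so ratios and products like these are one way to hand them an interaction directly instead of relying on several consecutive splits.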

Modeling

Two models are trained independently:

LightGBM with parameters:

objective='binary', metric='auc', learning_rate=0.05, num_leaves=31, feature_fraction=0.9, bagging_fraction=0.8, bagging_freq=5

XGBoost with parameters:

objective='binary:logistic', eval_metric='auc', eta=0.05, max_depth=6, subsample=0.8, colsample_bytree=0.9
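As a rough sketch, the two parameter sets could be written as Python dictionaries and passed to each library's native training API. The parameter values are copied from the lists above; the number of boosting rounds (NUM_ROUNDS) and the helper names fit_predict_lgb and fit_predict_xgb are assumptions, since the article does not state them.

```python
# Parameter values as listed in the article.
LGB_PARAMS = {
    "objective": "binary",
    "metric": "auc",
    "learning_rate": 0.05,
    "num_leaves": 31,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
}
XGB_PARAMS = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "eta": 0.05,
    "max_depth": 6,
    "subsample": 0.8,
    "colsample_bytree": 0.9,
}
NUM_ROUNDS = 500  # assumption: the number of boosting rounds is not given in the article


def fit_predict_lgb(X_train, y_train, X_score):
    """Hypothetical helper: train LightGBM and return probabilities for X_score."""
    import lightgbm as lgb  # imported lazily so the dictionaries work without the library

    booster = lgb.train(
        LGB_PARAMS, lgb.Dataset(X_train, label=y_train), num_boost_round=NUM_ROUNDS
    )
    return booster.predict(X_score)


def fit_predict_xgb(X_train, y_train, X_score):
    """Hypothetical helper: train XGBoost and return probabilities for X_score."""
    import xgboost as xgb

    booster = xgb.train(
        XGB_PARAMS, xgb.DMatrix(X_train, label=y_train), num_boost_round=NUM_ROUNDS
    )
    return booster.predict(xgb.DMatrix(X_score))
```

Both configurations use the same step size (learning_rate and eta of 0.05) and similar row and column subsampling, so the two models differ mainly in how they grow trees: leaf-wise with num_leaves in LightGBM, depth-limited with max_depth in XGBoost.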

Each model is trained with five‑fold stratified cross‑validation. The out‑of‑fold predictions are stored, and each model's test prediction is the average over its five fold models. The final prediction is the simple average of the two models' predictions.
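The fold-level bagging and the final averaging can be sketched as below. This is an illustration under stated assumptions, not the competition code: bagged_predict and centroid_model are hypothetical names, the data is synthetic, and a small nearest-centroid scorer stands in for the gradient-boosting models so the example runs with only numpy and scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def bagged_predict(fit_predict, X, y, X_test, n_splits=5, seed=0):
    """Train one model per stratified fold; return out-of-fold and averaged test predictions."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = np.zeros(len(X))
    test_pred = np.zeros(len(X_test))
    for train_idx, valid_idx in skf.split(X, y):
        # Score the held-out fold and the test set with the same fold model.
        scores = fit_predict(X[train_idx], y[train_idx], np.vstack([X[valid_idx], X_test]))
        oof[valid_idx] = scores[: len(valid_idx)]
        test_pred += scores[len(valid_idx):] / n_splits
    return oof, test_pred


def centroid_model(X_train, y_train, X_score):
    """Hypothetical stand-in model: score by relative distance to the two class centroids."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_score - c0, axis=1)
    d1 = np.linalg.norm(X_score - c1, axis=1)
    return 1.0 / (1.0 + np.exp(d1 - d0))  # closer to the class-1 centroid gives a higher score


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 5)) + 2.0 * y[:, None]
X_test = rng.normal(size=(40, 5))

# Two bagged "models" (here the same stand-in with different fold seeds), then a simple average.
oof_a, test_a = bagged_predict(centroid_model, X, y, X_test, seed=0)
oof_b, test_b = bagged_predict(centroid_model, X, y, X_test, seed=1)
final = (test_a + test_b) / 2
```

Any training function with the same (X_train, y_train, X_score) signature, for example one wrapping LightGBM or XGBoost, could be passed in place of the stand-in. The out-of-fold array gives a validation estimate for each training row from a model that never saw that row.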

Implementation Details

All preprocessing and feature construction are implemented with pandas, using vectorized operations for speed and lambda functions where custom logic is needed. One‑hot encoding is performed with pd.get_dummies. The codebase is modular: data_preprocess.py handles loading and feature creation, train.py runs the cross‑validation, and predict.py generates the submission file. The repository can be cloned with:

git clone https://github.com/YourOrg/enterprise-risk-prediction.git
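The one-hot encoding step mentioned in this section can be sketched with pd.get_dummies as follows. The frames and column names (ent_id, capital, industry) are hypothetical, and the column alignment at the end is a common precaution rather than something the article describes.

```python
import pandas as pd

# Hypothetical train/test frames; column names and categories are illustrative only.
train = pd.DataFrame({
    "ent_id": [1, 2, 3],
    "capital": [100.0, 50.0, 75.0],
    "industry": ["A", "B", "C"],
})
test = pd.DataFrame({
    "ent_id": [4, 5],
    "capital": [20.0, 60.0],
    "industry": ["B", "B"],
})

# Expand the categorical column into one indicator column per category.
train_enc = pd.get_dummies(train, columns=["industry"], dtype=int)
test_enc = pd.get_dummies(test, columns=["industry"], dtype=int)

# Categories absent from the test set would otherwise be missing as columns,
# so align the test frame to the training columns and fill the gaps with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```

Encoding the training and test sets separately and then aligning them keeps the feature matrix the same width for both, which the models require at prediction time.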

Results

The described pipeline ranked within the top 25 on the private leaderboard, which suggests that the engineered features and the bagged ensemble contribute to predictive performance.

Resources

The full source code, including the feature‑engineering notebook and training scripts, is publicly available at the GitHub repository linked above.
