File Release Application Prediction Model Using GBDT
This article describes how a GBDT‑based prediction model was built to forecast file release application parameters such as volume ratio, target audience, and gray level, covering data collection, feature engineering, model training, service deployment, and practical considerations for handling bad cases.
The 360 Process Management System's file release module manages new and updated file deployments, but manual entry of volume ratios, targets, and gray levels often leads to inefficiencies. To address this, historical data and a Gradient Boosting Decision Tree (GBDT) model were employed to predict these parameters, reducing workload.
Dataset Construction: SQL queries extracted historical file release and adjustment requests. Noisy, unlabeled records were removed, resulting in a clean dataset for modeling.
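The cleaning step can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the record layout and field names (`target`, `rate`, `gray`) are hypothetical stand-ins for the real schema, and records missing any of the three prediction labels are treated as unlabeled noise and dropped.

```python
# Hypothetical label fields: target audience, volume ratio, gray level.
REQUIRED_LABELS = ("target", "rate", "gray")

def clean_dataset(records):
    """Drop records that are missing any of the prediction labels."""
    return [r for r in records if all(r.get(k) is not None for k in REQUIRED_LABELS)]

raw = [
    {"file": "update.dat", "target": "beta", "rate": 0.1, "gray": 2},
    {"file": "broken.dat", "target": None, "rate": 0.1, "gray": 2},    # unlabeled
    {"file": "full.dat", "target": "all", "rate": 1.0, "gray": None},  # unlabeled
]
cleaned = clean_dataset(raw)
print(len(cleaned))  # → 1
```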
Feature Selection and Processing: Relevant fields such as file name, release mode, current volume, ratio, V5 condition, and priority were retained, while irrelevant ones like unique release paths were discarded. Categorical features were encoded (e.g., unique IDs for file names) and numerical features were standardized, with units unified (e.g., gray level converted to a single unit).
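The two transformations above, categorical encoding and numerical standardization, can be sketched in plain Python (in practice scikit-learn's `LabelEncoder` and `StandardScaler` provide the same behavior; the example values here are hypothetical):

```python
def encode_categorical(values):
    """Assign a unique integer ID to each distinct categorical value."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping

def standardize(values):
    """Z-score numeric features so they share a common scale."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values] if std else [0.0] * len(values)

ids, mapping = encode_categorical(["a.dat", "b.dat", "a.dat"])
print(ids)  # → [0, 1, 0]
print(standardize([1.0, 2.0, 3.0]))  # symmetric around 0
```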
GBDT Algorithm: Separate models were trained for new releases and adjustments due to differing feature spaces. For each release type, three sub‑models predict target audience, volume ratio, and gray level respectively, with the ratio model also using the predicted audience as input. In total, six models were built.
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_and_test_online_datas_by_GBDT(datas, labels):
    # __split_train_test is a module-private helper (defined elsewhere) that
    # splits features and the three label columns into train/test sets.
    train_datas, test_datas, train_targets, test_targets, \
        train_rates, test_rates, train_grays, test_grays = __split_train_test(datas, labels)
    # Inspect one held-out sample and its three labels.
    print(test_datas[1], test_targets[1], test_rates[1], test_grays[1])

    print("**************target***********************")
    target_clf = GradientBoostingClassifier()
    target_clf.fit(train_datas, train_targets)
    print(target_clf.score(train_datas, train_targets))
    print(target_clf.score(test_datas, test_targets))

    print("**************rate***********************")
    # The ratio model also takes the target audience as an input feature.
    train_datas = np.insert(train_datas, 1, values=train_targets, axis=1)
    test_datas = np.insert(test_datas, 1, values=test_targets, axis=1)
    rate_clf = GradientBoostingClassifier()
    rate_clf.fit(train_datas, train_rates)
    print(rate_clf.score(train_datas, train_rates))
    print(rate_clf.score(test_datas, test_rates))

    print("**************gray***********************")
    # The gray-level model additionally takes the volume ratio as a feature.
    train_datas = np.insert(train_datas, 2, values=train_rates, axis=1)
    test_datas = np.insert(test_datas, 2, values=test_rates, axis=1)
    gray_clf = GradientBoostingClassifier()
    gray_clf.fit(train_datas, train_grays)
    print(gray_clf.score(train_datas, train_grays))
    print(gray_clf.score(test_datas, test_grays))
```

Prediction Service Deployment: The trained models were wrapped into a Tornado‑based web service, exposing APIs for the existing workflow system. Requests are cached to reduce latency, and the model is retrained weekly with new data to adapt to evolving release patterns.
Summary & Reflections: In production, occasional "bad cases" arise where predictions conflict with business logic (e.g., predicting a "formal" release instead of "full‑network"). Post‑processing rules and incremental data labeling are used to correct such errors, allowing the model to improve over time.
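A post-processing layer of this kind might look like the sketch below. The specific rule (a 100% volume ratio implies a full-network release, overriding a conflicting "formal" prediction) is a hypothetical example of the business-logic guards described above, not the production rule set.

```python
def apply_business_rules(prediction):
    """Correct model outputs that conflict with release business logic."""
    fixed = dict(prediction)
    # Bad-case guard (hypothetical rule): a 100% volume ratio means a
    # full-network release, so override a "formal" release-type prediction.
    if fixed.get("rate") == 1.0 and fixed.get("release_type") == "formal":
        fixed["release_type"] = "full-network"
    return fixed

result = apply_business_rules({"release_type": "formal", "rate": 1.0})
print(result["release_type"])  # → full-network
```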
360 Quality & Efficiency
360 Quality & Efficiency focuses on seamlessly integrating quality and efficiency in R&D, sharing 360’s internal best practices with industry peers to foster collaboration among Chinese enterprises and drive greater efficiency value.