Detecting Low‑Quality New Users in Food Delivery with a GBDT + LR Model
The article describes a data‑driven approach for identifying low‑value new users in a food‑delivery platform by labeling 7‑day repeat‑purchase behavior, extracting order, behavior, merchant and user features, and training a combined Gradient Boosted Decision Tree and Logistic Regression model to improve fraud detection and merchant penalty decisions.
Background In the food‑delivery scenario, acquiring new users is costly, and some merchants use low‑price or fraudulent tactics to inflate their KPIs and collect subsidies. The platform therefore needs to identify low‑value new users so that the merchants responsible can be penalized where warranted.
Key Terms "Acquisition" (拉新) means bringing new users onto the platform by any channel. "Repeat Purchase" (复购) means a user orders again on a later day than the first order; multiple orders placed on the same day count as a single purchase.
Quality Judgment Standard The metric is the 7‑day repeat‑purchase rate of new users, calculated as the number of new users who purchase at least twice within seven days (excluding multiple orders on the same day) divided by the total number of new users. Merchants whose rate falls below a threshold are considered inefficient at acquisition.
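The metric above can be sketched in a few lines of Python. This is a minimal illustration, not the team's implementation: the function names, the toy order dates, and the reading of "within seven days" as "1 to 7 days after the first order" are all assumptions.

```python
from datetime import date

def has_repeat_purchase(order_dates, window_days=7):
    """True if the user ordered again on a later calendar day within the window.

    Assumption: the window covers 1..window_days days after the first order.
    """
    if not order_dates:
        return False
    first = min(order_dates)
    # Same-day orders have a gap of 0 days, so they never count as a repeat.
    return any(0 < (d - first).days <= window_days for d in order_dates)

def repeat_purchase_rate(orders_by_user, window_days=7):
    """Share of new users with a repeat purchase; 0.0 if there are no users."""
    if not orders_by_user:
        return 0.0
    repeaters = sum(
        has_repeat_purchase(dates, window_days) for dates in orders_by_user.values()
    )
    return repeaters / len(orders_by_user)

# Toy data for one hypothetical merchant (user id -> order dates).
orders_by_user = {
    "u1": [date(2024, 3, 1), date(2024, 3, 1)],   # two orders on the same day
    "u2": [date(2024, 3, 1), date(2024, 3, 4)],   # repeat three days later
    "u3": [date(2024, 3, 1), date(2024, 3, 12)],  # second order outside the window
    "u4": [date(2024, 3, 2)],                     # single order
}
rate = repeat_purchase_rate(orders_by_user)
print(rate)
```

Only `u2` counts as a repeat purchaser here, so the toy merchant's rate is one in four. A merchant would then be compared against whatever threshold the platform chooses.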
Overall Process The workflow includes data collection, labeling, feature extraction, model training, and deployment, as illustrated in the accompanying diagram.
Data Collection and Labeling Training samples are gathered from the past three months, labeling a user as 1 if they repeat‑purchase within seven days, otherwise 0. Test samples are collected from the previous week with the same binary labels.
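One way to picture the time‑based sampling is the sketch below. It assumes that the test week is the seven days before a reference date and that the training window is the roughly three months (90 days here) before that week; the article does not state how the two windows relate, and the field names and sample records are hypothetical.

```python
from datetime import date, timedelta

def split_by_first_order_date(samples, reference_date, test_days=7, train_days=90):
    """Split labeled samples into (train, test) by each user's first-order date.

    Assumption: test = the last `test_days` days before `reference_date`,
    train = the `train_days` days before the test window. Older samples are dropped.
    """
    test_start = reference_date - timedelta(days=test_days)
    train_start = test_start - timedelta(days=train_days)
    train, test = [], []
    for sample in samples:
        first_order = sample["first_order_date"]
        if test_start <= first_order < reference_date:
            test.append(sample)
        elif train_start <= first_order < test_start:
            train.append(sample)
    return train, test

# Toy samples; label is 1 if the user repeat-purchased within seven days, else 0.
samples = [
    {"user_id": "u1", "first_order_date": date(2024, 1, 10), "label": 1},
    {"user_id": "u2", "first_order_date": date(2024, 3, 20), "label": 0},
    {"user_id": "u3", "first_order_date": date(2024, 3, 28), "label": 1},
    {"user_id": "u4", "first_order_date": date(2023, 11, 1), "label": 0},
]
train, test = split_by_first_order_date(samples, reference_date=date(2024, 4, 1))
print([s["user_id"] for s in train], [s["user_id"] for s in test])
```

Splitting by time rather than at random keeps the test set closer to how the model would be used in production, where it scores users newer than anything it was trained on.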
Feature Extraction Features are derived from orders, user behavior, merchant attributes, and user profiles.
Model Selection A Gradient Boosted Decision Tree (GBDT) is first trained to generate leaf‑node identifiers, which serve as high‑dimensional features for a Logistic Regression (LR) model. This GBDT + LR combination leverages GBDT’s ability to capture non‑linear interactions and automatically produce useful feature combinations for LR.
Training Steps 1. Train a GBDT model on the labeled training data. 2. Pass each sample through the trained GBDT; in every tree the sample lands in exactly one leaf, and each leaf corresponds to one dimension of the LR feature vector (1 for the leaf reached, 0 otherwise). 3. Train an LR model on these derived features.
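The three steps can be sketched with scikit‑learn as follows. The article does not say which library or hyperparameters the team used, so the library choice, the synthetic data from `make_classification`, and settings such as `n_estimators=50` are assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the real order/behavior/merchant/user features.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 1: train the GBDT on the labeled training data.
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gbdt.fit(X_train, y_train)

def leaf_indices(model, features):
    """Index of the leaf each sample reaches in every tree: (n_samples, n_trees)."""
    # apply() returns (n_samples, n_estimators, 1) for binary classification.
    return model.apply(features)[:, :, 0]

# Step 2: one-hot encode the leaf indices, giving one dimension per leaf.
encoder = OneHotEncoder(handle_unknown="ignore")
leaves_train = encoder.fit_transform(leaf_indices(gbdt, X_train))

# Step 3: train the LR model on the leaf features.
lr = LogisticRegression(max_iter=1000)
lr.fit(leaves_train, y_train)

# Scoring: the same GBDT -> encoder -> LR chain yields a probability per sample.
leaves_test = encoder.transform(leaf_indices(gbdt, X_test))
repeat_prob = lr.predict_proba(leaves_test)[:, 1]
print(leaves_train.shape, repeat_prob[:5])
```

Each training row has exactly one active dimension per tree, so the encoded matrix is sparse and its width equals the total number of distinct leaves reached across all trees.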
Advantages of GBDT + LR The GBDT component discovers discriminative feature interactions and combinations that linear models alone cannot capture, reducing reliance on manual feature engineering while maintaining interpretability for the LR layer.
Deployment and Outcome The final repeat‑purchase probability model scores each merchant's new users; users whose predicted probability falls below a set threshold are flagged as unlikely to repurchase, which enables targeted penalties for merchants whose new users are low quality.
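A minimal sketch of the flagging step might look like the following. The threshold value of 0.2, the record fields, and the per‑merchant aggregation are hypothetical; the article only states that users below a set threshold are flagged.

```python
from collections import defaultdict

def flagged_share_by_merchant(predictions, prob_threshold=0.2):
    """For each merchant, the share of new users predicted unlikely to repurchase.

    `prob_threshold` is a hypothetical cut-off, not a value from the article.
    """
    total = defaultdict(int)
    flagged = defaultdict(int)
    for p in predictions:
        total[p["merchant_id"]] += 1
        if p["repeat_prob"] < prob_threshold:
            flagged[p["merchant_id"]] += 1
    return {merchant: flagged[merchant] / total[merchant] for merchant in total}

# Toy model outputs: one record per new user, with the acquiring merchant.
predictions = [
    {"user_id": "u1", "merchant_id": "m1", "repeat_prob": 0.05},
    {"user_id": "u2", "merchant_id": "m1", "repeat_prob": 0.10},
    {"user_id": "u3", "merchant_id": "m1", "repeat_prob": 0.60},
    {"user_id": "u4", "merchant_id": "m2", "repeat_prob": 0.70},
    {"user_id": "u5", "merchant_id": "m2", "repeat_prob": 0.15},
]
shares = flagged_share_by_merchant(predictions)
print(shares)
```

In this toy data, two of the three new users of `m1` and one of the two new users of `m2` fall below the cut‑off. How large a share should trigger a penalty is a policy decision the article leaves open.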
Summary and Outlook The current model achieves 97% precision and 43% recall in identifying low‑repeat‑purchase merchants. Although precision is high, recall is modest, prompting future work to incorporate additional dimensions and improve detection coverage.
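For readers less familiar with the two metrics, the sketch below shows how precision and recall are computed for a set of flagged merchants. The merchant ids are toy data chosen for illustration and are unrelated to the figures reported above.

```python
def precision_recall(flagged, truly_low_quality):
    """Precision and recall of a flagged set against the true low-quality set."""
    true_positives = len(flagged & truly_low_quality)
    # Precision: of the merchants flagged, how many really are low quality.
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: of the truly low-quality merchants, how many were flagged.
    recall = true_positives / len(truly_low_quality) if truly_low_quality else 0.0
    return precision, recall

# Toy example: every flagged merchant is correct, but half are missed.
flagged = {"m1", "m2"}
truly_low_quality = {"m1", "m2", "m3", "m4"}
precision, recall = precision_recall(flagged, truly_low_quality)
print(precision, recall)
```

High precision with modest recall therefore means that flagged merchants are almost always genuine cases, while many genuine cases still go unflagged.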
Author Introduction The author is a core member of the Baidu Waimai Risk Control Team and has been responsible for risk‑control algorithms and strategies since 2015, focusing on merchant and BD (business development) risk mitigation.
Baidu Waimai Technology Team
The Baidu Waimai Technology Team supports and drives the company's business growth. This account provides a platform for engineers to communicate, share, and learn. Follow us for team updates, top technical articles, and internal/external open courses.