Predicting Membership Purchase with Logistic Regression: Feature Engineering, Model Training, Evaluation, and Deployment
This article presents a complete workflow for predicting whether users will purchase a membership using logistic regression. It covers data collection, feature selection, handling of imbalanced samples, model training, hyper‑parameter tuning, threshold optimization, evaluation metrics (accuracy, precision, recall, AUC, and lift), and finally deployment on a big‑data platform with PySpark.
1. Background
The paid membership model is a common monetization method on the Internet: it builds user loyalty and stickiness and helps e‑commerce applications increase revenue. To improve user experience and support fine‑grained operations, it is essential to identify high‑value users among a massive population for targeted membership conversion. This article applies logistic regression to user behavior features (login, protocol, product, transaction, etc.) to predict each user's probability of purchasing a membership, so that outreach can focus on high‑probability users.
2. Solution Design
2.1 Model Selection
The problem is a binary classification task (purchase vs. non‑purchase). Logistic regression is chosen for its strong business interpretability; the trained model provides a linear prediction formula useful for downstream analysis.
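To make that interpretability concrete: a fitted logistic regression is just a linear score passed through a sigmoid, p = 1 / (1 + exp(−(w·x + b))), so the coefficients read directly as a prediction formula. A minimal sketch on synthetic data (the two features here are illustrative, not the article's real ones):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: two illustrative behavior features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# The fitted model is an explicit linear formula passed through a sigmoid
w, b = model.coef_[0], model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-(X @ w + b)))

# This matches sklearn's predict_proba for the positive class
assert np.allclose(manual_proba, model.predict_proba(X)[:, 1])
```

Because the formula is explicit, downstream analysts can see exactly how much each behavior feature moves the predicted probability.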
2.2 Implementation Steps
The main analysis and deployment workflow includes:
Feature engineering
Data preprocessing
Model generation and evaluation
Model prediction
Model effect collection
3. Implementation Details
3.1 Feature Selection
Based on membership benefits and business understanding, a set of features (see the figure) is selected. For positive samples (already members), feature calculation windows are set before the purchase; for negative samples (non‑members), windows are set up to the current time.
3.2 Data Preprocessing
Data Collection
SQL is used to aggregate massive data and arrange the indicators as columns.
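As a rough Python analogue of that SQL step, a long table of per-user indicators can be pivoted so each indicator becomes a column (the table layout and column names below are hypothetical):

```python
import pandas as pd

# Hypothetical long-format metrics: one row per (user, indicator) pair
long_df = pd.DataFrame({
    'user_id':   [1, 1, 2, 2],
    'indicator': ['_7d_login_days', 'protocol_cnt'] * 2,
    'value':     [5, 2, 3, 7],
})

# Pivot so each indicator becomes its own column, as the SQL aggregation does
wide_df = long_df.pivot(index='user_id', columns='indicator', values='value').reset_index()
print(wide_df)
```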
Imbalanced Sample Handling
Since non‑member samples far outnumber member samples, random undersampling is performed to balance the classes.
import numpy as np
import pandas as pd
# df_vip: member samples
# df_non_vip: non‑member samples
df_non_vip = df_non_vip.sample(n=df_vip.shape[0], replace=False, random_state=555)
df = pd.concat([df_vip, df_non_vip])

3.3 Model Generation and Evaluation
Effective Feature Screening
From an initial set of 47 features, high‑collinearity variables are removed, leaving the final feature list used for modeling.
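One simple way to screen out high‑collinearity variables, sketched here on synthetic data, is to drop one feature from every pair whose absolute pairwise correlation exceeds a threshold (the 0.9 cutoff below is an illustrative choice, not the article's):

```python
import numpy as np
import pandas as pd

# Synthetic frame with one near-duplicate (highly collinear) column
rng = np.random.default_rng(1)
df = pd.DataFrame({'a': rng.normal(size=200)})
df['b'] = df['a'] * 2 + rng.normal(scale=0.01, size=200)  # nearly duplicates 'a'
df['c'] = rng.normal(size=200)

# Keep only the upper triangle of |corr| so each pair is checked once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair with |correlation| above the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
kept = [c for c in df.columns if c not in to_drop]
print(kept)  # 'b' is flagged because it nearly duplicates 'a'
```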
# Feature variables:
feature_columns = [
'_7d_login_days',
'_7d_login_cnt',
'buy_service_cnt',
'self_visit_days',
'protocol_cnt',
'month_avg_wc_trd_yuan'
]
# Target variable:
target_columns = ['is_vip']
columns = feature_columns + target_columns
feature_df = df[columns]

Train‑Test Split
from sklearn.model_selection import train_test_split
X_columns = feature_columns
Y_columns = target_columns
X_train = df[X_columns].values
Y_train = df[Y_columns].values.ravel()
train_x, test_x, train_y, test_y = train_test_split(X_train, Y_train, test_size=0.3, random_state=666)

Logistic Regression Training
Without hyper‑parameter tuning, the baseline model achieves accuracy 0.73, precision 0.75, recall 0.72.
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(class_weight='balanced', C=1, penalty='l2', max_iter=2000, random_state=666)
log_reg.fit(train_x, train_y)
pred_y_log = log_reg.predict(test_x)
pred_y_proba_log = log_reg.predict_proba(test_x)

Grid Search Hyper‑Parameter Tuning
Grid search optimizes for recall; the best recall 0.98 is obtained with C = 0.0001, penalty = l2. The optimal decision threshold is found to be 0.555 (instead of the default 0.5).
from sklearn.model_selection import GridSearchCV, StratifiedKFold
# liblinear supports both the l1 and l2 penalties searched below
model = LogisticRegression(class_weight='balanced', solver='liblinear', max_iter=2000, random_state=666)
C = [0.0001,0.001,0.01,0.1,1,10,100,1000]
penalty = ['l1','l2']
param_grid = dict(C=C, penalty=penalty)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=666)
grid_search = GridSearchCV(model, param_grid=param_grid, scoring='recall', cv=kfold, n_jobs=-1)
grid_result = grid_search.fit(train_x, train_y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Threshold Determination
from sklearn.metrics import roc_curve

# Compute KS and the best threshold
fpr, tpr, thresholds = roc_curve(test_y, pred_y_proba_log[:, 1], pos_label=1)
KS_max = 0
best_thr = 0
for i in range(len(fpr)):
    if i == 0:
        KS_max = tpr[i] - fpr[i]
        best_thr = thresholds[i]
    elif tpr[i] - fpr[i] > KS_max:
        KS_max = tpr[i] - fpr[i]
        best_thr = thresholds[i]
print('Maximum KS:', KS_max)
print('Best threshold:', best_thr)

# Retrain and apply the optimal threshold
log_reg = LogisticRegression(class_weight='balanced', C=0.0001, penalty='l2', max_iter=2000, random_state=666)
log_reg.fit(train_x, train_y)
pred_y_proba_log = log_reg.predict_proba(test_x)
pred_y_log = [1 if prob >= 0.555 else 0 for prob in pred_y_proba_log[:, 1]]

Model Evaluation
On the test set, the model achieves accuracy 0.83, precision 0.80, recall 0.88, F1 0.83, and AUC 0.90. Lift values range from 1.22 to 1.67.
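Lift can be computed by sorting users into score deciles and comparing each decile's positive rate to the overall positive rate. A sketch on synthetic data (the real computation would use test_y and pred_y_proba_log from above):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic scored test set (stand-in for test_y / pred_y_proba_log)
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=1000) > 0).astype(int)
proba = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

# Lift per score decile: positive rate in the decile / overall positive rate
scored = pd.DataFrame({'y': y, 'proba': proba})
scored['decile'] = pd.qcut(scored['proba'], 10, labels=False, duplicates='drop')
overall_rate = scored['y'].mean()
lift = scored.groupby('decile')['y'].mean() / overall_rate
print(lift.sort_index(ascending=False))  # higher deciles should show lift above 1
```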
from sklearn.metrics import classification_report
print('Logistic Regression:\n', classification_report(test_y, pred_y_log))

3.4 Model Prediction Deployment
The model is deployed through a big‑data scheduling tool using PySpark, so it can score massive data volumes and push predictions to downstream systems such as the CRM and tagging platforms.
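One common way to hand the trained model to such a scheduled job is to serialize it with pickle on the training side and deserialize it inside the PySpark job before scoring. A minimal sketch (Spark‑specific wiring, such as broadcasting the model to executors, is omitted):

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a stand-in model (in production this would be the tuned model above)
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression(max_iter=2000).fit(X, y)

# Serialize once on the training side...
blob = pickle.dumps(model)

# ...and deserialize inside the scheduled job before scoring
restored = pickle.loads(blob)
assert np.allclose(model.predict_proba(X), restored.predict_proba(X))
```

Inside the Spark job the deserialized model is typically shared with the executors (for example via a broadcast variable) and applied row‑wise through a UDF.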
from pyhive import presto
import numpy as np
import pandas as pd
import os, datetime
from datetime import timedelta
import time, re, json, sys, pyhdfs
from math import log
import sklearn
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder, StandardScaler, DictVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.cluster import KMeans
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split, learning_curve
from sklearn.metrics import (accuracy_score, brier_score_loss, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_curve, auc)
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, lit, monotonically_increasing_id
from pyspark.sql.types import StringType, ArrayType
import pickle
# Feature columns used for prediction
feature_columns = ['_7d_login_days', '_7d_login_cnt', 'buy_service_cnt', 'self_visit_days', 'protocol_cnt', 'month_avg_wc_GMV_yuan']

3.5 Model Effect Collection
After the model went live, its real‑world performance and accuracy were evaluated against business‑side outreach results, which confirmed a good effect.
3.6 Conclusion
The logistic regression algorithm provides accurate predictions that empower business decisions, identifies key features influencing membership purchase, and suggests future iterations such as user segmentation, profiling, and integration with CRM for targeted marketing.
ZCY Technology
ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.