Predicting Membership Purchase with Logistic Regression: Feature Engineering, Model Training, Evaluation, and Deployment
This article presents a complete workflow for predicting whether users will purchase a membership using logistic regression. It covers data collection, feature selection, handling of imbalanced samples, model training, hyper‑parameter tuning, threshold optimization, evaluation metrics (accuracy, precision, recall, AUC, and lift), and finally deployment on a big‑data platform with PySpark.
1. Background
The paid membership model is a common monetization method on the Internet: it builds user loyalty and stickiness and helps e‑commerce applications increase revenue. To improve user experience and support fine‑grained operations, it is essential to identify high‑value users among a massive population for targeted membership conversion. This article applies logistic regression to user behavior features (login, protocol, product, transaction, etc.) to predict each user's probability of purchasing a membership, so that outreach can focus on high‑probability users.
2. Solution Design
2.1 Model Selection
The problem is a binary classification task (purchase vs. non‑purchase). Logistic regression is chosen for its strong business interpretability; the trained model provides a linear prediction formula useful for downstream analysis.
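To make that interpretability concrete: a fitted logistic regression is just a linear score passed through a sigmoid, p = 1 / (1 + exp(−(w·x + b))), so the coefficients read directly as a prediction formula. A minimal sketch on synthetic data (the two features here are illustrative, not the article's real ones):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: two illustrative behavior features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# The fitted model is an explicit linear formula passed through a sigmoid
w, b = model.coef_[0], model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-(X @ w + b)))

# This matches sklearn's predict_proba for the positive class
assert np.allclose(manual_proba, model.predict_proba(X)[:, 1])
```

Because the formula is explicit, downstream analysts can see exactly how much each behavior feature moves the predicted probability.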
2.2 Implementation Steps
The main analysis and deployment workflow includes:
Feature engineering
Data preprocessing
Model generation and evaluation
Model prediction
Model effect collection
3. Implementation Details
3.1 Feature Selection
Based on membership benefits and business understanding, a set of features (see the figure) is selected. For positive samples (already members), feature calculation windows are set before the purchase; for negative samples (non‑members), windows are set up to the current time.
3.2 Data Preprocessing
Data Collection
SQL is used to aggregate massive data and arrange the indicators as columns.
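As a rough Python analogue of that SQL step, a long table of per-user indicators can be pivoted so each indicator becomes a column (the table layout and column names below are hypothetical):

```python
import pandas as pd

# Hypothetical long-format metrics: one row per (user, indicator) pair
long_df = pd.DataFrame({
    'user_id':   [1, 1, 2, 2],
    'indicator': ['_7d_login_days', 'protocol_cnt'] * 2,
    'value':     [5, 2, 3, 7],
})

# Pivot so each indicator becomes its own column, as the SQL aggregation does
wide_df = long_df.pivot(index='user_id', columns='indicator', values='value').reset_index()
print(wide_df)
```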
Imbalanced Sample Handling
Since non‑member samples far outnumber member samples, random undersampling is performed to balance the classes.
import numpy as np
import pandas as pd
# df_vip: member samples
# df_non_vip: non‑member samples
df_non_vip = df_non_vip.sample(n=df_vip.shape[0], replace=False, random_state=555)
df = pd.concat([df_vip, df_non_vip])

3.3 Model Generation and Evaluation
Effective Feature Screening
From an initial set of 47 features, high‑collinearity variables are removed, leaving the final feature list used for modeling.
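One simple way to screen out high‑collinearity variables, sketched here on synthetic data, is to drop one feature from every pair whose absolute pairwise correlation exceeds a threshold (the 0.9 cutoff below is an illustrative choice, not the article's):

```python
import numpy as np
import pandas as pd

# Synthetic frame with one near-duplicate (highly collinear) column
rng = np.random.default_rng(1)
df = pd.DataFrame({'a': rng.normal(size=200)})
df['b'] = df['a'] * 2 + rng.normal(scale=0.01, size=200)  # nearly duplicates 'a'
df['c'] = rng.normal(size=200)

# Keep only the upper triangle of |corr| so each pair is checked once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair with |correlation| above the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
kept = [c for c in df.columns if c not in to_drop]
print(kept)  # 'b' is flagged because it nearly duplicates 'a'
```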
# Feature variables:
feature_columns = [
'_7d_login_days',
'_7d_login_cnt',
'buy_service_cnt',
'self_visit_days',
'protocol_cnt',
'month_avg_wc_trd_yuan'
]
# Target variable:
target_columns = ['is_vip']
columns = feature_columns + target_columns
feature_df = df[columns]

Train‑Test Split
from sklearn.model_selection import train_test_split
X_columns = feature_columns
Y_columns = target_columns
X_train = df[X_columns].values
Y_train = df[Y_columns].values.ravel()
train_x, test_x, train_y, test_y = train_test_split(X_train, Y_train, test_size=0.3, random_state=666)

Logistic Regression Training
Without hyper‑parameter tuning, the baseline model achieves accuracy 0.73, precision 0.75, recall 0.72.
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(class_weight='balanced', C=1, penalty='l2', max_iter=2000, random_state=666)
log_reg.fit(train_x, train_y)
pred_y_log = log_reg.predict(test_x)
pred_y_proba_log = log_reg.predict_proba(test_x)

Grid Search Hyper‑Parameter Tuning
Grid search optimizes for recall; the best recall 0.98 is obtained with C = 0.0001, penalty = l2. The optimal decision threshold is found to be 0.555 (instead of the default 0.5).
from sklearn.model_selection import GridSearchCV, StratifiedKFold
# liblinear supports both the l1 and l2 penalties searched below
model = LogisticRegression(class_weight='balanced', solver='liblinear', max_iter=2000, random_state=666)
C = [0.0001,0.001,0.01,0.1,1,10,100,1000]
penalty = ['l1','l2']
param_grid = dict(C=C, penalty=penalty)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=666)
grid_search = GridSearchCV(model, param_grid=param_grid, scoring='recall', cv=kfold, n_jobs=-1)
grid_result = grid_search.fit(train_x, train_y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Threshold Determination
from sklearn.metrics import roc_curve

# Compute KS and the best threshold
fpr, tpr, thresholds = roc_curve(test_y, pred_y_proba_log[:, 1], pos_label=1)
KS_max = 0
best_thr = 0
for i in range(len(fpr)):
    if i == 0:
        KS_max = tpr[i] - fpr[i]
        best_thr = thresholds[i]
    elif tpr[i] - fpr[i] > KS_max:
        KS_max = tpr[i] - fpr[i]
        best_thr = thresholds[i]
print('Maximum KS:', KS_max)
print('Best threshold:', best_thr)

# Retrain and apply the optimal threshold
log_reg = LogisticRegression(class_weight='balanced', C=0.0001, penalty='l2', max_iter=2000, random_state=666)
log_reg.fit(train_x, train_y)
pred_y_proba_log = log_reg.predict_proba(test_x)
pred_y_log = [1 if prob >= 0.555 else 0 for prob in pred_y_proba_log[:, 1]]

Model Evaluation
On the test set, the model achieves accuracy 0.83, precision 0.80, recall 0.88, F1 0.83, and AUC 0.90. Lift values range from 1.22 to 1.67.
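Lift can be computed by sorting users into score deciles and comparing each decile's positive rate to the overall positive rate. A sketch on synthetic data (the real computation would use test_y and pred_y_proba_log from above):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic scored test set (stand-in for test_y / pred_y_proba_log)
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=1000) > 0).astype(int)
proba = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

# Lift per score decile: positive rate in the decile / overall positive rate
scored = pd.DataFrame({'y': y, 'proba': proba})
scored['decile'] = pd.qcut(scored['proba'], 10, labels=False, duplicates='drop')
overall_rate = scored['y'].mean()
lift = scored.groupby('decile')['y'].mean() / overall_rate
print(lift.sort_index(ascending=False))  # higher deciles should show lift above 1
```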
from sklearn.metrics import classification_report
print('Logistic Regression:\n', classification_report(test_y, pred_y_log))

3.4 Model Prediction Deployment
The model is deployed through a big‑data scheduling tool using PySpark, so it can score massive data volumes and push predictions to downstream systems such as the CRM and tagging platforms.
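One common way to hand the trained model to such a scheduled job is to serialize it with pickle on the training side and deserialize it inside the PySpark job before scoring. A minimal sketch (Spark‑specific wiring, such as broadcasting the model to executors, is omitted):

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a stand-in model (in production this would be the tuned model above)
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression(max_iter=2000).fit(X, y)

# Serialize once on the training side...
blob = pickle.dumps(model)

# ...and deserialize inside the scheduled job before scoring
restored = pickle.loads(blob)
assert np.allclose(model.predict_proba(X), restored.predict_proba(X))
```

Inside the Spark job the deserialized model is typically shared with the executors (for example via a broadcast variable) and applied row‑wise through a UDF.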
from pyhive import presto
import numpy as np
import pandas as pd
import os, datetime
from datetime import timedelta
import time, re, json, sys, pyhdfs
from math import log
import sklearn
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder, StandardScaler, DictVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.cluster import KMeans
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split, learning_curve
from sklearn.metrics import (accuracy_score, brier_score_loss, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_curve, auc)
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, lit, monotonically_increasing_id
from pyspark.sql.types import StringType, ArrayType
import pickle
# Feature columns used for prediction
feature_columns = ['_7d_login_days', '_7d_login_cnt', 'buy_service_cnt', 'self_visit_days', 'protocol_cnt', 'month_avg_wc_GMV_yuan']

3.5 Model Effect Collection
After the model went live, its real‑world performance and accuracy were evaluated against business‑side outreach results, which confirmed a good effect.
3.6 Conclusion
The logistic regression algorithm provides accurate predictions that empower business decisions, identifies key features influencing membership purchase, and suggests future iterations such as user segmentation, profiling, and integration with CRM for targeted marketing.
ZCY Technology
ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.