How to Build a Customer Churn Warning Model with R and Discover
This article demonstrates a step‑by‑step workflow for constructing a churn prediction model using R in Discover, covering data loading, preprocessing, feature extraction, labeling, random‑forest training, prediction, and evaluation to help businesses proactively retain high‑value customers.
Early churn warning lets enterprises identify at-risk, high-value customers and intervene before they leave.
Data Loading
In a Discover Notebook, select R as the interpreter and load the built‑in dataset losingWarn, which contains three columns: trade_time (transaction timestamp), user_id (customer ID), and consume_amt (transaction amount).
library(discoverR)
library(ggplot2)
discover.init()
data("losingWarn")
df_losingWarn <- createDataFrame(losingWarn)
Data Processing and Analysis
Convert the transaction time to month and year, filter for the year 2014, and compute monthly transaction counts and amounts.
df_losingWarn <- df_losingWarn %>%
    mutate(trade_month = month(df_losingWarn$trade_time)) %>%
    mutate(trade_year = year(df_losingWarn$trade_time))
df_losingWarn <- df_losingWarn %>% filter("trade_year == 2014")
Summarize per-user monthly statistics:
colName <- c("user_id", "trade_month", "consume_amt")
df_user_summary <- select(df_losingWarn, as.list(colName)) %>%
    groupBy("user_id", "trade_month") %>%
    summarize(trade_num = n(df_losingWarn$user_id),
              trade_amt = sum(df_losingWarn$consume_amt))
Summarize overall monthly totals:
colName <- c("trade_month", "trade_num", "trade_amt")
df_trade_summary <- select(df_user_summary, as.list(colName)) %>%
    groupBy("trade_month") %>%
    summarize(sum_trade_num = sum(df_user_summary$trade_num),
              sum_trade_amt = sum(df_user_summary$trade_amt))
Visualize the transaction amount distribution (log-scaled) and the monthly customer count trend.
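The original article does not show the plotting code. A minimal ggplot2 sketch (ggplot2 is already loaded above) could look like the following; it assumes a SparkR-style collect() is available to pull df_trade_summary into a local data.frame, and the column names follow the summary built above.

```r
# Sketch only: assumes collect() materializes the distributed summary locally.
local_summary <- collect(df_trade_summary)

# Monthly total transaction amount, log-scaled to tame the skew
ggplot(local_summary, aes(x = trade_month, y = sum_trade_amt)) +
    geom_col() +
    scale_y_log10() +
    labs(x = "Month", y = "Total transaction amount (log scale)")

# Monthly transaction count trend
ggplot(local_summary, aes(x = trade_month, y = sum_trade_num)) +
    geom_line() +
    geom_point() +
    labs(x = "Month", y = "Transaction count")
```

A log scale on the amount axis is useful here because a few heavy spenders would otherwise compress the rest of the distribution.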
Feature Extraction
Define a GetFeature function that extracts, for a given time window, the last trade month, maximum amount, maximum transaction count, total amount, total count, and their averages.
GetFeature <- function(data, startMonth, endMonth) {
    result <- data %>%
        filter(paste("trade_month >=", startMonth, "and trade_month <=", endMonth))
    month <- as.integer(endMonth) - as.integer(startMonth) + 1
    out <- result %>% groupBy("user_id") %>% summarize(
        lastTradeMonth = max(result$trade_month),
        maxAMT = max(result$trade_amt),
        maxNum = max(result$trade_num),
        sumAMT = sum(result$trade_amt),
        sumNum = sum(result$trade_num)
    )
    out %>% mutate(
        averageAMT = out$sumAMT / month,
        averageNum = out$sumNum / month
    )
}
Labeling
Define a GetLabel function that labels users who transacted in the observation window as "remain". Note that users with no transactions in the window have no rows after the filter, so they get no label here; they are marked "losing" later, when the left join in GetModelData leaves their label column null.
GetLabel <- function(data, startMonth, endMonth) {
    out <- data %>%
        filter(paste("trade_month >=", startMonth, "and trade_month <=", endMonth)) %>%
        groupBy("user_id") %>%
        summarize(sumNum = sum(data$trade_num))
    out <- out %>% mutate(label = ifelse(out$sumNum > 0, "remain", "losing")) %>%
        select(as.list(c("user_id", "label")))
    names(out) <- c("id", "label")
    out
}
Model Building
Combine features from the training window (June-September, months 6-9) with labels from the two months that follow it (October-November), then train a random-forest classifier.
GetModelData <- function(data, startMonth, endMonth, observeSize) {
    feature <- GetFeature(data, startMonth, endMonth)
    labelStartMonth <- endMonth + 1
    labelEndMonth <- endMonth + observeSize
    label <- GetLabel(data, labelStartMonth, labelEndMonth)
    out <- join(feature, label, feature$user_id == label$id, "left_outer") %>% drop("id")
    out %>% mutate(label = ifelse(out$label == "remain", "remain", "losing"))
}
training_data <- GetModelData(df_user_summary, 6, 9, 2)
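To make the window arithmetic concrete, a small base-R helper (hypothetical, not part of the original code) shows which months feed the features and which feed the labels for each call to GetModelData:

```r
# Hypothetical helper mirroring GetModelData's window arithmetic:
# features come from [startMonth, endMonth], labels from the observeSize
# months immediately after endMonth.
windows <- function(startMonth, endMonth, observeSize) {
    c(featureStart = startMonth, featureEnd = endMonth,
      labelStart = endMonth + 1, labelEnd = endMonth + observeSize)
}
windows(6, 9, 2)   # training: features Jun-Sep, labels Oct-Nov
windows(7, 10, 2)  # test:     features Jul-Oct, labels Nov-Dec
```

Shifting the whole window by one month for the test set keeps the feature and observation periods the same length, so the two datasets are directly comparable.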
rf_model <- txRandomForest(data = training_data,
                           formula = label ~ lastTradeMonth + maxAMT + maxNum + sumAMT + sumNum + averageAMT + averageNum,
                           type = "classification")
Prediction
Build the test set the same way, shifting the window forward one month (features July-October, labels November-December), then apply the model to obtain predictions.
test_data <- GetModelData(df_user_summary, 7, 10, 2)
result <- predict(rf_model, test_data)
showDF(result)
Model Evaluation
Evaluate using ROC AUC and inspect feature importance.
areaUnderROC <- txBinaryClassificationEvaluator(result, metricName = "areaUnderROC",
                                                probabilityCol = "probability",
                                                labelCol = "label",
                                                labels = summary(rf_model)$labels)
areaUnderROC
variable_importance <- importance(rf_model)
variable_importance
The evaluation shows that averageNum and sumNum are the most influential features, confirming the earlier hypothesis that a user's average and total transaction counts strongly affect retention.
Conclusion
This case study outlines the complete analytical pipeline for building a churn warning model in Discover, from data ingestion to model evaluation, illustrating how Discover’s built‑in functions simplify predictive analytics and help enterprises implement timely marketing actions to improve customer value.
StarRing Big Data Open Lab
Focused on big data technology research, exploring the Big Data era | [email protected]