How to Build a Customer Churn Warning Model with R and Discover

This article demonstrates a step‑by‑step workflow for constructing a churn prediction model using R in Discover, covering data loading, preprocessing, feature extraction, labeling, random‑forest training, prediction, and evaluation to help businesses proactively retain high‑value customers.

StarRing Big Data Open Lab

Churn early warning lets enterprises identify valuable customers who are at risk of leaving and intervene before they are lost.

Data Loading

In a Discover Notebook, select R as the interpreter and load the built‑in dataset losingWarn, which contains three columns: trade_time (transaction timestamp), user_id (customer ID), and consume_amt (transaction amount).

library(discoverR)   # Discover's R interface
library(ggplot2)
discover.init()      # connect to the Discover runtime

data("losingWarn")                            # built-in sample dataset
df_losingWarn <- createDataFrame(losingWarn)  # convert to a distributed DataFrame

Data Processing and Analysis

Convert the transaction time to month and year, filter for the year 2014, and compute monthly transaction counts and amounts.

# Derive month and year from the transaction timestamp,
# then keep only transactions from 2014.
df_losingWarn <- df_losingWarn %>%
  mutate(trade_month = month(df_losingWarn$trade_time),
         trade_year  = year(df_losingWarn$trade_time)) %>%
  filter("trade_year == 2014")

Summarize per‑user monthly statistics:

colName <- c("user_id", "trade_month", "consume_amt")
df_user_summary <- select(df_losingWarn, as.list(colName)) %>%
  groupBy("user_id", "trade_month") %>%
  summarize(trade_num = n(df_losingWarn$user_id),
            trade_amt = sum(df_losingWarn$consume_amt))

Summarize overall monthly totals:

colName <- c("trade_month", "trade_num", "trade_amt")
df_trade_summary <- select(df_user_summary, as.list(colName)) %>%
  groupBy("trade_month") %>%
  summarize(sum_trade_num = sum(df_user_summary$trade_num),
            sum_trade_amt = sum(df_user_summary$trade_amt))

Visualize transaction amount distribution (log‑scaled) and customer count trends.

[Figure: Transaction Amount Distribution]
[Figure: Customer Number Trend]
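The original shows these charts only as images. The following is a minimal plotting sketch, assuming the summaries are first pulled to the driver with collect() and that total monthly transaction counts stand in as a proxy for the customer trend (column names as defined above):

```r
# Collect the distributed summaries into local data frames for ggplot2.
local_user  <- collect(df_user_summary)
local_trade <- collect(df_trade_summary)

# Transaction amount distribution on a log-scaled x axis.
ggplot(local_user, aes(x = trade_amt)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +
  labs(title = "Transaction Amount Distribution",
       x = "Monthly amount (log scale)", y = "Users")

# Monthly trend of total transactions.
ggplot(local_trade, aes(x = trade_month, y = sum_trade_num)) +
  geom_line() + geom_point() +
  labs(title = "Customer Number Trend", x = "Month", y = "Transactions")
```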

Feature Extraction

Define a GetFeature function that extracts, for a given time window, the last trade month, maximum amount, maximum transaction count, total amount, total count, and their averages.

GetFeature <- function(data, startMonth, endMonth) {
  # Keep only the months inside the feature window.
  result <- data %>%
    filter(paste("trade_month >=", startMonth, "and trade_month <=", endMonth))
  month <- as.integer(endMonth) - as.integer(startMonth) + 1  # window length in months
  # Per-user aggregates over the window.
  out <- result %>% groupBy("user_id") %>% summarize(
    lastTradeMonth = max(result$trade_month),
    maxAMT = max(result$trade_amt),
    maxNum = max(result$trade_num),
    sumAMT = sum(result$trade_amt),
    sumNum = sum(result$trade_num)
  )
  # Monthly averages over the window.
  out %>% mutate(
    averageAMT = out$sumAMT / month,
    averageNum = out$sumNum / month
  )
}

Labeling

Define a GetLabel function that labels a user "remain" if the user has at least one transaction in the observation window. Users with no activity in that window do not appear in the filtered data at all; they receive the "losing" label later, via the left outer join in GetModelData.

GetLabel <- function(data, startMonth, endMonth) {
  # Aggregate activity inside the observation window.
  out <- data %>% filter(paste("trade_month >=", startMonth, "and trade_month <=", endMonth)) %>%
    groupBy("user_id") %>%
    summarize(sumNum = sum(data$trade_num))
  # Any activity in the window means the user remained.
  out <- out %>% mutate(label = ifelse(out$sumNum > 0, "remain", "losing")) %>%
    select(as.list(c("user_id", "label")))
  names(out) <- c("id", "label")
  out
}

Model Building

Combine features and labels for the training window: features come from June–September, and labels are observed over the following two months, October–November (observeSize = 2). Then train a random-forest classifier.

GetModelData <- function(data, startMonth, endMonth, observeSize) {
  feature <- GetFeature(data, startMonth, endMonth)
  # The observation window starts right after the feature window.
  labelStartMonth <- endMonth + 1
  labelEndMonth <- endMonth + observeSize
  label <- GetLabel(data, labelStartMonth, labelEndMonth)
  # Left outer join: users with no activity in the observation window get a NULL label.
  out <- join(feature, label, feature$user_id == label$id, "left_outer") %>% drop("id")
  # NULL labels fall through to the ELSE branch and become "losing".
  out %>% mutate(label = ifelse(out$label == "remain", "remain", "losing"))
}

training_data <- GetModelData(df_user_summary, 6, 9, 2)
rf_model <- txRandomForest(data = training_data,
                           formula = label ~ lastTradeMonth + maxAMT + maxNum + sumAMT + sumNum + averageAMT + averageNum,
                           type = "classification")

Prediction

Apply the model to the test window (features from July–October, labels observed in November–December) and obtain predictions.

test_data <- GetModelData(df_user_summary, 7, 10, 2)
result <- predict(rf_model, test_data)
showDF(result)
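For a quick sense of the predicted class balance, one possible check (assuming the prediction output exposes a "prediction" column, as SparkR-style predict results typically do):

```r
# Tally predicted classes; "prediction" is the assumed output column name.
result %>% groupBy("prediction") %>% count() %>% showDF()
```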

Model Evaluation

Evaluate using ROC AUC and inspect feature importance.

areaUnderROC <- txBinaryClassificationEvaluator(result, metricName = "areaUnderROC",
                                                probabilityCol = "probability",
                                                labelCol = "label",
                                                labels = summary(rf_model)$labels)
areaUnderROC

variable_importance <- importance(rf_model)
variable_importance
[Figure: ROC Curve]
[Figure: Feature Importance]
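A bar chart makes the importance ranking easier to read. This sketch assumes importance(rf_model) returns a named numeric vector; the exact structure depends on Discover's implementation:

```r
# Reshape the assumed named vector into a data frame and plot sorted bars.
imp_df <- data.frame(feature = names(variable_importance),
                     importance = as.numeric(variable_importance))

ggplot(imp_df, aes(x = reorder(feature, importance), y = importance)) +
  geom_col() +
  coord_flip() +
  labs(title = "Feature Importance", x = NULL, y = "Importance")
```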

The evaluation shows that averageNum and sumNum are the most influential features, indicating that a user's average and total transaction counts are the strongest signals of retention.

Conclusion

This case study outlines the complete analytical pipeline for building a churn warning model in Discover, from data ingestion to model evaluation, illustrating how Discover’s built‑in functions simplify predictive analytics and help enterprises implement timely marketing actions to improve customer value.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: machine learning, random forest, churn prediction, R, customer analytics, Discover
Written by

StarRing Big Data Open Lab

Focused on big data technology research, exploring the Big Data era | [email protected]