How to Build a Customer Churn Warning Model with R and Discover
This article demonstrates a step‑by‑step workflow for constructing a churn prediction model using R in Discover, covering data loading, preprocessing, feature extraction, labeling, random‑forest training, prediction, and evaluation to help businesses proactively retain high‑value customers.
Early churn warning lets enterprises identify at-risk, high-value customers and intervene before they leave.
Data Loading
In a Discover Notebook, select R as the interpreter and load the built‑in dataset losingWarn, which contains three columns: trade_time (transaction timestamp), user_id (customer ID), and consume_amt (transaction amount).
library(discoverR)
library(ggplot2)
discover.init()
data("losingWarn")
df_losingWarn <- createDataFrame(losingWarn)
Data Processing and Analysis
Convert the transaction time to month and year, filter for the year 2014, and compute monthly transaction counts and amounts.
df_losingWarn <- df_losingWarn %>%
    mutate(trade_month = month(df_losingWarn$trade_time)) %>%
    mutate(trade_year = year(df_losingWarn$trade_time))
df_losingWarn <- df_losingWarn %>% filter("trade_year == 2014")
Summarize per-user monthly statistics:
colName <- c("user_id", "trade_month", "consume_amt")
df_user_summary <- select(df_losingWarn, as.list(colName)) %>%
    groupBy("user_id", "trade_month") %>%
    summarize(trade_num = n(df_losingWarn$user_id),
              trade_amt = sum(df_losingWarn$consume_amt))
Summarize overall monthly totals:
colName <- c("trade_month", "trade_num", "trade_amt")
df_trade_summary <- select(df_user_summary, as.list(colName)) %>%
    groupBy("trade_month") %>%
    summarize(sum_trade_num = sum(df_user_summary$trade_num),
              sum_trade_amt = sum(df_user_summary$trade_amt))
Visualize the transaction amount distribution (log-scaled) and the monthly customer count trend.
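The original article does not show the plotting code. A minimal ggplot2 sketch (ggplot2 is already loaded above) could look like the following; it assumes a SparkR-style collect() is available to pull df_trade_summary into a local data.frame, and the column names follow the summary built above.

```r
# Sketch only: assumes collect() materializes the distributed summary locally.
local_summary <- collect(df_trade_summary)

# Monthly total transaction amount, log-scaled to tame the skew
ggplot(local_summary, aes(x = trade_month, y = sum_trade_amt)) +
    geom_col() +
    scale_y_log10() +
    labs(x = "Month", y = "Total transaction amount (log scale)")

# Monthly transaction count trend
ggplot(local_summary, aes(x = trade_month, y = sum_trade_num)) +
    geom_line() +
    geom_point() +
    labs(x = "Month", y = "Transaction count")
```

A log scale on the amount axis is useful here because a few heavy spenders would otherwise compress the rest of the distribution.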
Feature Extraction
Define a GetFeature function that extracts, for a given time window, the last trade month, maximum amount, maximum transaction count, total amount, total count, and their averages.
GetFeature <- function(data, startMonth, endMonth) {
    result <- data %>%
        filter(paste("trade_month >=", startMonth, "and trade_month <=", endMonth))
    month <- as.integer(endMonth) - as.integer(startMonth) + 1
    out <- result %>% groupBy("user_id") %>% summarize(
        lastTradeMonth = max(result$trade_month),
        maxAMT = max(result$trade_amt),
        maxNum = max(result$trade_num),
        sumAMT = sum(result$trade_amt),
        sumNum = sum(result$trade_num)
    )
    out %>% mutate(
        averageAMT = out$sumAMT / month,
        averageNum = out$sumNum / month
    )
}
Labeling
Define a GetLabel function that labels users who transacted in the observation window as "remain". Note that users with no transactions in the window have no rows after the filter, so they get no label here; they are marked "losing" later, when the left join in GetModelData leaves their label column null.
GetLabel <- function(data, startMonth, endMonth) {
    out <- data %>%
        filter(paste("trade_month >=", startMonth, "and trade_month <=", endMonth)) %>%
        groupBy("user_id") %>%
        summarize(sumNum = sum(data$trade_num))
    out <- out %>% mutate(label = ifelse(out$sumNum > 0, "remain", "losing")) %>%
        select(as.list(c("user_id", "label")))
    names(out) <- c("id", "label")
    out
}
Model Building
Combine features from the training window (June-September, months 6-9) with labels from the two months that follow it (October-November), then train a random-forest classifier.
GetModelData <- function(data, startMonth, endMonth, observeSize) {
    feature <- GetFeature(data, startMonth, endMonth)
    labelStartMonth <- endMonth + 1
    labelEndMonth <- endMonth + observeSize
    label <- GetLabel(data, labelStartMonth, labelEndMonth)
    out <- join(feature, label, feature$user_id == label$id, "left_outer") %>% drop("id")
    out %>% mutate(label = ifelse(out$label == "remain", "remain", "losing"))
}
training_data <- GetModelData(df_user_summary, 6, 9, 2)
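To make the window arithmetic concrete, a small base-R helper (hypothetical, not part of the original code) shows which months feed the features and which feed the labels for each call to GetModelData:

```r
# Hypothetical helper mirroring GetModelData's window arithmetic:
# features come from [startMonth, endMonth], labels from the observeSize
# months immediately after endMonth.
windows <- function(startMonth, endMonth, observeSize) {
    c(featureStart = startMonth, featureEnd = endMonth,
      labelStart = endMonth + 1, labelEnd = endMonth + observeSize)
}
windows(6, 9, 2)   # training: features Jun-Sep, labels Oct-Nov
windows(7, 10, 2)  # test:     features Jul-Oct, labels Nov-Dec
```

Shifting the whole window by one month for the test set keeps the feature and observation periods the same length, so the two datasets are directly comparable.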
rf_model <- txRandomForest(data = training_data,
                           formula = label ~ lastTradeMonth + maxAMT + maxNum + sumAMT + sumNum + averageAMT + averageNum,
                           type = "classification")
Prediction
Build the test set the same way, shifting the window forward one month (features July-October, labels November-December), then apply the model to obtain predictions.
test_data <- GetModelData(df_user_summary, 7, 10, 2)
result <- predict(rf_model, test_data)
showDF(result)
Model Evaluation
Evaluate using ROC AUC and inspect feature importance.
areaUnderROC <- txBinaryClassificationEvaluator(result, metricName = "areaUnderROC",
                                                probabilityCol = "probability",
                                                labelCol = "label",
                                                labels = summary(rf_model)$labels)
areaUnderROC
variable_importance <- importance(rf_model)
variable_importance
The evaluation shows that averageNum and sumNum are the most influential features, confirming the earlier hypothesis that a user's average and total transaction counts strongly affect retention.
Conclusion
This case study outlines the complete analytical pipeline for building a churn warning model in Discover, from data ingestion to model evaluation, illustrating how Discover’s built‑in functions simplify predictive analytics and help enterprises implement timely marketing actions to improve customer value.
StarRing Big Data Open Lab
Focused on big data technology research, exploring the Big Data era | [email protected]