Machine Learning and Survival Analysis

Predicting customer churn using complex statistical modeling

Author

Patrick Lefler

Published

February 17, 2026

Abstract

Credit card attrition is rarely a sudden decision — it is a behavioral drift detectable months before an account formally closes. Using over 10,000 anonymized customer records, this analysis builds a three-stage intelligence framework for a consumer credit card division. Logistic regression isolates the specific behavioral and demographic factors that amplify departure risk, expressed as interpretable Risk Multipliers. A Random Forest classifier then scores every active account on an individualized churn probability, producing a ranked tactical outreach list for the retention team. Finally, Kaplan-Meier survival analysis maps the customer lifecycle to pinpoint the tenure milestones where attrition risk concentrates — identifying a pronounced drop-off at the 36-month mark consistent with expiring promotional incentives. Across all three methods, transaction velocity emerges as the dominant signal: when a customer stops using their card, they are signaling intent to leave long before they make the call.

Setup

# Load required libraries for the analysis
library(formattable) # Allow for table styling - currency
library(kableExtra)  # Professional table formatting
library(patchwork)  # Combine multiple plots in rows & columns
library(plotly)      # Interactive executive charts
library(readxl)      # Ingesting Excel workbooks 
library(sessioninfo) # Detailed session information
library(survival)    # Tenure and 'time-to-event' analysis
library(survminer)   # Visualizing survival curves
library(tidyverse)   # Data wrangling and visualization
library(tidymodels)  # Unified modeling framework
library(vip)         # Feature importance plots

# Load the historic bank customer data 
raw_data <- read_excel("data/bank_attrition_data.xlsm", sheet = "bankChurners")

# Preparing data: ensuring all chr columns are a factor and predictors are cleaned
attritionData <- raw_data %>%
  mutate(
    churn = factor(attrition.flag, levels = c("Existing Customer", "Attrited Customer")),
    across(where(is.character), as.factor)
  ) %>%
  mutate(gender = ifelse(gender == "F", "Female", "Male")) %>%
  mutate(gender = factor(gender, levels = c("Female", "Male"))) %>%
  mutate(income.bracket = case_when(
      income.bracket == "Unknown" ~ "Unknown",
      income.bracket == "Less than $40K" ~ "<$40k",
      income.bracket == "$40K - $60K" ~ "$40k-$60k",
      income.bracket == "$60K - $80K" ~ "$60k-$80k",
      income.bracket == "$80K - $120K" ~ "$80k-$120k",
      income.bracket == "$120K +" ~ ">$120k")) %>%
  
  mutate(income.bracket = factor(income.bracket, levels = c("<$40k", "$40k-$60k", 
                                                              "$60k-$80k", "$80k-$120k",
                                                              ">$120k", "Unknown"))) %>%
  
  mutate(card.category = factor(card.category, levels = c("Blue", "Silver",
                                                          "Gold", "Platinum"))) %>%
  
  mutate(education.level = factor(education.level, levels = c("Uneducated", 
                                    "High School", "College", "Graduate", 
                                    "Post-Graduate","Doctorate", "Unknown"))) %>%
  
  mutate(marital.status = factor(marital.status, levels = c("Single", "Married",
                                                            "Divorced", "Unknown"))) %>%

  mutate(age.bracket = ifelse(customer.age < 30, "<30", 
                                  ifelse (customer.age < 40, "30-39",
                                  ifelse (customer.age < 50, "40-49",
                                  ifelse (customer.age < 60, "50-59", "60+")
                                  )
                                  )
                                  )) %>%
  
  mutate(age.bracket = factor(age.bracket, levels = c("<30", "30-39", "40-49", 
                                                      "50-59", "60+")))

Introduction

This project provides a comprehensive data science framework for identifying, analyzing, and predicting customer attrition within a consumer credit card division. By leveraging a historic data set of over 10,000 records, this analysis moves beyond descriptive reporting to deliver actionable risk intelligence and tactical insights.

The underlying data is sourced from the Kaggle Credit Card Customers data set. It contains anonymized profiles of both current and former clients, blending two distinct data categories:

Demographics: Detailed attributes including age, gender, marital status, income category, and education level.
Account Behavior: Performance metrics such as credit limits, revolving balances, transaction frequency, and bank-initiated communication logs.

Within this population, customer churn (represented by former bank customers) accounts for approximately 16% of the total data set, providing a robust sample for predictive modeling and behavioral analysis.

The primary goal of this analysis is to transform raw data into a proactive retention strategy through four key methodologies:

Identify Drivers: Utilizing Logistic Regression to isolate specific “Risk Multipliers”—the behavioral factors that significantly increase the likelihood of account closure.
Predict Risk: Deploying a Random Forest machine learning model to assign an individualized churn probability score to every existing customer.
Analyze Lifecycle: Implementing Survival Analysis to map “Customer Life Expectancy,” allowing the bank to identify critical tenure milestones where the risk of departure is highest.
Tactical Action: Generating a prioritized outreach list of at-risk, active customers, enabling the retention team to focus resources where they will have the highest impact.

attritionDataSubset <- attritionData %>%
  select(customer.age,
         months.on.book,
         total.relationship.count,
         months.inactive.12.months,
         contacts.count.12.month,
         credit.limit,
         total.revolving.balance,
         avg.open.to.buy,
         total.amount.changed.previous.quarter,
         total.transaction.amount,
         total.transaction.count,
         avg.utilization.rate
         )

summary_data <- attritionDataSubset %>%
  pivot_longer(cols = everything(), names_to = "metric") %>%
  group_by(metric) %>%
  summarize(
    max = max(value),
    min = min(value),
    mean = mean(value),
    std_dev = sd(value),
    first_qrt = quantile(value, probs = .25),
    third_qrt = quantile(value, probs = .75)) %>%
  mutate(metric = case_when(
      metric == "customer.age" ~ "Customer Age",
      metric == "months.on.book" ~ "Months on Book",
      metric == "total.relationship.count" ~ "Total Relationship Count",
      metric == "months.inactive.12.months" ~ "Months Inactive Past Year",
      metric == "contacts.count.12.month" ~ "Bank Communications Past Year",
      metric == "credit.limit" ~ "Credit Limit",
      metric == "total.revolving.balance" ~ "Total Revolving Balance",
      metric == "avg.open.to.buy" ~ "Average Open to Buy",
      metric == "total.amount.changed.previous.quarter" ~ "Total Amount Changed Previous Quarter",
      metric == "total.transaction.amount" ~ "Total Transaction Amount",
      metric == "total.transaction.count" ~ "Total Transaction Count",
      metric == "avg.utilization.rate" ~ "Average Utilization Rate")) %>%
  relocate(metric, min, first_qrt, mean, third_qrt, max, std_dev) %>%
  select(Metric = metric, 
         Minimum = min,
         "1st Quartile" = first_qrt,
         Mean = mean,
         "3rd Quartile" = third_qrt,
         Maximum = max,
         "Std Dev" = std_dev)

# One alignment letter per column; full_width belongs to kable_styling(), not to the options vector
kbl(summary_data, caption = "Summary of Quantitative Data from Dataset", digits = 1, align = "lcccccr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, fixed_thead = T) %>%
  scroll_box(width = "100%", height = "480px")

Summary of Quantitative Data from Dataset
Metric	Minimum	1st Quartile	Mean	3rd Quartile	Maximum	Std Dev
Average Open to Buy	3.0	1324.5	7469.1	9859.0	34516.0	9090.7
Average Utilization Rate	0.0	0.0	0.3	0.5	1.0	0.3
Bank Communications Past Year	0.0	2.0	2.5	3.0	6.0	1.1
Credit Limit	1438.3	2555.0	8632.0	11067.5	34516.0	9088.8
Customer Age	26.0	41.0	46.3	52.0	73.0	8.0
Months Inactive Past Year	0.0	2.0	2.3	3.0	6.0	1.0
Months on Book	13.0	31.0	35.9	40.0	56.0	8.0
Total Amount Changed Previous Quarter	0.0	0.6	0.8	0.9	3.4	0.2
Total Relationship Count	1.0	3.0	3.8	5.0	6.0	1.6
Total Revolving Balance	0.0	359.0	1162.8	1784.0	2517.0	815.0
Total Transaction Amount	510.0	2155.5	4404.1	4741.0	18484.0	3397.1
Total Transaction Count	10.0	45.0	64.9	81.0	139.0	23.5

Visualizing Risk: A Comparative View

Understanding the Behavioral Gap

To determine why customers leave, this analysis examines behavioral signatures: patterns in how individuals use their cards before closing an account. Exploratory Data Analysis (EDA) is used to compare the habits of over 10,000 customers. Charting transaction counts against transaction amounts makes current at-risk customers easier to identify.

Statistical analysis shows that churned customers aren’t necessarily those with the lowest credit limits; rather, they are the ones who have stopped integrating the card into their daily routine. Identifying this drop in transaction velocity might allow the bank to intervene weeks or months before a customer formally requests to cancel, transforming the strategy from reactive damage control to proactive relationship management.

Interactive comparison of transaction counts and total spending.

#| label: visual-analysis
#| fig-cap: "Interactive comparison of transaction counts and total spending."

p <- ggplot(attritionData, aes(x = total.transaction.count, y = total.transaction.amount, color = churn)) +
  geom_point(alpha = 0.3) +
  scale_color_manual(values = c("Existing Customer" = "#3e3f3a", "Attrited Customer" = "#df691a")) +
  labs(title = "Transaction Velocity: Usage vs. Attrition Status",
       x = "Total Transaction Count (Annual)",
       y = "Total Transaction Amount ($)",
       color = "Status") +
  theme_minimal()

ggplotly(p)

Segmenting Attrition Across Key Factors

While the Transaction Velocity plot provides a high-level view of account usage, attrition risk can also be distributed unevenly across demographic segments. The comparative plots in this section show how attrition rates vary across age, gender, marital status, income, and education. In this case, however, no segment stands out enough to offer real insight into why customers stay or leave.
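
As a minimal sketch of what each stacked bar in these plots encodes, the snippet below computes the share of existing and attrited customers within each segment. The records are made up for illustration; in the analysis the equivalent columns would come from attritionData.

```r
# Illustrative only: the per-segment shares that a position = "fill" bar displays.
# The toy records below are invented and do not come from the bank data set.
toy <- data.frame(
  gender = c("Female", "Female", "Female", "Female",
             "Male", "Male", "Male", "Male"),
  churn  = c("Existing Customer", "Existing Customer",
             "Existing Customer", "Attrited Customer",
             "Existing Customer", "Existing Customer",
             "Attrited Customer", "Attrited Customer")
)

# Row-wise proportions: each gender's split between attrited and existing customers
rates <- prop.table(table(toy$gender, toy$churn), margin = 1)
rates
```

Comparing the "Attrited Customer" column across rows gives the same reading as comparing the orange portions of the bars.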

# Visualization by Gender
plotGender <- ggplot(attritionData, aes(x = gender, fill = churn)) +
  geom_bar(position = "fill", alpha = 0.8) +
  scale_fill_manual(values = c("Existing Customer" = "#e5e4e2", "Attrited Customer" = "#df691a")) +
  scale_y_continuous(labels = scales::percent) + 
  labs(subtitle  = "Gender",
       x = "",
       y = "",
       fill = "Status") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position="none")

# Visualization by Age
plotAge <- ggplot(attritionData, aes(x = age.bracket, fill = churn)) +
  geom_bar(position = "fill", alpha = 0.8) +
  scale_fill_manual(values = c("Existing Customer" = "#e5e4e2", "Attrited Customer" = "#df691a")) +
  scale_y_continuous(labels = scales::percent) + 
  labs(subtitle  = "Age",
       x = "",
       y = "",
       fill = "Status") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position="bottom")

# Visualization by Marital Status
plotMaritalStatus <- ggplot(attritionData, aes(x = marital.status, fill = churn)) +
  geom_bar(position = "fill", alpha = 0.8) +
  scale_fill_manual(values = c("Existing Customer" = "#e5e4e2", "Attrited Customer" = "#df691a")) +
  scale_y_continuous(labels = scales::percent) + 
  labs(subtitle  = "Marital Status",
       x = "",
       y = "",
       fill = "Status") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "none")

# Visualization by Income
plotIncome <- ggplot(attritionData, aes(x = income.bracket, fill = churn)) +
  geom_bar(position = "fill", alpha = 0.8) +
  scale_fill_manual(values = c("Existing Customer" = "#e5e4e2", "Attrited Customer" = "#df691a")) +
  scale_y_continuous(labels = scales::percent) + 
  labs(subtitle = "Income",
       x = "",
       y = "",
       fill = "Status") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position="none")

# Visualization by Education Level
plotEducation <- ggplot(attritionData, aes(x = education.level, fill = churn)) +
  geom_bar(position = "fill", alpha = 0.8) +
  scale_fill_manual(values = c("Existing Customer" = "#e5e4e2", "Attrited Customer" = "#df691a")) +
  scale_y_continuous(labels = scales::percent) + 
  labs(subtitle = "Education",
       x = "",
       y = "",
       fill = "Status") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position="none")


(plotGender + plotAge + plotMaritalStatus) / (plotIncome + plotEducation) + plot_annotation(title = 'Customer Attrition Segmentation Across Multiple Factors')

# plotIncome + plotEducation + plot_annotation(title = 'Customer Attrition Segmentation Across Income & Education')

Statistical Drivers

Identifying the “Why” with Risk Multipliers

To move beyond simple charts, logistic regression is used to calculate the Risk Multiplier (mathematically, the exponentiated coefficient, also known as the odds ratio). The multiplier is obtained by exponentiating each raw model coefficient, which converts it to a scale that represents the “Odds of Churn.”

How to Interpret the Numbers:

Multiplier > 1 (Risk Driver): As this factor increases, the risk of attrition increases. A multiplier of 1.20 means that each one-unit increase in the factor raises the odds of a customer leaving by 20%.

Multiplier = 1 (Neutral): The factor has no effect on the risk of attrition. It does not help predict whether a customer will stay or leave.

Multiplier < 1 (Protective Factor): As this factor increases, the risk of attrition decreases. A multiplier of 0.80 means that each one-unit increase lowers the odds of churn by 20%.

In this updated view, all variables in the model are included to provide a complete picture of every factor recorded by the bank, from age to revolving balance.
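
Because a multiplier acts on odds rather than directly on probability, a short worked example may help. The sketch below applies a multiplier to a baseline churn probability; the 16% baseline is the overall attrition rate quoted in the introduction, and the helper name apply_multiplier is hypothetical, introduced only for illustration.

```r
# Illustrative only: how an odds multiplier shifts a churn probability.
# apply_multiplier() is a hypothetical helper, not part of the analysis code.
apply_multiplier <- function(prob, multiplier) {
  odds <- prob / (1 - prob)       # convert probability to odds
  new_odds <- odds * multiplier   # a one-unit increase scales the odds
  new_odds / (1 + new_odds)       # convert back to a probability
}

baseline <- 0.16                  # approximate overall churn rate in the data set

risk_driver <- apply_multiplier(baseline, 1.20)  # odds up 20%
protective  <- apply_multiplier(baseline, 0.80)  # odds down 20%
neutral     <- apply_multiplier(baseline, 1.00)  # probability unchanged

round(c(risk_driver = risk_driver, protective = protective, neutral = neutral), 3)

# A raw logistic coefficient becomes a multiplier by exponentiation
exp(log(1.20))
```

At a 16% baseline, a multiplier of 1.20 moves the probability to roughly 18.6% rather than 19.2%, which is why the multipliers are best read as changes in odds.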

#| label: drivers-table

# Fit logistic regression including 'all' significant drivers from the data set
logit_model <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(churn ~ . - client.id - attrition.flag, data = attritionData)

# Formatting the comprehensive drivers table
all_drivers_table <- tidy(logit_model, exponentiate = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    term = str_to_title(str_replace_all(term, "\\.", " ")),
    Impact = case_when(
      estimate > 1.05 ~ "Higher Risk",
      estimate < 0.95 ~ "Protective Factor",
      TRUE ~ "Neutral"
    )
  ) %>%
  select(Factor = term, `Risk Multiplier` = estimate, Impact) %>%
  arrange(desc(`Risk Multiplier`)) 

tblData <- all_drivers_table %>%
  filter(Factor != "Avg Open To Buy") %>%
  mutate(Factor = case_when(
      Factor == "Age Bracket50-59" ~ "Age Bracket: 50-59",
      Factor == "Age Bracket40-49" ~ "Age Bracket: 40-49",
      Factor == "Age Bracket60+" ~ "Age Bracket: 60+",
      Factor == "Age Bracket30-39" ~ "Age Bracket: 30-39",
      Factor == "Card Categorygold" ~ "Card Category: Gold",
      Factor == "Card Categoryplatinum" ~ "Card Category: Platinum",
      Factor == "Income Bracket>$120k" ~ "Income: > $120k",
      Factor == "Contacts Count 12 Month" ~ "Previous Year Contacts",
      Factor == "Months Inactive 12 Months" ~ "Months Inactive Previous 12 Months",
      Factor == "Income Bracket$80k-$120k" ~ "Income: $80k to $120k",
      Factor == "Card Categorysilver" ~ "Card Category: Silver",
      Factor == "Education Leveldoctorate" ~ "Education Level: Doctorate",
      Factor == "Education Levelpost-Graduate" ~ "Education Level: Post Graduate",
      Factor == "Dependent Count" ~ "Number of Dependents",
      Factor == "Income Bracket$60k-$80k" ~ "Income: $60k to $80k",
      Factor == "Education Levelunknown" ~ "Education Level: Unknown",
      Factor == "Total Transaction Amount" ~ "Total Transaction Amount",
      Factor == "Credit Limit" ~ "Credit Limit",
      Factor == "Total Revolving Balance" ~ "Total Revolving Balance",
      Factor == "Months On Book" ~ "Months On Book",
      Factor == "Customer Age" ~ "Customer Age",
      Factor == "Education Levelhigh School" ~ "Education Level: High School",
      Factor == "Income Bracketunknown" ~ "Income: Unknown",
      Factor == "Marital Statusunknown" ~ "Marital Status: Unknown",
      Factor == "Education Levelcollege" ~ "Education Level: College",
      Factor == "Education Levelgraduate" ~ "Education Level: Graduate",
      Factor == "Marital Statusdivorced" ~ "Marital Status: Divorced",
      Factor == "Total Transaction Count" ~ "Total Transaction Count",
      Factor == "Avg Utilization Rate" ~ "Average Utilization Rate",
      Factor == "Income Bracket$40k-$60k" ~ "Income: $40k to $60k",
      Factor == "Total Amount Changed Previous Quarter" ~ "Total Amount Changed Previous Quarter",
      Factor == "Total Relationship Count" ~ "Total Relationship Count",
      Factor == "Marital Statusmarried" ~ "Marital Status: Married",
      Factor == "Gendermale" ~ "Gender: Male",
      Factor == "Total Count Change Previous Quarter" ~ "Total Count Change Previous Quarter")
      )

kbl(tblData, caption = "Comprehensive Analysis of All Account Drivers", digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F) %>%
  # Color the Impact column from the same filtered table that is displayed,
  # so the color vector has exactly one entry per row
  column_spec(3, color = "white", 
              background = case_when(
                tblData$Impact == "Higher Risk" ~ "#df691a",
                tblData$Impact == "Protective Factor" ~ "#5cb85c",
                TRUE ~ "#999999")
              ) %>%
  scroll_box(width = "100%", height = "500px")

Comprehensive Analysis of All Account Drivers
Factor	Risk Multiplier	Impact
Age Bracket: 50-59	14.67	Higher Risk
Age Bracket: 40-49	13.90	Higher Risk
Age Bracket: 60+	11.35	Higher Risk
Age Bracket: 30-39	7.06	Higher Risk
Card Category: Gold	2.91	Higher Risk
Card Category: Platinum	2.61	Higher Risk
Income: > $120k	1.92	Higher Risk
Previous Year Contacts	1.69	Higher Risk
Months Inactive Previous 12 Months	1.66	Higher Risk
Card Category: Silver	1.56	Higher Risk
Income: $80k to $120k	1.41	Higher Risk
Education Level: Post Graduate	1.31	Higher Risk
Education Level: Doctorate	1.29	Higher Risk
Education Level: Unknown	1.07	Higher Risk
Number of Dependents	1.03	Neutral
Total Transaction Amount	1.00	Neutral
Income: $60k to $80k	1.00	Neutral
Credit Limit	1.00	Neutral
Total Revolving Balance	1.00	Neutral
Income: Unknown	1.00	Neutral
Months On Book	1.00	Neutral
Average Utilization Rate	0.96	Neutral
Education Level: High School	0.96	Neutral
Customer Age	0.96	Neutral
Education Level: College	0.94	Protective Factor
Education Level: Graduate	0.93	Protective Factor
Total Transaction Count	0.89	Protective Factor
Marital Status: Unknown	0.87	Protective Factor
Income: $40k to $60k	0.84	Protective Factor
Marital Status: Divorced	0.81	Protective Factor
Total Amount Changed Previous Quarter	0.67	Protective Factor
Total Relationship Count	0.64	Protective Factor
Marital Status: Married	0.52	Protective Factor
Gender: Male	0.47	Protective Factor
Total Count Change Previous Quarter	0.06	Protective Factor

Predictive Intelligence

Decoding the Predictive Ranking Scale

The horizontal axis, labeled “Importance Score”, represents the predictive contribution of each factor. In this analysis, a calculation called Gini Impurity is used to determine these scores. Think of impurity as the amount of uncertainty or “clutter” in the data. Every time the model uses a variable like Transaction Count to successfully sort customers into “Stay” or “Leave” buckets, it reduces that clutter.

The scale (ranging from 0 to 250) is a calculated aggregate score, not a direct count of customers. It represents the total amount of clarity gained across all the decision trees in the model (ranger grows 500 by default). A larger data set allows for more splits, which can result in higher totals, so the absolute value is less important than the relative distance between the bars. For example, if one factor has a score of 180 and another has 60, the first contributes three times as much to sorting accounts into the correct status. The ranking also serves as a check that the model is prioritizing the same high-impact behaviors, such as usage velocity and revolving balances, that industry experience suggests are the true drivers of risk.
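
To make the idea of "reducing clutter" concrete, here is a minimal sketch of the Gini impurity calculation for a single split. The class counts are invented for illustration. An impurity-based importance score accumulates decreases like this one over every split that uses a given variable, across all trees; the exact weighting ranger applies is an implementation detail not reproduced here.

```r
# Illustrative only: Gini impurity and the impurity decrease from one split.
# The class counts below are made up for demonstration.
gini <- function(counts) {
  p <- counts / sum(counts)   # class proportions within the node
  1 - sum(p^2)                # 0 = perfectly pure node
}

parent <- c(stay = 84, leave = 16)  # node before the split
left   <- c(stay = 80, leave = 4)   # e.g. customers with a high transaction count
right  <- c(stay = 4,  leave = 12)  # e.g. customers with a low transaction count

n <- sum(parent)
weighted_child <- (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)

# Impurity decrease credited to the splitting variable at this node
decrease <- gini(parent) - weighted_child
decrease
```

A variable that repeatedly produces large decreases ends up with a long bar in the importance plot.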

Somewhat perplexingly, age does not appear to be a key indicator of future customer attrition in the Random Forest results below, yet the age brackets rank at the very top of the risk multiplier table above. One likely contributor is that the logistic model includes both the continuous customer age and the derived age brackets, so each bracket's multiplier is measured relative to the under-30 baseline while continuous age is held fixed. More investigative work is needed to explain the apparent discrepancy.

# Build Random Forest model - dropping the ID and target columns from the data itself
# guarantees they cannot enter the model as predictors (prevents data leakage)
rf_model <- rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification") %>%
  fit(churn ~ ., data = attritionData %>% select(-client.id, -attrition.flag))

# 1. Importance Plot: Visualizing the predictive 'heavy lifters' using native ggplot
importance_data <- vi(rf_model) %>%
  mutate(Variable = str_to_title(str_replace_all(Variable, "\\.", " ")))

ggplotData <- importance_data %>%
  filter(Variable != "Attrition Flag") %>%    ## Exclude "Attrition Flag" data from plot
  filter(Variable != "Client Id")             ## Exclude "Client Id" data from plot

importancePlot <- ggplot(ggplotData, aes(x = Importance, y = reorder(Variable, Importance))) +
  geom_col(fill = "#df691a", alpha = 0.8) +
  labs(title = "Predictive Ranking: Key Indicators of Customer Attrition",
       subtitle = "Calculated relative contribution to model accuracy",
       x = "Importance Score (Weighted Information Gain)",
       y = "Customer Attribute") +
  xlim(0, 250) +
  theme_minimal()

importancePlot

High-Risk Customer Attrition List

Identifying customers at risk for attrition

After scoring the data, all current customers were ranked from highest to lowest calculated probability of departure. Even the highest-ranked at-risk customer has a calculated probability of leaving of only about 12%. These scores are computed on the same records used to train the model, which likely pulls the probabilities for existing customers toward zero. The next step would be to drill down on these at-risk customers to refine the actual probability of departure. More data is needed to improve confidence in the model. As with most machine learning and logistic regression analyses, the quality of the outcome is only as good as the quality of the data.
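
One way to build confidence in the scores is to evaluate the model on records it never saw during fitting. The sketch below is an assumption-laden illustration, not the method used in this analysis: the data are simulated, the variable names are hypothetical stand-ins, and a plain logistic model is used so the example runs with base R alone. The same hold-out split could be applied before fitting the Random Forest.

```r
# Illustrative only: scoring held-out records instead of the training records.
# Data are simulated; txn_count and churned are hypothetical stand-in columns.
set.seed(42)
n <- 2000
txn_count <- rpois(n, lambda = 65)
churned <- rbinom(n, size = 1, prob = plogis(4 - 0.1 * txn_count))
sim <- data.frame(txn_count, churned)

# Hold out 25% of the records before fitting
test_idx <- sample(n, size = n / 4)
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

fit <- glm(churned ~ txn_count, data = train, family = binomial)

# Rank-based AUC: probability that a churner is scored above a non-churner
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_train <- auc(predict(fit, train, type = "response"), train$churned)  # in-sample
auc_test  <- auc(predict(fit, test,  type = "response"), test$churned)   # held-out
c(in_sample = auc_train, held_out = auc_test)
```

The held-out figure is the more honest estimate of how well the ranking would perform on accounts the model has not yet seen.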

# Generate tactical high-risk list for EXISTING CUSTOMERS ONLY
high_risk_targets <- predict(rf_model, attritionData, type = "prob") %>%
  rename_with(~"churn_prob", starts_with(".pred_Attrited")) %>% 
  bind_cols(attritionData %>% select(client.id, 
                                     gender, 
                                     age.bracket, 
                                     income.bracket, 
                                     card.category, 
                                     total.revolving.balance,
                                     total.transaction.amount,
                                     contacts.count.12.month, 
                                     attrition.flag)) %>%
  filter(attrition.flag == "Existing Customer") %>%
  arrange(desc(churn_prob)) 

tblData <- high_risk_targets %>%
  # select() both renames and orders the columns, so no separate relocate() is needed
  select(ClientID = client.id,
         Gender = gender,
         `Age Bracket` = age.bracket,
         Income = income.bracket,
         `Card Tier` = card.category,
         `Outstanding Balance` = total.revolving.balance,
         `Recent Transaction Amount` = total.transaction.amount,
         `Recent Contacts` = contacts.count.12.month,
         `Churn Probability` = churn_prob) %>%
  slice(1:50)

tblData$"Churn Probability" <- formattable::percent(tblData$"Churn Probability")
tblData$"Outstanding Balance" <- formattable::currency(tblData$"Outstanding Balance", "$", format = "d")
tblData$"Recent Transaction Amount" <- formattable::currency(tblData$"Recent Transaction Amount", "$", format = "d")

# full_width belongs to kable_styling(), not to the bootstrap_options vector
kbl(tblData, caption = "Tactical Outreach List: Top 50 Highest At-Risk ACTIVE Accounts", digits = 3, align = "lcclcrccr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, fixed_thead = T) %>%
  scroll_box(width = "100%", height = "500px")

Tactical Outreach List: Top 50 Highest At-Risk ACTIVE Accounts
ClientID	Gender	Age Bracket	Income	Card Tier	Outstanding Balance	Recent Transaction Amount	Recent Contacts	Churn Probability
785432733	Female	40-49	<$40k	Gold	$ 0	$966	3	12.09%
721425558	Male	50-59	>$120k	Blue	$ 0	$1,536	2	9.01%
712215258	Female	50-59	$40k-$60k	Silver	$ 0	$1,298	3	8.28%
717975333	Male	50-59	$80k-$120k	Blue	$1,330	$837	2	8.17%
754897008	Male	40-49	$40k-$60k	Blue	$1,418	$1,319	2	7.64%
771075258	Male	50-59	>$120k	Silver	$1,527	$1,268	2	7.64%
709465758	Female	60+	<$40k	Blue	$ 0	$902	3	7.57%
713497983	Male	40-49	$60k-$80k	Blue	$ 0	$3,459	2	7.46%
710662158	Male	40-49	>$120k	Blue	$2,517	$2,051	3	7.40%
719621958	Male	40-49	$60k-$80k	Blue	$ 0	$1,720	3	7.32%
788965683	Female	40-49	<$40k	Blue	$ 0	$2,170	3	7.08%
710044308	Female	40-49	$40k-$60k	Blue	$1,594	$2,480	2	6.86%
823629333	Male	40-49	$40k-$60k	Blue	$ 0	$4,220	1	6.29%
787467858	Male	40-49	$80k-$120k	Silver	$2,045	$4,081	3	6.26%
710632683	Male	30-39	$40k-$60k	Blue	$ 0	$2,308	3	6.23%
708664008	Male	50-59	$80k-$120k	Blue	$ 0	$1,222	4	6.16%
718627458	Female	40-49	Unknown	Blue	$ 0	$2,119	2	6.15%
788465583	Female	40-49	$40k-$60k	Blue	$1,573	$3,029	3	6.07%
709879458	Female	50-59	Unknown	Blue	$1,376	$1,881	3	6.05%
827111283	Male	40-49	$80k-$120k	Blue	$578	$1,109	2	6.05%
716436483	Female	40-49	<$40k	Blue	$ 0	$2,521	3	6.02%
721306908	Female	60+	Unknown	Blue	$540	$3,440	2	5.76%
709106358	Male	40-49	$60k-$80k	Blue	$ 0	$816	0	5.75%
803776533	Male	40-49	$60k-$80k	Blue	$ 0	$2,118	2	5.70%
719038008	Female	40-49	<$40k	Blue	$1,192	$4,862	2	5.68%
710092683	Male	30-39	$80k-$120k	Blue	$1,421	$1,837	2	5.66%
713146683	Female	30-39	Unknown	Blue	$ 0	$5,473	2	5.64%
709253433	Female	50-59	$40k-$60k	Blue	$1,912	$2,448	2	5.51%
789270033	Male	40-49	$80k-$120k	Silver	$ 0	$1,122	3	5.50%
709531908	Male	50-59	$60k-$80k	Blue	$ 0	$2,184	2	5.43%
708741633	Male	50-59	$60k-$80k	Blue	$ 0	$1,771	4	5.42%
708655983	Female	40-49	Unknown	Blue	$ 0	$1,353	2	5.36%
720846558	Male	40-49	$80k-$120k	Blue	$ 0	$2,512	4	5.36%
779743908	Male	40-49	$60k-$80k	Blue	$ 0	$1,196	2	5.26%
756629133	Female	60+	<$40k	Blue	$ 0	$2,088	2	5.21%
714387108	Male	50-59	$80k-$120k	Blue	$ 0	$3,481	3	5.17%
779749908	Male	40-49	$60k-$80k	Gold	$2,061	$1,350	3	5.15%
711089733	Female	50-59	<$40k	Blue	$ 0	$1,058	0	5.15%
789175683	Female	40-49	<$40k	Silver	$2,517	$8,352	2	5.14%
770632758	Female	50-59	Unknown	Blue	$1,608	$1,328	2	5.14%
718086783	Male	50-59	>$120k	Blue	$ 0	$4,738	1	5.10%
793059108	Female	50-59	$40k-$60k	Blue	$1,440	$893	3	5.10%
805259733	Female	50-59	Unknown	Blue	$ 0	$1,731	4	5.08%
720476733	Female	40-49	<$40k	Blue	$799	$1,002	2	5.07%
720893958	Female	40-49	<$40k	Blue	$999	$4,215	3	5.07%
758076408	Male	50-59	$40k-$60k	Blue	$1,603	$1,590	3	5.03%
787541883	Male	40-49	$80k-$120k	Blue	$ 0	$1,120	3	5.02%
711795633	Male	60+	<$40k	Blue	$ 0	$3,798	4	4.97%
771742533	Male	40-49	$40k-$60k	Blue	$1,718	$937	2	4.95%
716211258	Female	50-59	<$40k	Blue	$ 0	$1,966	4	4.94%

The Attrition Timeline

Visualizing Customer Life Expectancy

This final section uses survival analysis to map the customer life cycle. Sudden drops in the curve indicate risk milestones: specific tenure points where customers are statistically most likely to reconsider their relationship with the bank. In this case, there appears to be a significant drop-off in customer retention at the three-year point. One possible cause is that new customers are offered three years of below-market financing or other inducements. It is a good place to start further investigation.
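
For readers unfamiliar with how the retention curve is built, the following minimal sketch computes a Kaplan-Meier estimate by hand on a toy data set. The eight records are invented for illustration; accounts that are still open are treated as censored, which mirrors the status coding used in the analysis code.

```r
# Illustrative only: Kaplan-Meier estimate computed by hand on toy data.
# tenure = months on book; status = 1 if the account closed, 0 if still open (censored).
toy <- data.frame(
  tenure = c(12, 24, 24, 36, 36, 36, 40, 48),
  status = c( 0,  1,  0,  1,  1,  1,  0,  0)
)

# The curve only steps down at tenures where at least one closure occurred
event_times <- sort(unique(toy$tenure[toy$status == 1]))

km <- data.frame(
  time    = event_times,
  at_risk = sapply(event_times, function(tm) sum(toy$tenure >= tm)),
  events  = sapply(event_times, function(tm) sum(toy$tenure == tm & toy$status == 1))
)

# S(t) = product over event times t_i <= t of (1 - d_i / n_i)
km$survival <- cumprod(1 - km$events / km$at_risk)
km
```

In this toy example three of the five accounts still at risk at month 36 close, so the curve falls sharply there. The table should agree with summary(survfit(Surv(tenure, status) ~ 1, data = toy)) from the survival package.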

#| label: survival-timeline

surv_obj <- attritionData %>%
  mutate(status = ifelse(attrition.flag == "Attrited Customer", 1, 0))

fit_km <- survfit(Surv(months.on.book, status) ~ 1, data = surv_obj)

ggsurvplot(fit_km, 
           data = surv_obj,
           palette = "#df691a",
           title = "Customer Retention Probability by Tenure",
           xlab = "Months on Book (Customer Lifecycle)",
           ylab = "Retention Probability",
           ggtheme = theme_minimal())

Key Findings

Analysis of more than 10,000 customer records confirms that attrition within the consumer credit card division is rarely a sudden event; rather, it is characterized by a gradual “behavioral drift.” The Random Forest model identifies the primary predictors of churn not as demographic markers such as age or income, but as behavioral signals, including declines in transaction velocity and average utilization. When a customer’s total transaction count drops or they cease maintaining a revolving balance, they may be signaling an intent to depart months before the account is formally closed.

Furthermore, survival analysis identified a critical “tenure risk” at the 36-month milestone. This suggests that as customers reach their third anniversary with the bank, initial product appeals or promotional incentives often lose their efficacy. To mitigate this risk, a structural shift in the Customer Lifecycle Management process may be needed; specifically, the implementation of automated stay-active incentives and product reviews timed for this three-year window.

This project represents only a starting point in demonstrating how machine learning and logistic regression can solve complex challenges within the credit and risk markets. By transforming data into proactive intelligence, institutions can intervene earlier and preserve valuable customer relationships.

Session Information

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.5.2 (2025-10-31)
 os       macOS Tahoe 26.2
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2026-04-30
 pandoc   3.6.3 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
 quarto   1.8.26 @ /usr/local/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 package      * version    date (UTC) lib source
 abind          1.4-8      2024-09-12 [1] CRAN (R 4.5.0)
 backports      1.5.0      2024-05-23 [1] CRAN (R 4.5.0)
 broom        * 1.0.10     2025-09-13 [1] CRAN (R 4.5.0)
 car            3.1-3      2024-09-27 [1] CRAN (R 4.5.0)
 carData        3.0-5      2022-01-06 [1] CRAN (R 4.5.0)
 cellranger     1.1.0      2016-07-27 [1] CRAN (R 4.5.0)
 class          7.3-23     2025-01-01 [1] CRAN (R 4.5.2)
 cli            3.6.5      2025-04-23 [1] CRAN (R 4.5.0)
 codetools      0.2-20     2024-03-31 [1] CRAN (R 4.5.2)
 crosstalk      1.2.2      2025-08-26 [1] CRAN (R 4.5.0)
 data.table     1.17.8     2025-07-10 [1] CRAN (R 4.5.0)
 dials        * 1.4.2      2025-09-04 [1] CRAN (R 4.5.0)
 DiceDesign     1.10       2023-12-07 [1] CRAN (R 4.5.0)
 digest         0.6.39     2025-11-19 [1] CRAN (R 4.5.2)
 dplyr        * 1.1.4      2023-11-17 [1] CRAN (R 4.5.0)
 evaluate       1.0.5      2025-08-27 [1] CRAN (R 4.5.0)
 farver         2.1.2      2024-05-13 [1] CRAN (R 4.5.0)
 fastmap        1.2.0      2024-05-15 [1] CRAN (R 4.5.0)
 forcats      * 1.0.1      2025-09-25 [1] CRAN (R 4.5.0)
 foreach        1.5.2      2022-02-02 [1] CRAN (R 4.5.0)
 formattable  * 0.2.1      2021-01-07 [1] CRAN (R 4.5.0)
 Formula        1.2-5      2023-02-24 [1] CRAN (R 4.5.0)
 furrr          0.3.1      2022-08-15 [1] CRAN (R 4.5.0)
 future         1.68.0     2025-11-17 [1] CRAN (R 4.5.2)
 future.apply   1.20.0     2025-06-06 [1] CRAN (R 4.5.0)
 generics       0.1.4      2025-05-09 [1] CRAN (R 4.5.0)
 ggplot2      * 4.0.2      2026-02-03 [1] CRAN (R 4.5.2)
 ggpubr       * 0.6.2      2025-10-17 [1] CRAN (R 4.5.0)
 ggsignif       0.6.4      2022-10-13 [1] CRAN (R 4.5.0)
 globals        0.18.0     2025-05-08 [1] CRAN (R 4.5.0)
 glue           1.8.0      2024-09-30 [1] CRAN (R 4.5.0)
 gower          1.0.2      2024-12-17 [1] CRAN (R 4.5.0)
 GPfit          1.0-9      2025-04-12 [1] CRAN (R 4.5.0)
 gridExtra      2.3        2017-09-09 [1] CRAN (R 4.5.0)
 gtable         0.3.6      2024-10-25 [1] CRAN (R 4.5.0)
 hardhat        1.4.2      2025-08-20 [1] CRAN (R 4.5.0)
 hms            1.1.4      2025-10-17 [1] CRAN (R 4.5.0)
 htmltools      0.5.8.1    2024-04-04 [1] CRAN (R 4.5.0)
 htmlwidgets    1.6.4      2023-12-06 [1] CRAN (R 4.5.0)
 httr           1.4.7      2023-08-15 [1] CRAN (R 4.5.0)
 infer        * 1.0.9      2025-06-26 [1] CRAN (R 4.5.0)
 ipred          0.9-15     2024-07-18 [1] CRAN (R 4.5.0)
 iterators      1.0.14     2022-02-05 [1] CRAN (R 4.5.0)
 jsonlite       2.0.0      2025-03-27 [1] CRAN (R 4.5.0)
 kableExtra   * 1.4.0      2024-01-24 [1] CRAN (R 4.5.0)
 km.ci          0.5-6      2022-04-06 [1] CRAN (R 4.5.0)
 KMsurv         0.1-6      2025-05-20 [1] CRAN (R 4.5.0)
 knitr          1.50       2025-03-16 [1] CRAN (R 4.5.0)
 labeling       0.4.3      2023-08-29 [1] CRAN (R 4.5.0)
 lattice        0.22-7     2025-04-02 [1] CRAN (R 4.5.2)
 lava           1.8.2      2025-10-30 [1] CRAN (R 4.5.0)
 lazyeval       0.2.2      2019-03-15 [1] CRAN (R 4.5.0)
 lhs            1.2.0      2024-06-30 [1] CRAN (R 4.5.0)
 lifecycle      1.0.5      2026-01-08 [1] CRAN (R 4.5.2)
 listenv        0.10.0     2025-11-02 [1] CRAN (R 4.5.0)
 lubridate    * 1.9.4      2024-12-08 [1] CRAN (R 4.5.0)
 magrittr       2.0.4      2025-09-12 [1] CRAN (R 4.5.0)
 MASS           7.3-65     2025-02-28 [1] CRAN (R 4.5.2)
 Matrix         1.7-4      2025-08-28 [1] CRAN (R 4.5.2)
 modeldata    * 1.5.1      2025-08-22 [1] CRAN (R 4.5.0)
 nnet           7.3-20     2025-01-01 [1] CRAN (R 4.5.2)
 parallelly     1.45.1     2025-07-24 [1] CRAN (R 4.5.0)
 parsnip      * 1.3.3      2025-08-31 [1] CRAN (R 4.5.0)
 patchwork    * 1.3.2      2025-08-25 [1] CRAN (R 4.5.0)
 pillar         1.11.1     2025-09-17 [1] CRAN (R 4.5.0)
 pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.5.0)
 plotly       * 4.11.0     2025-06-19 [1] CRAN (R 4.5.0)
 prodlim        2025.04.28 2025-04-28 [1] CRAN (R 4.5.0)
 purrr        * 1.2.0      2025-11-04 [1] CRAN (R 4.5.0)
 R6             2.6.1      2025-02-15 [1] CRAN (R 4.5.0)
 ranger         0.17.0     2024-11-08 [1] CRAN (R 4.5.0)
 RColorBrewer   1.1-3      2022-04-03 [1] CRAN (R 4.5.0)
 Rcpp           1.1.0      2025-07-02 [1] CRAN (R 4.5.0)
 readr        * 2.1.5      2024-01-10 [1] CRAN (R 4.5.0)
 readxl       * 1.4.5      2025-03-07 [1] CRAN (R 4.5.0)
 recipes      * 1.3.1      2025-05-21 [1] CRAN (R 4.5.0)
 rlang          1.1.7      2026-01-09 [1] CRAN (R 4.5.2)
 rmarkdown      2.30       2025-09-28 [1] CRAN (R 4.5.0)
 rpart          4.1.24     2025-01-07 [1] CRAN (R 4.5.2)
 rsample      * 1.3.1      2025-07-29 [1] CRAN (R 4.5.0)
 rstatix        0.7.3      2025-10-18 [1] CRAN (R 4.5.0)
 rstudioapi     0.17.1     2024-10-22 [1] CRAN (R 4.5.0)
 S7             0.2.1      2025-11-14 [1] CRAN (R 4.5.2)
 scales       * 1.4.0      2025-04-24 [1] CRAN (R 4.5.0)
 sessioninfo  * 1.2.3      2025-02-05 [1] CRAN (R 4.5.0)
 sparsevctrs    0.3.4      2025-05-25 [1] CRAN (R 4.5.0)
 stringi        1.8.7      2025-03-27 [1] CRAN (R 4.5.0)
 stringr      * 1.6.0      2025-11-04 [1] CRAN (R 4.5.0)
 survival     * 3.8-6      2026-01-16 [1] CRAN (R 4.5.2)
 survminer    * 0.5.1      2025-09-02 [1] CRAN (R 4.5.0)
 survMisc       0.5.6      2022-04-07 [1] CRAN (R 4.5.0)
 svglite        2.2.2      2025-10-21 [1] CRAN (R 4.5.0)
 systemfonts    1.3.1      2025-10-01 [1] CRAN (R 4.5.0)
 tailor       * 0.1.0      2025-08-25 [1] CRAN (R 4.5.0)
 textshaping    1.0.4      2025-10-10 [1] CRAN (R 4.5.0)
 tibble       * 3.3.0      2025-06-08 [1] CRAN (R 4.5.0)
 tidymodels   * 1.4.1      2025-09-08 [1] CRAN (R 4.5.0)
 tidyr        * 1.3.1      2024-01-24 [1] CRAN (R 4.5.0)
 tidyselect     1.2.1      2024-03-11 [1] CRAN (R 4.5.0)
 tidyverse    * 2.0.0      2023-02-22 [1] CRAN (R 4.5.0)
 timechange     0.3.0      2024-01-18 [1] CRAN (R 4.5.0)
 timeDate       4051.111   2025-10-17 [1] CRAN (R 4.5.0)
 tune         * 2.0.1      2025-10-17 [1] CRAN (R 4.5.0)
 tzdb           0.5.0      2025-03-15 [1] CRAN (R 4.5.0)
 vctrs          0.7.1      2026-01-23 [1] CRAN (R 4.5.2)
 vip          * 0.4.5      2025-12-12 [1] CRAN (R 4.5.2)
 viridisLite    0.4.3      2026-02-04 [1] CRAN (R 4.5.2)
 withr          3.0.2      2024-10-28 [1] CRAN (R 4.5.0)
 workflows    * 1.3.0      2025-08-27 [1] CRAN (R 4.5.0)
 workflowsets * 1.1.1      2025-05-27 [1] CRAN (R 4.5.0)
 xfun           0.54       2025-10-30 [1] CRAN (R 4.5.0)
 xml2           1.4.1      2025-10-27 [1] CRAN (R 4.5.0)
 xtable         1.8-4      2019-04-21 [1] CRAN (R 4.5.0)
 yaml           2.3.10     2024-07-26 [1] CRAN (R 4.5.0)
 yardstick    * 1.3.2      2025-01-22 [1] CRAN (R 4.5.0)
 zoo            1.8-14     2025-04-10 [1] CRAN (R 4.5.0)

 [1] /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library
 * ── Packages attached to the search path.

──────────────────────────────────────────────────────────────────────────────

Rendered with Quarto · Packages: kableExtra patchwork plotly readxl sessioninfo survival survminer tidyverse tidymodels vip