Introduction
In today’s threat landscape, organizations face a deluge of Common Vulnerabilities and Exposures (CVEs) published daily. For executive risk and operations teams, the prevailing “patch everything” mentality is no longer viable: it depletes critical IT resources, induces alert fatigue, and fails to meaningfully reduce material business risk. Despite the volume of reported vulnerabilities, existing data indicates that only a small fraction are ever successfully exploited by threat actors in the wild.
This analysis bridges the gap between raw vulnerability data and actionable risk intelligence by applying machine learning, specifically a Random Forest classification model. By synthesizing thousands of data points from standard government and industry sources, the model predicts the likelihood that a vulnerability will be exploited. This predictive capability empowers security operations to shift from a reactive, compliance-driven posture to a proactive, risk-based prioritization strategy, ensuring that resources are allocated to the threats that pose the greatest danger to business continuity.
While thousands of Common Vulnerabilities and Exposures (CVEs) are published annually, only a fraction are actively exploited in the wild. This analysis uses machine learning (a Random Forest classifier) to predict which vulnerabilities pose a genuine threat, allowing security teams to shift from a “patch everything” mentality to a risk-based prioritization model.
library(ggplot2)
library(gt)
library(httr2)
library(janitor)
library(jsonlite)
library(kableExtra)
library(lubridate)
library(plotly)
library(skimr)
library(themis)
library(tidymodels)
library(tidyverse)
library(vip)
readLiveData <- FALSE # If TRUE, read EPSS, KEV & NVD data live via API; if FALSE, read pre-loaded data from the "data" folder
Data Acquisition
The foundation of any strong predictive risk model is the quality, timeliness, and diversity of its underlying data. In this phase, we programmatically ingest and aggregate threat intelligence from three premier, real-time cybersecurity sources: NIST’s National Vulnerability Database (NVD) for core vulnerability characteristics, CISA’s Known Exploited Vulnerabilities (KEV) catalog for confirmed exploitation data, and the Exploit Prediction Scoring System (EPSS) for probabilistic threat assessments. Joining these sources into a single dataset provides a comprehensive, multi-dimensional view of the threat landscape, allowing the model to detect complex patterns that a human analyst might miss.
1. Download CISA Known Exploited Vulnerabilities (KEV) Catalog data
The CISA Known Exploited Vulnerabilities (KEV) Catalog is a dynamic list maintained by the U.S. Cybersecurity and Infrastructure Security Agency. It aggregates CVEs confirmed to be actively exploited in the wild, shifting focus from theoretical risk to real-world threats. This authoritative resource helps organizations prioritize patching effectively. While mandatory for U.S. federal agencies under Binding Operational Directive 22-01, the catalog is an essential tool for any organization seeking to reduce its attack surface against active adversaries. For more information, see the CISA Known Exploited Vulnerabilities Catalog.
if (readLiveData == TRUE) {
kev_url <- "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
kev_raw <- fromJSON(kev_url)
kev_data <- kev_raw$vulnerabilities |>
clean_names() |>
select(cve_id, date_added, due_date, known_ransomware_campaign_use) |>
mutate(is_exploited = TRUE)
} else {kev_data <- read_csv("data/kev_data.csv")
}
tbl_data <- kev_data |>
slice(1:6) |>
rename("CVE ID" = cve_id,
"Date Added" = date_added,
"Due Date" = due_date,
"Known Ransomware Campaign Use" = known_ransomware_campaign_use,
"Is Exploited" = is_exploited)
kable(tbl_data,
caption = "CISA Known Exploited Vulnerabilities (KEV) Catalog: First Six Rows",
format = "html") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 7, full_width = FALSE)
CISA Known Exploited Vulnerabilities (KEV) Catalog: First Six Rows
| CVE ID | Date Added | Due Date | Known Ransomware Campaign Use | Is Exploited |
|---|---|---|---|---|
| CVE-2018-14634 | 2026-01-26 | 2026-02-16 | Unknown | TRUE |
| CVE-2025-52691 | 2026-01-26 | 2026-02-16 | Unknown | TRUE |
| CVE-2026-23760 | 2026-01-26 | 2026-02-16 | Unknown | TRUE |
| CVE-2026-24061 | 2026-01-26 | 2026-02-16 | Unknown | TRUE |
| CVE-2026-21509 | 2026-01-26 | 2026-02-16 | Unknown | TRUE |
| CVE-2024-37079 | 2026-01-23 | 2026-02-13 | Unknown | TRUE |
2. Download NIST National Vulnerability Database (NVD)
The National Vulnerability Database (NVD) is the U.S. government’s central repository for standards-based vulnerability management data, maintained by the National Institute of Standards and Technology (NIST). It enriches the MITRE CVE list with detailed analysis, including CVSS severity scores and affected product configurations, enabling automation in vulnerability management. While the CISA KEV catalog identifies only those threats with confirmed active exploitation, the NVD functions as an exhaustive encyclopedia containing every reported software vulnerability, regardless of its immediate real-world threat status. For more information on the NIST National Vulnerabilities Database: NIST National Vulnerability Database
### 1.2 FETCH and WRANGLE NVD VULNERABILITY DATA (PAST 360 DAYS via PAGINATION)
if(readLiveData == TRUE) {
date_intervals <- tibble(
start = c(Sys.Date() - 360, Sys.Date() - 270, Sys.Date() - 180, Sys.Date() - 90),
end = c(Sys.Date() - 271, Sys.Date() - 181, Sys.Date() - 91, Sys.Date())
)
fetch_nvd_chunk <- function(start_date, end_date) {
nist_start <- paste0(start_date, "T00:00:00.000")
nist_end <- paste0(end_date, "T23:59:59.000")
req <- request("https://services.nvd.nist.gov/rest/json/cves/2.0") |>
req_url_query(pubStartDate = nist_start, pubEndDate = nist_end) |>
req_headers(apiKey = Sys.getenv("NIST_API_KEY")) |>
req_retry(max_tries = 3) |>
req_throttle(rate = 50 / 30)
resp <- req_perform(req)
resp_body_json(resp)
}
nvd_raw_list <- map2(date_intervals$start, date_intervals$end, fetch_nvd_chunk)
extract_nvd_features <- function(item) {
# Some CVEs in NVD don't have CVSS v3.1 metrics yet; skip them safely
if (is.null(item$cve$metrics$cvssMetricV31)) return(NULL)
metrics <- item$cve$metrics$cvssMetricV31[[1]]$cvssData
tibble(
cve_id = item$cve$id,
published_date = as.Date(item$cve$published),
base_score = metrics$baseScore,
base_severity = metrics$baseSeverity,
attack_vector = metrics$attackVector,
attack_complexity = metrics$attackComplexity,
privileges_required = metrics$privilegesRequired,
user_interaction = metrics$userInteraction
)
}
# Apply the function to the list of chunks
nvd_flat <- map_df(nvd_raw_list, ~ map_df(.x$vulnerabilities, extract_nvd_features))
} else {nvd_flat <- read_csv("data/nvd_flat.csv")
}
tbl_data <- nvd_flat |>
slice(1:6) |>
rename("CVE ID" = cve_id,
"Published Date" = published_date,
"Base Score" = base_score,
"Base Severity" = base_severity,
"Attack Vector" = attack_vector,
"Attack Complexity" = attack_complexity,
"Privileges Required" = privileges_required,
"User Interaction" = user_interaction)
kable(tbl_data,
caption = "NIST National Vulnerability Database (NVD): First Six Rows",
format = "html") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
NIST National Vulnerability Database (NVD): First Six Rows
| CVE ID | Published Date | Base Score | Base Severity | Attack Vector | Attack Complexity | Privileges Required | User Interaction |
|---|---|---|---|---|---|---|---|
| CVE-2024-11780 | 2025-02-01 | 6.4 | MEDIUM | NETWORK | LOW | LOW | NONE |
| CVE-2024-12171 | 2025-02-01 | 8.8 | HIGH | NETWORK | LOW | LOW | NONE |
| CVE-2024-12184 | 2025-02-01 | 5.3 | MEDIUM | NETWORK | LOW | NONE | NONE |
| CVE-2024-12620 | 2025-02-01 | 5.3 | MEDIUM | NETWORK | LOW | NONE | NONE |
| CVE-2024-13343 | 2025-02-01 | 8.8 | HIGH | NETWORK | LOW | LOW | NONE |
| CVE-2024-13547 | 2025-02-01 | 6.4 | MEDIUM | NETWORK | LOW | LOW | NONE |
3. Download Exploit Prediction Scoring System (EPSS) data
The Exploit Prediction Scoring System (EPSS) is a data-driven effort to estimate the likelihood (probability) that a software vulnerability will be exploited in the wild. While other industry standards are useful for capturing the innate characteristics of a vulnerability and provide measures of severity, they are limited in their ability to assess threat. EPSS fills that gap by combining current threat information about each CVE with real-world exploit data. The EPSS model produces a probability score between 0 and 1 (0% to 100%); the higher the score, the greater the probability that a vulnerability will be exploited. For more information, see the EPSS Scoring System.
if (readLiveData == TRUE) {
epss_url <- paste0("https://epss.empiricalsecurity.com/epss_scores-", Sys.Date(), ".csv.gz")
epss_data <- read_csv(epss_url, comment = "#", show_col_types = FALSE) |>
clean_names() |>
rename(epss_score = epss)
} else {epss_data <- read_csv("data/epss_data.csv")
}
tbl_data <- epss_data |>
slice(1:6) |>
rename("CVE ID" = cve,
"EPSS Score" = epss_score,
"Percentile" = percentile)
kable(tbl_data,
caption = "EPSS Scores: First Six Rows",
format = "html") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 7, full_width = FALSE)
EPSS Scores: First Six Rows
| CVE ID | EPSS Score | Percentile |
|---|---|---|
| CVE-1999-0001 | 0.01151 | 0.78081 |
| CVE-1999-0002 | 0.09123 | 0.92445 |
| CVE-1999-0003 | 0.89352 | 0.99527 |
| CVE-1999-0004 | 0.03037 | 0.86289 |
| CVE-1999-0005 | 0.13652 | 0.94053 |
| CVE-1999-0006 | 0.08244 | 0.91994 |
4. Create the Comprehensive Dataset from the Three Disparate Datasets
ml_dataset <- nvd_flat |>
left_join(epss_data, by = c("cve_id" = "cve")) |>
left_join(kev_data, by = "cve_id") |>
mutate(
is_exploited = replace_na(is_exploited, FALSE),
days_since_pub = as.numeric(Sys.Date() - published_date)
)
tbl_data <- ml_dataset |>
slice(1:6) |>
rename ("CVE ID" = cve_id,
"Published Date" = published_date,
"Base Score" = base_score,
"Base Severity" = base_severity,
"Attack Vector" = attack_vector,
"Attack Complexity" = attack_complexity,
"Privileges Required" = privileges_required,
"User Interaction" = user_interaction,
"EPSS Score" = epss_score,
"Percentile" = percentile,
"Date Added" = date_added,
"Due Date" = due_date,
"Known Ransomware Campaign Use" = known_ransomware_campaign_use,
"Is Exploited" = is_exploited,
"Days Since Published" = days_since_pub)
kable(tbl_data,
caption = "Comprehensive Dataset: First Six Rows",
format = "html") |>
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = TRUE)
Comprehensive Dataset: First Six Rows
| CVE ID | Published Date | Base Score | Base Severity | Attack Vector | Attack Complexity | Privileges Required | User Interaction | EPSS Score | Percentile | Date Added | Due Date | Known Ransomware Campaign Use | Is Exploited | Days Since Published |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CVE-2024-11780 | 2025-02-01 | 6.4 | MEDIUM | NETWORK | LOW | LOW | NONE | 0.00077 | 0.23040 | NA | NA | NA | FALSE | 453 |
| CVE-2024-12171 | 2025-02-01 | 8.8 | HIGH | NETWORK | LOW | LOW | NONE | 0.00208 | 0.43092 | NA | NA | NA | FALSE | 453 |
| CVE-2024-12184 | 2025-02-01 | 5.3 | MEDIUM | NETWORK | LOW | NONE | NONE | 0.00328 | 0.55221 | NA | NA | NA | FALSE | 453 |
| CVE-2024-12620 | 2025-02-01 | 5.3 | MEDIUM | NETWORK | LOW | NONE | NONE | 0.00379 | 0.58824 | NA | NA | NA | FALSE | 453 |
| CVE-2024-13343 | 2025-02-01 | 8.8 | HIGH | NETWORK | LOW | LOW | NONE | 0.00176 | 0.39195 | NA | NA | NA | FALSE | 453 |
| CVE-2024-13547 | 2025-02-01 | 6.4 | MEDIUM | NETWORK | LOW | LOW | NONE | 0.00077 | 0.23040 | NA | NA | NA | FALSE | 453 |
Feature Engineering & Exploratory Data Analysis
Raw threat data is rarely ready for advanced modeling. In Phase 2, we perform Feature Engineering and Exploratory Data Analysis (EDA) to transform disparate metrics into high-quality predictive signals. We clean inconsistencies and impute missing values to ensure integrity. Most importantly, we visualize the critical “class imbalance”—the reality that while thousands of vulnerabilities exist, only a tiny fraction are actually exploited or weaponized. Understanding this disparity is vital for tuning the model to detect rare, high-impact threats without generating excessive false alarms.
# 1. FEATURE ENGINEERING & DATA TYPING
cve_features <- ml_dataset |>
# Filter out any malformed data (e.g., CVEs without base scores)
filter(!is.na(base_score)) |>
mutate(
# Ensure dates are properly formatted
published_date = as.Date(published_date),
# Feature 1: Time Decay (older CVEs may be less likely to be newly exploited)
days_since_pub = as.numeric(Sys.Date() - published_date),
# Feature 2: Is the EPSS score missing? (If so, impute with 0 or mean)
epss_score = replace_na(epss_score, 0),
# Convert character strings into categorical Factors for Machine Learning
base_severity = factor(base_severity, levels = c("LOW", "MEDIUM", "HIGH", "CRITICAL")),
attack_vector = as.factor(attack_vector),
attack_complexity = as.factor(attack_complexity),
privileges_required = as.factor(privileges_required),
user_interaction = as.factor(user_interaction),
# Ensure Target Variable is a factor for classification
is_exploited = as.factor(is_exploited)
) |>
# Drop redundant or non-predictive columns for the model
select(-cve_id, -date_added, -due_date, -known_ransomware_campaign_use)
The table below functions as a comprehensive “health check” for our dataset before advanced modeling begins. It provides a transparent inventory of every variable, categorizing them by type (e.g., numeric scores, logical indicators, or categories) and calculating key statistics like averages and distributions.
Crucially, this summary highlights missing values (n_missing) and the completion rate, allowing us to verify that our prior data cleaning processes successfully resolved any gaps or errors. For stakeholders, this step validates the integrity of the raw materials used in our analysis. Just as a financial audit ensures accurate accounting, this summary confirms our risk model is built upon a foundation of complete, high-quality intelligence.
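As a quick sanity check on those missing-value counts, the same n_missing figures can be reproduced directly with dplyr. The sketch below uses a small hypothetical data frame in place of the full cve_features table (names and values are illustrative only):

```r
library(dplyr)

# Hypothetical stand-in for cve_features (illustration only)
toy <- tibble(
  base_score = c(7.5, NA, 9.8),
  epss_score = c(0.10, 0.20, NA)
)

# Count NAs per column -- mirrors skimr's n_missing statistic
toy |> summarise(across(everything(), ~ sum(is.na(.x))))
```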
# 2. EXPLORATORY DATA ANALYSIS (EDA) & CLASS IMBALANCE CHECK
# In Quarto, 'skimr' creates a readable summary table of the data.
skim(cve_features)
Data summary

| Name | cve_features |
|---|---|
| Number of rows | 7150 |
| Number of columns | 11 |
| Column type frequency: | |
| Date | 1 |
| factor | 6 |
| numeric | 4 |
| Group variables | None |

Variable type: Date

| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| published_date | 0 | 1 | 2025-02-01 | 2025-11-12 | 2025-05-15 | 63 |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| base_severity | 1 | 1 | FALSE | 4 | MED: 3493, HIG: 2666, CRI: 628, LOW: 362 |
| attack_vector | 0 | 1 | FALSE | 4 | NET: 5227, LOC: 1641, ADJ: 208, PHY: 74 |
| attack_complexity | 0 | 1 | FALSE | 2 | LOW: 6318, HIG: 832 |
| privileges_required | 0 | 1 | FALSE | 3 | NON: 3602, LOW: 2760, HIG: 788 |
| user_interaction | 0 | 1 | FALSE | 2 | NON: 4930, REQ: 2220 |
| is_exploited | 0 | 1 | FALSE | 2 | FAL: 7113, TRU: 37 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| base_score | 0 | 1 | 6.66 | 1.69 | 0 | 5.40 | 6.50 | 7.80 | 10.00 | ▁▁▅▇▃ |
| epss_score | 0 | 1 | 0.01 | 0.06 | 0 | 0.00 | 0.00 | 0.00 | 0.94 | ▇▁▁▁▁ |
| percentile | 0 | 1 | 0.24 | 0.22 | 0 | 0.08 | 0.17 | 0.31 | 1.00 | ▇▃▁▁▁ |
| days_since_pub | 0 | 1 | 311.85 | 100.03 | 169 | 258.00 | 350.00 | 436.00 | 453.00 | ▇▇▁▇▇ |
The bar chart below visually demonstrates the core challenge in vulnerability management: while thousands of software flaws exist, only a tiny fraction are ever weaponized by attackers. The massive disparity between the tall “Non-Exploited” bar and the small “Exploited” sliver shows that a “patch everything” strategy is inefficient. This data validates our need for a targeted AI model to pinpoint the few critical threats hiding within the noise. Note: the y-axis is log-scaled so that the viewer can better judge the counts.
# 3. VISUALIZING THE RISK IMBALANCE
# It is critical to understand that most vulnerabilities are NOT exploited.
# This visualization proves the need for our ML model.
eda_plot <- ggplot(cve_features, aes(x = is_exploited, fill = is_exploited)) +
geom_bar(alpha = 0.8) +
scale_fill_manual(values = c("#2c3e50", "#e74c3c")) + # Professional color palette
geom_text(
stat = "count",
aes(label = scales::comma(after_stat(count))),
vjust = 2.0,
size = 3.5,
color = "white", # Change font color to white
) +
scale_y_continuous(trans='log10', labels = scales::comma) +
labs(
title = "Class Imbalance: Exploited vs. Non-Exploited Vulnerabilities",
subtitle = "The vast majority (99.5%) of NVD vulnerabilities are never exploited in the wild.",
x = "Is Exploited (per CISA KEV)",
y = "Count of CVEs (log scale)",
fill = "Exploited?"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
eda_plot
Machine Learning Modeling
In Phase 3, we transition from preparation to predictive modeling. One of the challenges in cybersecurity risk analysis is the extreme rarity of actual exploitation. To overcome this, we implement the Synthetic Minority Oversampling Technique (SMOTE), which balances the training data so the algorithm is forced to learn the nuanced characteristics of true threats. We then train a Random Forest classifier, a robust ensemble learning method capable of detecting complex, non-linear patterns across our features. This training process ensures the model is not just memorizing data but learning to generalize risk.
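To make the SMOTE step concrete before it appears inside the full recipe, the sketch below applies themis::step_smote to a small, deliberately imbalanced toy dataset (the column names and sizes are illustrative, not from the real pipeline) and shows that the minority class is synthetically oversampled until the two classes match:

```r
library(recipes)
library(themis)

set.seed(42)
# Hypothetical imbalanced toy data: 90 negatives, 10 positives
toy <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  class = factor(c(rep("FALSE", 90), rep("TRUE", 10)))
)

balanced <- recipe(class ~ ., data = toy) |>
  step_smote(class) |>   # synthesize minority rows until classes match
  prep() |>
  bake(new_data = NULL)

table(balanced$class)    # both classes now have 90 rows
```

The default over_ratio of 1 upsamples the minority class to the size of the majority class, which is exactly what the step_smote(is_exploited) line in the recipe below does for the training folds.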
## 1. Data Splitting - Train / Test
set.seed(42) # Set a seed for reproducibility
# Split the 'cve_features' dataframe from Phase 2 (80% train, 20% test)
# Stratify by our target variable to maintain the same ratio of exploited CVEs
cve_split <- initial_split(cve_features, prop = 0.80, strata = is_exploited)
cve_train <- training(cve_split)
cve_test <- testing(cve_split)
# Create 10-fold cross-validation folds for model evaluation
cve_folds <- vfold_cv(cve_train, v = 10, strata = is_exploited)
print(paste("Training set:", nrow(cve_train), "CVEs. Testing set:", nrow(cve_test), "CVEs."))
#> [1] "Training set: 5720 CVEs. Testing set: 1430 CVEs."
# 2. MODEL SPECIFICATION (Random Forest)
rf_spec <- rand_forest(trees = 500) |>
set_engine("ranger", importance = "impurity") |>
set_mode("classification")
# 3. RECIPE DEFINITION (With Date Removal)
cve_recipe <- recipe(is_exploited ~ ., data = cve_train) |>
# Remove the Date column since we already have 'days_since_pub'
step_rm(published_date) |>
# Treat NAs in base_severity as a new category called "unknown"
step_unknown(base_severity) |>
# Handle any new factor levels that might appear in future data
step_novel(all_nominal_predictors()) |>
# One-Hot Encoding: Convert categorical factors into dummy variables
step_dummy(all_nominal_predictors()) |>
# Remove any variables that have zero variance (no predictive value)
step_zv(all_predictors()) |>
# Normalize numeric features (epss_score, base_score, days_since_pub)
step_normalize(all_numeric_predictors()) |>
# Address class imbalance by oversampling the exploited class
step_smote(is_exploited)
# 4. WORKFLOW & TRAINING
# Combine model and recipe into a single workflow
cve_workflow <- workflow() |>
add_model(rf_spec) |>
add_recipe(cve_recipe)
# Train and evaluate the model using K-fold cross-validation
rf_resamples <- fit_resamples(
cve_workflow,
resamples = cve_folds,
control = control_resamples(save_pred = TRUE)
)
The table below serves as the model’s internal “report card” generated during the training phase. Rather than relying on a single test, these results represent the average performance across ten separate simulations to ensure reliability. Key metrics include Accuracy, which measures the percentage of correct predictions, and ROC_AUC, which scores the model’s ability to clearly distinguish between harmless and dangerous vulnerabilities. High values here confirm the model is consistent, robust, and ready for real-world application.
# Show the performance metrics
collect_metrics(rf_resamples)
| Metric | Estimator | Mean | n | Std Error | Configuration |
|---|---|---|---|---|---|
| accuracy | binary | 0.9949301 | 10 | 0.0008819 | pre0_mod0_post0 |
| brier_class | binary | 0.0046705 | 10 | 0.0005472 | pre0_mod0_post0 |
| roc_auc | binary | 0.9909070 | 10 | 0.0027999 | pre0_mod0_post0 |
Evaluation & Business Insights
In this final phase, we transition from theoretical training to real-world validation. We test the model against a “holdout” dataset—vulnerabilities the system has never seen before—to prove its reliability in a live environment. Beyond abstract accuracy scores, we visualize the critical trade-off between “false alarms” (which waste resources) and “missed threats” (which introduce risk). This evaluation confirms the model is not only statistically robust but also operationally transparent and ready for deployment.
The table below represents the model’s “final exam” results, tested against a holdout dataset of vulnerabilities it had never encountered during training. The two critical metrics here are Accuracy and ROC_AUC. Accuracy measures the raw percentage of correct predictions, while ROC_AUC serves as a reliability score, indicating how well the model separates legitimate threats from false alarms. The high percentages for both confirm that the model hasn’t just memorized historical data but has learned to accurately forecast risk in a live, dynamic environment.
# 1. FINAL FIT (Evaluate on the Test Set)
# print("Fitting final model and predicting on the holdout test set...")
# 'last_fit' fits on the training data and evaluates on the test data defined in 'cve_split'
final_fit <- last_fit(cve_workflow, split = cve_split)
# View standard performance metrics (Accuracy, ROC_AUC) on the test data
final_metrics <- collect_metrics(final_fit)
tbl_data <- final_metrics |>
rename(Metric = .metric,
Estimator = .estimator,
estimate = .estimate,
Configuration = .config)
tbl_data$Estimate = paste0(round(tbl_data$estimate, 4) * 100, "%")
tbl_data <- tbl_data |>
select(Metric, Estimator, Estimate, Configuration)
kable(tbl_data,
caption = "Final Metrics",
format = "html") |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 7, full_width = FALSE)
Final Metrics
| Metric | Estimator | Estimate | Configuration |
|---|---|---|---|
| accuracy | binary | 99.44% | pre0_mod0_post0 |
| roc_auc | binary | 99.32% | pre0_mod0_post0 |
| brier_class | binary | 0.44% | pre0_mod0_post0 |
The heatmap below visualizes the operational reality of deploying the model. It compares our predictions against actual outcomes to reveal the cost of errors. The critical area for risk management is the False Negatives (missed threats), which represent exploited vulnerabilities that slipped through the cracks. Conversely, False Positives represent “false alarms” that waste remediation resources. This view helps leadership decide if the model is calibrated correctly to balance safety against efficiency.
# 2. THE CONFUSION MATRIX
# Extract the predictions
test_predictions <- collect_predictions(final_fit)
# Generate and plot the Confusion Matrix
conf_matrix_plot <- test_predictions |>
conf_mat(truth = is_exploited, estimate = .pred_class) |>
autoplot(type = "heatmap") +
labs(
title = "Confusion Matrix: Test Set Results",
subtitle = "Assessing False Positives (Wasted Effort) vs. False Negatives (Missed Threats)"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
# Render Confusion Matrix
print(conf_matrix_plot)
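The same predictions behind the heatmap can be reduced to the two numbers this trade-off turns on: sensitivity (the share of true threats caught) and specificity (the share of harmless CVEs correctly cleared). Below is a minimal sketch using a hypothetical toy prediction frame in place of test_predictions, just to illustrate the yardstick calls:

```r
library(dplyr)
library(yardstick)

# Hypothetical predictions (illustration only; the analysis uses test_predictions)
toy_preds <- tibble(
  is_exploited = factor(c("TRUE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE"),
                        levels = c("FALSE", "TRUE")),
  .pred_class  = factor(c("TRUE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE"),
                        levels = c("FALSE", "TRUE"))
)

toy_preds |>
  conf_mat(truth = is_exploited, estimate = .pred_class) |>
  summary(event_level = "second") |>   # treat TRUE (exploited) as the event
  filter(.metric %in% c("sens", "spec"))
# sens = 0.50 (one missed threat), spec = 0.75 (one false alarm)
```

Note event_level = "second": with factor levels ordered FALSE, TRUE, yardstick treats the first level as the event by default, so the exploited class must be flagged explicitly.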
The Receiver Operating Characteristic (ROC) curve illustrates the model’s overall predictive power across different thresholds. It visualizes the trade-off between “catching true threats” (Sensitivity) and “avoiding false alarms” (Specificity). A curve that hugs the top-left corner indicates a superior model that successfully separates dangerous vulnerabilities from harmless ones. The Area Under the Curve (AUC) serves as a single quality score—the closer to 1.0, the more reliable our strategic risk predictions are.
# 3. RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE
# Visualizes the trade-off between sensitivity and specificity
roc_plot <- test_predictions |>
roc_curve(is_exploited, .pred_TRUE, event_level = "second") |> # TRUE (second factor level) is the event of interest
autoplot() +
labs(
title = "ROC Curve: Predictive Performance",
subtitle = "A curve closer to the top-left indicates superior classification capability."
) +
theme_minimal()
# Render ROC Curve
print(roc_plot)
The chart below offers a transparent look inside the model’s decision-making process. It ranks the top specific features—such as EPSS scores, base severity ratings, or attack vectors—that the algorithm found most valuable when predicting exploitation. What’s most interesting is that the EPSS score and Percentile are, by far, the leading drivers of exploitation risk, whereas the actual risk score / severity grades (Critical, High, Medium & Low) don’t seem to have much effect on predicting exploitation risk within the current model.
For leadership, this visualization is critical because it moves beyond a simple “risk score” to explain why a vulnerability is flagged. By identifying these primary risk drivers, security teams can understand the root causes of threats and tailor their defense strategies to focus on the specific characteristics that matter most in the wild.
# 4. VARIABLE IMPORTANCE (VIP)
#| label: phase-4-vip
# Extract the fitted model from the workflow
final_tree <- extract_fit_parsnip(final_fit)
# 5. Extract the raw importance scores into a dataframe
importance_scores <- vi(final_tree) |>
slice_max(Importance, n = 10) |>
mutate(Variable01 = c("EPSS Score",
"Percentile",
"Severity: MEDIUM",
"Base Score",
"User Interaction: Required",
"Severity: CRITICAL",
"Days Since Published",
"Privileges Required: NONE",
"Severity: HIGH",
"Severity: LOW")) |>
select(Variable01, Importance) |>
rename(Variable = Variable01)
# 6. Plot top 10 features using native ggplot
ggplot(importance_scores |> slice_max(Importance, n = 10),
aes(x = Importance, y = reorder(Variable, Importance))) +
# Create the bars with the professional dark blue color
geom_col(fill = "#2c3e50", alpha = 0.9) +
# Add white text labels inside the bars
geom_text(
aes(label = round(Importance, 2)),
hjust = -0.25, # Place labels just past the right edge of each bar
color = "black", # Black label text
size = 3
) +
# Formatting and Labels
labs(
title = "Drivers of Exploitation Risk",
subtitle = "Top 10 features utilized by the model to predict exploitation",
y = "Vulnerability Feature",
x = "Importance (Impurity Reduction)"
) +
xlim(0, 2500) +
theme_minimal() +
theme(panel.grid.major.y = element_blank() # Remove distracting horizontal grid lines
)
Insights & Conclusion
The analysis presented here helps move vulnerability management from a volume-based compliance task to a precision-based risk operation. By integrating historical exploitation data with predictive modeling, we have demonstrated that it is possible to identify the small percentage of vulnerabilities that pose a genuine threat to the organization. This approach does not merely add another tool to the security stack; it helps change how resources are allocated. Instead of diluting engineering efforts across thousands of theoretical risks, teams can focus their remediation cycles on the verified threats identified by this model.
While traditional scoring methods often overestimate risk, this model utilizes real-world signals—such as active exploitation and attack complexity—to refine those assessments. This distinction is critical for leadership. It means that “high severity” no longer automatically equals “high priority.” By adopting this data-driven framework, the organization can reduce alert fatigue and operational friction. Ultimately, this ensures that limited security budgets and engineering hours are better directed toward the vulnerabilities that have the most potential to disrupt business operations, providing a clearer return on investment for the cybersecurity program.