Auditing a Pre-Trained Hugging Face Financial Risk Model

A Practical Introduction to Model Governance Using R packages hfhub & tok

Author

Patrick Lefler

Published

May 11, 2026

Abstract
Artificial intelligence governance frameworks including the Federal Reserve’s SR 11-7, NYDFS model risk guidance, and the NIST AI Risk Management Framework require institutions to document the intended use, training data provenance, known limitations, and bias considerations of any model used in consequential decisions. For open-source models sourced from public repositories, the model card is often the sole documentation available. Yet few practitioners have the tools to interrogate it programmatically.

This project introduces two R packages — hfhub and tok — and applies them to a structured metadata audit of FinBERT (ProsusAI/finbert), a BERT-based sentiment classifier widely deployed in fintech AI pipelines. Using hub_download() and tok::tokenizer, the analysis retrieves and parses model architecture parameters, tokenizer configuration, vocabulary composition, and context window constraints — without inference code, a GPU, or a Python dependency.

The result is a governance audit card aligned to SR 11-7 documentation expectations. The project is designed for risk practitioners with intermediate R proficiency who need a reproducible, board-ready template for third-party model due diligence.

Introduction

Every organization using a third-party AI model in risk or compliance faces a key question: what exactly did we approve? The model card is usually the main answer. It is the documentation published with open-source models on sites like Hugging Face. The card details training data, intended use, known limitations, and output schema. For organizations under SR 11-7, NYDFS model risk guidance, or the NIST AI Risk Management Framework, this documentation is essential. It serves as the baseline for independent validation.

The issue is that most risk practitioners treat model cards as static web pages, reviewed manually and inconsistently. Two R packages, hfhub and tok, change this: they let practitioners pull model artifacts programmatically, parse configuration files, and produce a structured audit record that can be reproduced, versioned, and embedded directly into governance documents.

This project illustrates the complete workflow using FinBERT (ProsusAI/finbert) as a worked example. FinBERT is a BERT-based sentiment classifier fine-tuned on financial news, one of the most cited open-source models in fintech AI, and is used across the industry for credit risk, market surveillance, and regulatory text analysis. Its wide adoption makes it a strong teaching case: the model is well known, its limitations are well documented, and the governance questions it raises generalize to any BERT-family model a risk team might encounter.

This project focuses on a metadata audit. It looks at:

  • Architecture parameters

  • Tokenizer configuration

  • Vocabulary composition

  • Context window constraints

No inference is performed, no GPU is needed, and the Python ecosystem is not required. The entire workflow runs in R.

Note: Understanding Hugging Face and How It Differs from Gemini or Claude

Hugging Face is an open-source model repository and machine learning platform: not a single AI system, but distribution infrastructure. Gemini (Google) and Claude (Anthropic) are different in kind; each is a proprietary model accessible only through its vendor's API. Hugging Face, by contrast, hosts thousands of models from universities, companies, and independent researchers, available for direct download, inspection, and often modification.

This difference matters for governance. With a commercial API, the model is a black box: inputs go in and outputs come out, but the architecture, training data, and configuration remain hidden. For a Hugging Face model, the configuration files, tokenizer, and often the full model weights are publicly available. That transparency is what makes structured metadata audits, and R packages like hfhub and tok, possible.

The practical differences extend to deployment. Gemini and Claude provide inference-as-a-service: the model runs on the vendor's infrastructure and users pay per token. Hugging Face models can instead be downloaded and run on an organization's own infrastructure, or, as this project shows, evaluated without running them at all. The distinction affects cost, latency, and data residency for financial institutions, and it shifts compliance responsibility. A firm calling the Claude API operates under Anthropic's acceptable use policy; a firm running FinBERT internally assumes the full model risk management burden under SR 11-7, with no vendor behind it. The governance requirements differ accordingly.

For risk practitioners, Hugging Face is valuable in three ways. First, as an audit target: documenting a model already in use within the organization. Second, as a benchmarking resource: the Hub hosts domain-specific models for credit risk, regulatory text classification, and financial entity recognition that can be evaluated against internal data before any procurement or build decision. Third, as a research signal: model cards, citation counts, and community discussions on the Hub show which modeling approaches are gaining traction in finance, often 12 to 18 months before they surface in vendor product announcements. A risk team that monitors the Hub regularly is better positioned to anticipate the governance questions that follow.

Background: The R Packages

hfhub — Fetching from Hugging Face Hub

The hfhub package exposes a single core function: hub_download(). Given a repository identifier and a file path, it retrieves the specified artifact from Hugging Face Hub and stores it in a local cache. The caching layout mirrors the Python huggingface_hub library, which means cached files can be shared across R and Python environments — a practical consideration for teams operating mixed-language pipelines.

# Fetch the model configuration file
path <- hfhub::hub_download("ProsusAI/finbert", "config.json")

The function returns the local file path, which can then be passed to any standard R parsing tool. In this project, that means jsonlite::fromJSON() for the JSON configuration files and base::readLines() for the plain-text vocabulary.

tok — Tokenization in R

The tok package provides R bindings for the Hugging Face tokenizers Rust library. Its primary function is to convert raw text into the integer token sequences that transformer models consume. For governance purposes, the more relevant capability is inspection: tok exposes the tokenizer configuration directly, allowing a practitioner to verify vocabulary size, maximum sequence length, special token definitions, and case normalization behavior — all parameters that govern what the model can and cannot process.

# Load a tokenizer from the Hub. Note: from_pretrained() requires the fast
# tokenizer.json format; FinBERT ships only the legacy BertTokenizer format,
# so a fast-format BERT checkpoint is shown here for illustration.
tokenizer <- tok::tokenizer$from_pretrained("bert-base-uncased")

Neither package requires model inference. Both operate entirely on the metadata and configuration layer — which is precisely the layer that governance review demands.
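As an illustrative sketch of what inspection yields, the snippet below encodes a short financial sentence with a fast-format BERT tokenizer (bert-base-uncased is used because FinBERT publishes only the legacy tokenizer format; the `$encode()` call and its `$tokens`/`$ids` fields follow tok's documented encoding interface):

```r
library(tok)

# bert-base-uncased ships tokenizer.json, so from_pretrained() works directly
tokenizer <- tok::tokenizer$from_pretrained("bert-base-uncased")

# Encoding a sentence exposes exactly what the model will see
enc <- tokenizer$encode("Net charge-offs rose to 2.3%")
enc$tokens  # subword pieces, including the [CLS] and [SEP] control tokens
enc$ids     # the integer ids looked up in the vocabulary
```

Counting `enc$ids` for representative documents is the same measurement the context window stress test later performs by heuristic.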

Downloading Model Artifacts

Three files are retrieved from the FinBERT repository. The config.json file defines model architecture and output labels. The tokenizer_config.json file specifies how input text is processed before it reaches the model. The vocab.txt file contains the complete 30,522-token vocabulary — sampled and analyzed below, not printed in full.

Display code
model_id <- "ProsusAI/finbert"

config_path        <- hub_download(model_id, "config.json")
tokenizer_cfg_path <- hub_download(model_id, "tokenizer_config.json")
vocab_path         <- hub_download(model_id, "vocab.txt")

cat("config.json           :", config_path, "\n")
#> config.json           : /Users/patricklefler/.cache/huggingface/hub/models--ProsusAI--finbert/snapshots/4556d13015211d73dccd3fdd39d39232506f3e43/config.json
Display code
cat("tokenizer_config.json :", tokenizer_cfg_path, "\n")
#> tokenizer_config.json : /Users/patricklefler/.cache/huggingface/hub/models--ProsusAI--finbert/snapshots/4556d13015211d73dccd3fdd39d39232506f3e43/tokenizer_config.json
Display code
cat("vocab.txt             :", vocab_path, "\n")
#> vocab.txt             : /Users/patricklefler/.cache/huggingface/hub/models--ProsusAI--finbert/snapshots/4556d13015211d73dccd3fdd39d39232506f3e43/vocab.txt
Note: Caching and Reproducibility

hub_download() caches files on first retrieval. Subsequent calls return the cached path with no network request. For production governance workflows, artifacts should be pinned to a specific commit hash — not pulled from main — to ensure that audit records remain reproducible when the model card is updated.
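A minimal sketch of pinning, assuming hub_download() accepts a revision argument mirroring the Python huggingface_hub client; the commit hash is the snapshot hash visible in the cached paths above:

```r
# Pin every artifact to one snapshot commit so the audit record stays
# reproducible even if the repository's main branch is later updated
rev <- "4556d13015211d73dccd3fdd39d39232506f3e43"

config_path <- hfhub::hub_download("ProsusAI/finbert", "config.json",
                                   revision = rev)
```

The pinned hash belongs in the audit record itself, alongside the retrieval date.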

Model Architecture

Display code
config        <- fromJSON(config_path)
tokenizer_cfg <- fromJSON(tokenizer_cfg_path)

Architecture Parameters

The configuration file records the structural parameters of the underlying BERT-base architecture. These parameters define the model’s representational capacity and, for governance purposes, its resource footprint in any deployment context.

Display code
arch_params <- tibble(
  Parameter = c(
    "Model Type",
    "Hidden Size",
    "Number of Hidden Layers",
    "Attention Heads",
    "Intermediate (FFN) Size",
    "Max Position Embeddings",
    "Vocabulary Size",
    "Hidden Dropout Probability",
    "Attention Dropout Probability"
  ),
  Value = c(
    config$model_type                    %||% "bert",
    config$hidden_size                   %||% "768",
    config$num_hidden_layers             %||% "12",
    config$num_attention_heads           %||% "12",
    config$intermediate_size             %||% "3072",
    config$max_position_embeddings       %||% "512",
    config$vocab_size                    %||% "30522",
    config$hidden_dropout_prob           %||% "0.1",
    config$attention_probs_dropout_prob  %||% "0.1"
  )
)

arch_params |>
  kable(
    format    = "html",
    caption   = "Table 1: FinBERT Architecture Parameters",
    col.names = c("Parameter", "Value"),
    align     = c("l", "r")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 13
  ) |>
  column_spec(1, bold = TRUE, width = "18em") |>
  column_spec(2, width = "10em")
Table 1: FinBERT Architecture Parameters
Parameter Value
Model Type bert
Hidden Size 768
Number of Hidden Layers 12
Attention Heads 12
Intermediate (FFN) Size 3072
Max Position Embeddings 512
Vocabulary Size 30522
Hidden Dropout Probability 0.1
Attention Dropout Probability 0.1

Classification Labels

FinBERT produces a three-class sentiment output. The label schema below is what the model returns — understanding it is a prerequisite for any downstream risk workflow that consumes model outputs.

Display code
if (!is.null(config$id2label)) {
  label_df <- tibble(
    `Token ID` = names(config$id2label),
    Label      = unlist(config$id2label),
    `Risk Interpretation` = case_when(
      unlist(config$id2label) == "positive" ~
        "Favorable market or credit sentiment signal",
      unlist(config$id2label) == "negative" ~
        "Adverse market or credit sentiment signal",
      unlist(config$id2label) == "neutral"  ~
        "No directional signal; may warrant further manual review",
      TRUE ~ "See model documentation"
    )
  )

  label_df |>
    kable(
      format  = "html",
      caption = "Table 2: FinBERT Output Labels and Risk Interpretation",
      align   = c("c", "l", "l")
    ) |>
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"),
      full_width        = FALSE,
      position          = "left",
      font_size         = 13
    ) |>
    column_spec(1, width = "6em") |>
    column_spec(2, bold = TRUE, width = "8em") |>
    column_spec(3, width = "26em")
}
Table 2: FinBERT Output Labels and Risk Interpretation
Token ID Label Risk Interpretation
0 positive Favorable market or credit sentiment signal
1 negative Adverse market or credit sentiment signal
2 neutral No directional signal; may warrant further manual review

Tokenizer Audit

Configuration Parameters

The tokenizer configuration specifies how raw text is transformed before it enters the model. Each parameter below has a direct governance implication: it constrains what the model can process reliably and how failures manifest when inputs fall outside the expected distribution.

Display code
tok_params <- tibble(
  Parameter = c(
    "Tokenizer Class",
    "Model Max Length",
    "Do Lower Case",
    "Padding Side",
    "Special Token: [CLS]",
    "Special Token: [SEP]",
    "Special Token: [PAD]",
    "Special Token: [UNK]",
    "Special Token: [MASK]"
  ),
  Value = c(
    tokenizer_cfg$tokenizer_class  %||% "BertTokenizer",
    tokenizer_cfg$model_max_length %||% "512",
    as.character(tokenizer_cfg$do_lower_case %||% TRUE),
    tokenizer_cfg$padding_side     %||% "right",
    "[CLS]", "[SEP]", "[PAD]", "[UNK]", "[MASK]"
  ),
  `Governance Note` = c(
    "BERT-family; vocabulary is fixed at training time",
    "512 tokens (~380 words); inputs exceeding this limit are silently truncated",
    "All input text is lowercased before tokenization",
    "Padding appended to the right for batch processing",
    "Prepended to every input sequence",
    "Appended after every input segment",
    "Used to pad shorter sequences in a batch to uniform length",
    "Replaces out-of-vocabulary tokens — a direct signal of domain coverage gaps",
    "Used during masked language model pre-training; not relevant at inference"
  )
)

tok_params |>
  kable(
    format    = "html",
    caption   = "Table 3: FinBERT Tokenizer Configuration",
    col.names = c("Parameter", "Value", "Governance Note"),
    align     = c("l", "l", "l")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = TRUE,
    position          = "left",
    font_size         = 13
  ) |>
  column_spec(1, bold = TRUE, width = "14em") |>
  column_spec(2, width = "12em") |>
  column_spec(3, width = "28em")
Table 3: FinBERT Tokenizer Configuration
Parameter Value Governance Note
Tokenizer Class BertTokenizer BERT-family; vocabulary is fixed at training time
Model Max Length 512 512 tokens (~380 words); inputs exceeding this limit are silently truncated
Do Lower Case TRUE All input text is lowercased before tokenization
Padding Side right Padding appended to the right for batch processing
Special Token: [CLS] [CLS] Prepended to every input sequence
Special Token: [SEP] [SEP] Appended after every input segment
Special Token: [PAD] [PAD] Used to pad shorter sequences in a batch to uniform length
Special Token: [UNK] [UNK] Replaces out-of-vocabulary tokens — a direct signal of domain coverage gaps
Special Token: [MASK] [MASK] Used during masked language model pre-training; not relevant at inference

Vocabulary Composition

The vocab.txt file contains all 30,522 tokens the model recognizes. Any token absent from this vocabulary is silently mapped to [UNK] (unknown). For financial text — where regulatory acronyms, numeric sequences, and specialized terminology are common — the rate of out-of-vocabulary mappings is a direct indicator of model reliability in a given deployment context.
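The coverage check itself reduces to a set membership test. A toy sketch follows, using an illustrative five-token vocabulary in place of the real 30,522-entry vocab.txt loaded below:

```r
# Illustrative mini-vocabulary; the real audit uses the full vocab.txt
vocab <- c("credit", "risk", "rate", "loan", "[UNK]")

# Candidate domain terms, lowercased to match the tokenizer's normalization
terms <- c("credit", "libor", "dscr", "rate")

# Share of terms absent from the vocabulary as whole words; these would be
# split into subword pieces or, failing that, mapped to [UNK]
oov_rate <- mean(!terms %in% vocab)
oov_rate  # 0.5 here: "libor" and "dscr" are absent
```

Run against a firm's own glossary of regulatory acronyms, the same test quantifies the domain coverage gap before any deployment decision.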

Display code
vocab       <- readLines(vocab_path)
total_tokens <- length(vocab)

vocab_df <- tibble(token = vocab) |>
  mutate(
    type = case_when(
      str_starts(token, "##")             ~ "Subword (continuation)",
      str_starts(token, "\\[")            ~ "Special token",
      str_detect(token, "^[0-9]+$")       ~ "Numeric",
      str_detect(token, "^[a-z]+$")       ~ "Lowercase word",
      str_detect(token, "^[A-Z][a-z]+$")  ~ "Capitalized word",
      str_detect(token, "^[A-Z]+$")       ~ "Uppercase / acronym",
      str_detect(token, "[^[:ascii:]]")   ~ "Non-ASCII / multilingual",
      str_detect(token, "[[:punct:]]")    ~ "Punctuation / symbol",
      TRUE                                ~ "Other"
    )
  )

vocab_summary <- vocab_df |>
  count(type, name = "count") |>
  mutate(
    pct = round(count / total_tokens * 100, 1),
    `Governance Implication` = case_when(
      type == "Subword (continuation)" ~
        "High subword ratio indicates morphologically rich coverage",
      type == "Special token" ~
        "Fixed control tokens; must not appear in user inputs",
      type == "Numeric" ~
        "Limited numeric coverage; financial figures may fragment across multiple tokens",
      type == "Lowercase word" ~
        "Core vocabulary; model lowercases all input before lookup",
      type == "Non-ASCII / multilingual" ~
        "Limited coverage; non-English regulatory text will degrade",
      type == "Uppercase / acronym" ~
        "Acronyms (LIBOR, DSCR, CET1) are lowercased before vocabulary lookup",
      TRUE ~ ""
    )
  ) |>
  arrange(desc(count))

vocab_summary |>
  kable(
    format    = "html",
    digits    = 1,
    caption   = glue("Table 4: FinBERT Vocabulary Composition (Total: {comma(total_tokens)} tokens)"),
    col.names = c("Token Type", "Count", "% of Vocab", "Governance Implication"),
    align     = c("l", "r", "r", "l")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = TRUE,
    position          = "left",
    font_size         = 13
  ) |>
  column_spec(1, bold = TRUE, width = "14em") |>
  column_spec(4, width = "28em")
Table 4: FinBERT Vocabulary Composition (Total: 30,522 tokens)
Token Type Count % of Vocab Governance Implication
Lowercase word 21745 71.2 Core vocabulary; model lowercases all input before lookup
Subword (continuation) 5828 19.1 High subword ratio indicates morphologically rich coverage
Special token 1000 3.3 Fixed control tokens; must not appear in user inputs
Non-ASCII / multilingual 946 3.1 Limited coverage; non-English regulatory text will degrade
Numeric 861 2.8 Limited numeric coverage; financial figures may fragment across multiple tokens
Other 119 0.4
Punctuation / symbol 23 0.1
Display code
vocab_summary |>
  mutate(type = fct_reorder(type, count)) |>
  ggplot(aes(x = count, y = type, fill = type)) +
  geom_col(show.legend = FALSE, width = 0.65) +
  geom_text(
    aes(label = glue("{comma(count)} ({pct}%)")),
    hjust  = -0.05,
    size   = 3.5,
    color  = "grey30"
  ) +
  scale_x_continuous(
    labels = comma,
    expand = expansion(mult = c(0, 0.28))
  ) +
  scale_fill_manual(
    values = colorRampPalette(c(brand_accent, brand_highlight, brand_secondary))(
      nrow(vocab_summary)
    )
  ) +
  labs(
    title    = "FinBERT Vocabulary Composition",
    subtitle = glue("Total vocabulary: {comma(total_tokens)} tokens"),
    x        = "Token Count",
    y        = NULL,
    caption = NULL,
      tag = NULL
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 18, color = brand_text),
    plot.subtitle = element_text(color = "grey40", size = 14),
    axis.title = element_text(size = 12),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y   = element_blank()
  )
Figure 1

In the figure above, whole lowercase words constitute the largest share of the vocabulary, with subword continuation tokens second, reflecting WordPiece segmentation of morphologically complex terms. Numeric coverage is limited, a governance consideration for financial figures.

Context Window Stress Test

The model_max_length: 512 parameter is a hard constraint for financial text applications: any input longer than 512 tokens is truncated, with no error or warning. The model processes the first 512 tokens and silently discards the rest.

This limit is routinely binding for financial documents such as regulatory disclosures and credit memoranda. A disclosure that opens with neutral language but closes with a material adverse event could be classified as neutral simply because the negative content falls beyond token 512.

The token counts below use a WordPiece approximation based on the vocab.txt file downloaded earlier. FinBERT ships only the legacy BertTokenizer format, so no fast-tokenizer interface is available in R. The estimate applies a 1.35 token-per-word multiplier to out-of-vocabulary terms, consistent with benchmarks for financial English on BERT-base, and adds two tokens for the mandatory [CLS] and [SEP] control tokens.

Display code
# FinBERT uses the legacy BertTokenizer format (vocab.txt + tokenizer_config.json)
# rather than the fast tokenizer format (tokenizer.json) that tok::tokenizer$from_pretrained()
# requires. We approximate token counts using WordPiece heuristics applied directly
# to the vocab.txt file already downloaded — no additional Hub calls needed.
#
# WordPiece tokenization splits unknown or morphologically complex words into subword
# pieces. The empirical token-to-word ratio for financial English on BERT-base is
# approximately 1.35 (words expand ~35% into tokens). We apply that multiplier and
# add 2 for the mandatory [CLS] and [SEP] special tokens.

wordpiece_token_estimate <- function(text, vocab, multiplier = 1.35) {
  words   <- unlist(str_split(str_to_lower(str_trim(text)), "\\s+"))
  n_words <- length(words)
  # Words fully in vocabulary tokenize as 1 token; others are split (multiplier > 1)
  in_vocab <- mean(words %in% vocab)
  effective_multiplier <- in_vocab + (1 - in_vocab) * multiplier
  as.integer(round(n_words * effective_multiplier)) + 2L  # +2 for [CLS] and [SEP]
}

test_texts <- list(
  "Short headline" =
    "Federal Reserve raises rates by 25 basis points amid persistent inflation.",

  "Earnings call excerpt" =
    paste(
      "Our credit loss provisions increased 14% year-over-year, reflecting",
      "deteriorating macroeconomic conditions in our consumer lending portfolio.",
      "Net charge-off rates rose to 2.3%, up from 1.8% in the prior year period.",
      "We continue to monitor our commercial real estate exposure closely,",
      "particularly in the office sector, where vacancy rates remain elevated."
    ),

  "Regulatory disclosure (moderate)" =
    paste(
      "The Company is subject to various regulatory capital requirements administered",
      "by federal banking agencies. Failure to meet minimum capital requirements can",
      "initiate certain mandatory and possibly additional discretionary actions by",
      "regulators that, if undertaken, could have a direct material effect on the",
      "Company's financial statements. Under capital adequacy guidelines and the",
      "regulatory framework for prompt corrective action, the Company must meet",
      "specific capital guidelines that involve quantitative measures of the Company's",
      "assets, liabilities, and certain off-balance-sheet items as calculated under",
      "regulatory accounting practices. The Company's capital amounts and",
      "classification are also subject to qualitative judgments by regulators about",
      "components, risk weightings, and other factors."
    ),

  "10-K risk section (long)" =
    paste(rep(paste(
      "The Company faces significant credit risk in its lending portfolio.",
      "Adverse changes in economic conditions, including rising interest rates,",
      "increased unemployment, or declining real estate values, could result in",
      "higher levels of nonperforming assets and credit losses. The Company's",
      "allowance for credit losses may prove to be insufficient to cover actual",
      "credit losses, which could have a material adverse effect on the Company's",
      "financial condition and results of operations."
    ), 5), collapse = " ")
)

context_results <- imap_dfr(test_texts, function(text, label) {
  n_tokens  <- wordpiece_token_estimate(text, vocab)
  n_words   <- str_count(text, "\\S+")
  pct_limit <- round(n_tokens / 512 * 100, 1)
  tibble(
    `Text Type`      = label,
    `Word Count`     = n_words,
    `Token Count`    = n_tokens,
    `% of 512 Limit` = pct_limit,
    `Truncated?`     = if_else(n_tokens > 512, "YES", "No")
  )
})

context_results |>
  kable(
    format    = "html",
    caption   = "Table 5: Context Window Analysis — Token Counts for Representative Risk Texts",
    col.names = c("Text Type", "Word Count", "Token Count", "% of 512 Limit", "Truncated?"),
    align     = c("l", "r", "r", "r", "c")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 13
  ) |>
  column_spec(1, bold = TRUE, width = "16em") |>
  column_spec(5, bold = TRUE) |>
  row_spec(
    which(context_results$`Truncated?` == "YES"),
    background = "#fff3cd"
  )
Table 5: Context Window Analysis — Token Counts for Representative Risk Texts
Text Type Word Count Token Count % of 512 Limit Truncated?
Short headline 11 13 2.5 No
Earnings call excerpt 50 56 10.9 No
Regulatory disclosure (moderate) 104 113 22.1 No
10-K risk section (long) 345 366 71.5 No
Note: Truncation Is Silent and Material

BERT-family models truncate inputs beyond 512 tokens without warning. In a risk workflow, that means the tail of a long credit memo or regulatory disclosure is simply discarded. Risk teams deploying any BERT-family model should log token counts for every input, route truncated documents to manual review, and never let a partial classification flow downstream unflagged.
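The monitoring rule in the note above reduces to a small gate before classification. A minimal sketch, with flag_truncation as a hypothetical helper name; the token count could come from the wordpiece_token_estimate() heuristic defined earlier:

```r
# Route any input whose estimated token count exceeds the context window
# to manual review instead of accepting a silently partial classification
flag_truncation <- function(n_tokens, limit = 512L) {
  if (n_tokens > limit) "manual_review" else "auto_classify"
}

flag_truncation(366)  # within the limit: "auto_classify"
flag_truncation(941)  # exceeds the limit: "manual_review"
```

Logging the token count alongside the routing decision gives the audit trail SR 11-7 reviewers will ask for.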

Model Governance Audit Card

The table below synthesizes the metadata review into a structured format aligned with SR 11-7 model documentation expectations. It is designed to serve as a transferable template: any BERT-family model can be audited using the same dimensions, with the findings column updated to reflect the specific model under review.

Display code
gov_card <- tibble(
  `Governance Dimension` = c(
    "Model Identifier",
    "Source / Provider",
    "Model Architecture",
    "Intended Use",
    "Training Data",
    "Output Schema",
    "Max Input Length",
    "Vocabulary Coverage",
    "Known Limitation: Truncation",
    "Known Limitation: Language",
    "Known Limitation: Domain Drift",
    "Known Limitation: Temporal",
    "Bias Considerations",
    "Recommended Validation Step",
    "SR 11-7 Classification"
  ),
  `Finding` = c(
    "ProsusAI/finbert (Hugging Face Hub)",
    "Prosus AI / Naspers; open-source (Apache 2.0 license)",
    "BERT-base-uncased (12 layers, 768 hidden, 12 heads, ~110M parameters)",
    "Financial news sentiment classification: positive / neutral / negative",
    "~4,840 financial news articles from Reuters and Bloomberg, labeled by Prosus AI analysts",
    "3-class softmax output; labels: 0 = positive, 1 = negative, 2 = neutral",
    "512 tokens (~380 words); inputs silently truncated beyond this limit",
    "30,522 tokens; primarily English; limited non-ASCII and numeric coverage",
    "Long documents (10-Ks, credit memos) may lose material content beyond token 512",
    "English only; multilingual or code-switched regulatory text will degrade model reliability",
    "Trained on news headlines; performance on internal risk memos and filings is unvalidated",
    "Training data predates the post-2022 rate cycle; macro sentiment calibration may have shifted",
    "Training labels reflect analyst judgments; inter-rater reliability and label subjectivity not published",
    "Back-test on firm-specific text prior to production deployment; log and flag all truncated inputs",
    "Vendor / third-party model; independent validation required before use in consequential decisions"
  )
)

gov_card |>
  kable(
    format    = "html",
    caption   = "Table 6: FinBERT Model Governance Audit Card",
    col.names = c("Governance Dimension", "Finding"),
    align     = c("l", "l")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = TRUE,
    font_size         = 13
  ) |>
  column_spec(1, bold = TRUE, width = "16em") |>
  row_spec(c(9, 10, 11, 12, 13), background = "#fff3cd") |>
  row_spec(15, background = "#f8d7da", bold = TRUE) |>
  pack_rows("Model Identity",           1,  3) |>
  pack_rows("Intended Use & Outputs",   4,  6) |>
  pack_rows("Technical Constraints",    7,  8) |>
  pack_rows("Known Limitations",        9, 13) |>
  pack_rows("Risk & Compliance",       14, 15)
Table 6: FinBERT Model Governance Audit Card
Governance Dimension Finding
Model Identity
Model Identifier ProsusAI/finbert (Hugging Face Hub)
Source / Provider Prosus AI / Naspers; open-source (Apache 2.0 license)
Model Architecture BERT-base-uncased (12 layers, 768 hidden, 12 heads, ~110M parameters)
Intended Use & Outputs
Intended Use Financial news sentiment classification: positive / neutral / negative
Training Data ~4,840 financial news articles from Reuters and Bloomberg, labeled by Prosus AI analysts
Output Schema 3-class softmax output; labels: 0 = positive, 1 = negative, 2 = neutral
Technical Constraints
Max Input Length 512 tokens (~380 words); inputs silently truncated beyond this limit
Vocabulary Coverage 30,522 tokens; primarily English; limited non-ASCII and numeric coverage
Known Limitations
Known Limitation: Truncation Long documents (10-Ks, credit memos) may lose material content beyond token 512
Known Limitation: Language English only; multilingual or code-switched regulatory text will degrade model reliability
Known Limitation: Domain Drift Trained on news headlines; performance on internal risk memos and filings is unvalidated
Known Limitation: Temporal Training data predates the post-2022 rate cycle; macro sentiment calibration may have shifted
Bias Considerations Training labels reflect analyst judgments; inter-rater reliability and label subjectivity not published
Risk & Compliance
Recommended Validation Step Back-test on firm-specific text prior to production deployment; log and flag all truncated inputs
SR 11-7 Classification Vendor / third-party model; independent validation required before use in consequential decisions

Insights & Conclusion

The audit surfaces four findings that should be addressed before FinBERT, or any BERT-family model, enters a financial risk workflow.

First, the truncation constraint is the most urgent. At 512 tokens, the model sees roughly 380 words, a limit many financial documents, from earnings call transcripts to 10-K risk sections, exceed. Because truncation is silent, a classification built on truncated input carries a confidence score indistinguishable from one built on a complete document; unless token counts are logged explicitly, the risk team cannot tell the two apart. Any deployment must account for this.

Second, FinBERT’s training data is narrower than its reputation suggests: roughly 4,840 Reuters and Bloomberg articles labeled by Prosus AI analysts, weighted heavily toward headlines and news prose. Applying the model to internal credit narratives or regulatory findings therefore introduces domain drift that has never been formally assessed; how well the model performs on those text types is simply unknown.

Third, the vocabulary has a coverage gap for numeric tokens. Financial text is dense with figures (percentages, basis points, loan-to-value ratios), yet the BERT-base vocabulary contains few pure-numeric tokens, so multi-digit figures fragment into several subword pieces. Fragmentation does not break inference, but it means the model represents a phrase like “CET1 ratio of 13.4%” very differently from how a human reads it, and nothing in the output scores reveals the difference.

Finally, the SR 11-7 classification is unambiguous: any model that influences a consequential financial decision requires formal model risk management.

That entails:

  • Documentation of intended use

  • Independent validation with firm data

  • Ongoing performance monitoring

The model card is just a starting point for this documentation, not a replacement.

This project demonstrates that the metadata layer of an open-source model can be fully audited from R in under 50 lines of code. With hfhub and tok, what once required Python fluency and manual review of web pages becomes a short, repeatable R workflow. For organizations building AI governance programs, that reproducibility matters: an audit that can be re-run against an updated model card is far more defensible than one that cannot.

Session Information

#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.5.2 (2025-10-31)
#>  os       macOS Tahoe 26.2
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2026-05-11
#>  pandoc   3.8.3 @ /Applications/Positron.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
#>  quarto   1.8.26 @ /Applications/quarto/bin/quarto
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  cli            3.6.5   2025-04-23 [1] CRAN (R 4.5.0)
#>  data.table     1.17.8  2025-07-10 [1] CRAN (R 4.5.0)
#>  digest         0.6.39  2025-11-19 [1] CRAN (R 4.5.2)
#>  dplyr        * 1.1.4   2023-11-17 [1] CRAN (R 4.5.0)
#>  evaluate       1.0.5   2025-08-27 [1] CRAN (R 4.5.0)
#>  farver         2.1.2   2024-05-13 [1] CRAN (R 4.5.0)
#>  fastmap        1.2.0   2024-05-15 [1] CRAN (R 4.5.0)
#>  forcats      * 1.0.1   2025-09-25 [1] CRAN (R 4.5.0)
#>  generics       0.1.4   2025-05-09 [1] CRAN (R 4.5.0)
#>  ggplot2      * 4.0.2   2026-02-03 [1] CRAN (R 4.5.2)
#>  glue         * 1.8.0   2024-09-30 [1] CRAN (R 4.5.0)
#>  gtable         0.3.6   2024-10-25 [1] CRAN (R 4.5.0)
#>  hfhub        * 0.1.2   2026-04-15 [1] CRAN (R 4.5.2)
#>  hms            1.1.4   2025-10-17 [1] CRAN (R 4.5.0)
#>  htmltools      0.5.8.1 2024-04-04 [1] CRAN (R 4.5.0)
#>  htmlwidgets    1.6.4   2023-12-06 [1] CRAN (R 4.5.0)
#>  httr           1.4.7   2023-08-15 [1] CRAN (R 4.5.0)
#>  jsonlite     * 2.0.0   2025-03-27 [1] CRAN (R 4.5.0)
#>  kableExtra   * 1.4.0   2024-01-24 [1] CRAN (R 4.5.0)
#>  knitr        * 1.50    2025-03-16 [1] CRAN (R 4.5.0)
#>  labeling       0.4.3   2023-08-29 [1] CRAN (R 4.5.0)
#>  lazyeval       0.2.2   2019-03-15 [1] CRAN (R 4.5.0)
#>  lifecycle      1.0.5   2026-01-08 [1] CRAN (R 4.5.2)
#>  lubridate    * 1.9.4   2024-12-08 [1] CRAN (R 4.5.0)
#>  magrittr       2.0.4   2025-09-12 [1] CRAN (R 4.5.0)
#>  pillar         1.11.1  2025-09-17 [1] CRAN (R 4.5.0)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.5.0)
#>  plotly       * 4.11.0  2025-06-19 [1] CRAN (R 4.5.0)
#>  purrr        * 1.2.0   2025-11-04 [1] CRAN (R 4.5.0)
#>  R6             2.6.1   2025-02-15 [1] CRAN (R 4.5.0)
#>  RColorBrewer   1.1-3   2022-04-03 [1] CRAN (R 4.5.0)
#>  readr        * 2.1.5   2024-01-10 [1] CRAN (R 4.5.0)
#>  rlang          1.1.7   2026-01-09 [1] CRAN (R 4.5.2)
#>  rmarkdown      2.30    2025-09-28 [1] CRAN (R 4.5.0)
#>  rstudioapi     0.17.1  2024-10-22 [1] CRAN (R 4.5.0)
#>  S7             0.2.1   2025-11-14 [1] CRAN (R 4.5.2)
#>  scales       * 1.4.0   2025-04-24 [1] CRAN (R 4.5.0)
#>  sessioninfo  * 1.2.3   2025-02-05 [1] CRAN (R 4.5.0)
#>  stringi        1.8.7   2025-03-27 [1] CRAN (R 4.5.0)
#>  stringr      * 1.6.0   2025-11-04 [1] CRAN (R 4.5.0)
#>  svglite        2.2.2   2025-10-21 [1] CRAN (R 4.5.0)
#>  systemfonts    1.3.1   2025-10-01 [1] CRAN (R 4.5.0)
#>  textshaping    1.0.4   2025-10-10 [1] CRAN (R 4.5.0)
#>  tibble       * 3.3.0   2025-06-08 [1] CRAN (R 4.5.0)
#>  tidyr        * 1.3.1   2024-01-24 [1] CRAN (R 4.5.0)
#>  tidyselect     1.2.1   2024-03-11 [1] CRAN (R 4.5.0)
#>  tidyverse    * 2.0.0   2023-02-22 [1] CRAN (R 4.5.0)
#>  timechange     0.3.0   2024-01-18 [1] CRAN (R 4.5.0)
#>  tok          * 0.2.2   2026-04-22 [1] CRAN (R 4.5.2)
#>  tzdb           0.5.0   2025-03-15 [1] CRAN (R 4.5.0)
#>  vctrs          0.7.1   2026-01-23 [1] CRAN (R 4.5.2)
#>  viridisLite    0.4.3   2026-02-04 [1] CRAN (R 4.5.2)
#>  withr          3.0.2   2024-10-28 [1] CRAN (R 4.5.0)
#>  xfun           0.54    2025-10-30 [1] CRAN (R 4.5.0)
#>  xml2           1.4.1   2025-10-27 [1] CRAN (R 4.5.0)
#>  yaml           2.3.10  2024-07-26 [1] CRAN (R 4.5.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library
#>  * ── Packages attached to the search path.
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Rendered with Quarto. Analysis conducted in R using hfhub, tok, tidyverse, kableExtra, and plotly.