The author

Jeremy Bryson

Strategic Advisor, Risk Architecture

22 August 2025

How well do you know your data?

Timely, accurate, good-quality data, transformed into analysis, models, and risk assessments, is essential for decision-making. Regulators, through BCBS 239 and other initiatives, have attempted to ensure the data being used is understood, controlled, and ultimately fit for purpose. Decision-makers, however, often want simple answers and do not deal well with errors or confidence intervals and the uncertainty they convey.

Imagine a scenario where information is presented as ranges rather than single values; it is easy to see how uncertainty creeps into decision-making. In reality, we are often dealing with estimates of the true position; it's just that we've either explicitly or implicitly accepted the quality of the data we're using.

Everyone wants to spend time on what the data tells us, such as model outputs. This makes it crucial that the data, transformations, and controls are understood and that quality thresholds are determined; yet this area often receives too little focus. To add further complexity, data use is driven by context: accuracy requirements differ depending on the type of decision being made.

In our experience, structural flaws in the data are behind many of the most persistent model performance problems. According to Gartner, financial institutions lose an average of $15 million per year due to poor data quality.

A structured data audit brings those risks to the surface early. It can prevent costly rework, strengthen model defensibility, and give risk leaders greater confidence in the models they approve.

In this article, we outline why firms should start by scrutinising the data and what a robust audit should cover.


Why we overlook data risk

Within the model lifecycle, attention is centred on the model layer: performance metrics, segmentation logic, overrides. Data, by comparison, tends to receive a lighter touch: basic checks for completeness, missing values, or surface-level quality control.

This imbalance often results from firms accepting that the “data is the data” rather than defining the quality required and addressing any deficiencies before the data is used in anger. The impact is real: a Mosaic Smart Data survey found that 66% of banks face persistent data quality and integrity issues, while 83% lack real-time access to transaction data due to fragmented systems. In those circumstances, model outputs can be suspect and cannot be relied on for low-latency decision-making.

Data issues are often noted in model validation reports, but by then they are typically accepted, because going back to the drawing board and redeveloping the model is costly and time-consuming.

Regulators increasingly view poor data as a form of model risk. And with more models drawing on behavioural, granular, or third-party data, those risks are harder to ignore.

Some of the most common, and frequently missed, risks include:

  • Inconsistent variable definitions — where the same feature is constructed or interpreted differently across environments
  • Undocumented transformations — logic applied outside the core build, invisible to reviewers
  • Lineage gaps — where input data can’t be traced cleanly from source to model-ready format
  • Version drift — when datasets are refreshed without revalidation, introducing instability (illustrated in the sketch below)
  • Redundant or legacy inputs — features still in use despite being outdated or unmonitored

These issues rarely show up in performance metrics. Once embedded in a model, they become harder to isolate and even harder to remediate.
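
Some of these risks can be caught with very simple automated checks. As an illustration of the version-drift point above, the sketch below is a minimal Python example only; the file names, column names, and tolerance are hypothetical. It compares a refreshed dataset snapshot with the one the model was built on and flags numeric variables whose averages have shifted beyond a tolerance:

```python
import pandas as pd

def detect_version_drift(build_df: pd.DataFrame,
                         refreshed_df: pd.DataFrame,
                         tolerance: float = 0.05) -> pd.DataFrame:
    """Flag numeric columns whose mean has shifted by more than
    `tolerance` (relative) between the build-time snapshot and a
    refreshed extract -- a crude proxy for version drift."""
    rows = []
    common = build_df.columns.intersection(refreshed_df.columns)
    for col in common:
        if pd.api.types.is_numeric_dtype(build_df[col]):
            base = build_df[col].mean()
            new = refreshed_df[col].mean()
            shift = abs(new - base) / abs(base) if base != 0 else abs(new)
            rows.append({"variable": col, "build_mean": base,
                         "refreshed_mean": new, "relative_shift": shift,
                         "drifted": shift > tolerance})
    return pd.DataFrame(rows)

# Hypothetical usage: snapshots taken at model build and at the latest refresh
build_snapshot = pd.read_parquet("model_build_snapshot.parquet")
latest_refresh = pd.read_parquet("latest_refresh.parquet")
report = detect_version_drift(build_snapshot, latest_refresh)
print(report[report["drifted"]])
```

In practice a firm would use richer distributional tests, but even a check this crude makes silent dataset refreshes visible before they reach the model.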

According to industry research, data scientists already spend around 60% of their time on data preparation, with another 19% spent simply trying to locate the right information. When structural data issues surface late in validation, this resource drain intensifies, often forcing a choice between costly rework, delayed approvals, and mounting governance debt on the one hand, or accepting the model's deficiencies on the other.

What a data audit must assess

A good audit should answer a simple but critical question: Can this dataset support the model’s assumptions, outcomes, and regulatory requirements, with confidence?

That requires more than checking for completeness or accuracy. It means testing whether the data environment is robust, explainable, and fit for purpose. The most effective audits focus on five key areas:

1. Lineage and traceability

You should be able to trace every variable back to its source and evidence the path it took to the modelling dataset. That includes identifying all transformation steps, manual adjustments, and staging processes. Gaps in lineage raise questions about reproducibility, auditability, and compliance.
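
One lightweight way to evidence that path is to record a lineage manifest entry for every step from source extract to model-ready dataset, including a content fingerprint so any later change is detectable. The sketch below is a minimal illustration only; the stage names, files, and source system are hypothetical:

```python
import hashlib
import json
from datetime import datetime, timezone

import pandas as pd

def dataset_fingerprint(df: pd.DataFrame) -> str:
    """Content hash of a dataframe, so a stage can be re-verified later."""
    return hashlib.sha256(
        pd.util.hash_pandas_object(df, index=True).values.tobytes()
    ).hexdigest()

def record_stage(manifest: list, stage: str, source: str,
                 df: pd.DataFrame, transformation: str) -> None:
    """Append one lineage entry covering a single transformation step."""
    manifest.append({
        "stage": stage,
        "source": source,
        "transformation": transformation,
        "rows": len(df),
        "columns": list(df.columns),
        "fingerprint": dataset_fingerprint(df),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })

# Hypothetical pipeline: raw extract -> cleaned -> model-ready
manifest: list = []
raw = pd.read_csv("raw_extract.csv")
record_stage(manifest, "raw", "core_banking.accounts", raw, "extract")

cleaned = raw.dropna(subset=["balance"])
record_stage(manifest, "cleaned", "raw", cleaned, "drop rows with missing balance")

with open("lineage_manifest.json", "w") as fh:
    json.dump(manifest, fh, indent=2)
```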

2. Variable construction and logic

Many features are reused from earlier models without reassessment. Audits need to confirm how each variable was built, whether definitions are still valid, and if logic is applied consistently across development, validation, and production. Without this, assumptions can drift unnoticed.
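
One simple way to test that consistency is to rebuild the variable from raw fields using a single agreed definition and compare it with the value stored in each environment. The sketch below is purely illustrative; the feature (a utilisation ratio), the raw fields, and the dataset names are hypothetical:

```python
import pandas as pd

def build_utilisation(df: pd.DataFrame) -> pd.Series:
    """One agreed definition of the feature, applied everywhere."""
    return (df["balance"] / df["credit_limit"]).clip(upper=1.0)

def compare_construction(dev: pd.DataFrame, prod: pd.DataFrame) -> pd.DataFrame:
    """Recompute the feature in both environments and report how many
    records disagree with the value already stored in each dataset."""
    results = []
    for name, df in [("development", dev), ("production", prod)]:
        rebuilt = build_utilisation(df)
        mismatch = (df["utilisation"] - rebuilt).abs() > 1e-9
        results.append({"environment": name,
                        "records": len(df),
                        "mismatches": int(mismatch.sum())})
    return pd.DataFrame(results)

dev_df = pd.read_parquet("dev_modelling_dataset.parquet")
prod_df = pd.read_parquet("prod_scoring_dataset.parquet")
print(compare_construction(dev_df, prod_df))
```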

3. Governance and ownership

Is it clear who owns each dataset? Are there controls in place to manage updates, versioning, and access? Strong governance underpins reliability and is increasingly demanded by both internal audit and external regulators.

4. Completeness, consistency, and reconciliation

This goes beyond filling gaps. Audits should test for conflicting sources, unflagged outliers, structural bias, and inconsistent field treatments across systems or time periods. Poor reconciliation between environments often leads to hidden errors that affect model behaviour.
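
Much of this can be automated ahead of the audit itself. The sketch below is an illustrative example only, with hypothetical system names, fields, and thresholds: it reconciles record counts, totals, and key coverage between two systems, and applies a crude z-score flag to values that no one has explained:

```python
import pandas as pd

def reconcile(source_a: pd.DataFrame, source_b: pd.DataFrame,
              key: str, amount: str) -> dict:
    """Reconcile two systems on record counts, totals, and key coverage."""
    keys_a, keys_b = set(source_a[key]), set(source_b[key])
    return {
        "count_a": len(source_a),
        "count_b": len(source_b),
        "total_a": float(source_a[amount].sum()),
        "total_b": float(source_b[amount].sum()),
        "missing_from_b": len(keys_a - keys_b),
        "missing_from_a": len(keys_b - keys_a),
    }

def flag_outliers(df: pd.DataFrame, col: str, z: float = 4.0) -> pd.DataFrame:
    """Simple z-score flag for extreme values that have not been explained."""
    scores = (df[col] - df[col].mean()) / df[col].std()
    return df[scores.abs() > z]

ledger = pd.read_parquet("general_ledger_balances.parquet")
datamart = pd.read_parquet("risk_datamart_balances.parquet")
print(reconcile(ledger, datamart, key="account_id", amount="balance"))
print(flag_outliers(datamart, "balance"))
```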

5. Usage alignment

Every input should earn its place. Are the variables used in the model still necessary? Do they materially contribute to performance? Are they understood by the teams using or approving the model? Eliminating redundant or poorly understood features helps reduce complexity and improve explainability.
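
A first pass at those questions can also be automated before anyone debates individual variables. The sketch below is a minimal, hypothetical illustration: it flags features that are near-constant or highly correlated with another input as candidates for challenge in the audit:

```python
import pandas as pd

def redundancy_candidates(features: pd.DataFrame,
                          corr_threshold: float = 0.95,
                          min_variance: float = 1e-6) -> dict:
    """Flag near-constant features and highly correlated pairs as
    candidates for removal or further justification."""
    near_constant = [c for c in features.columns
                     if features[c].var() < min_variance]
    corr = features.corr().abs()
    correlated_pairs = [
        (a, b, round(corr.loc[a, b], 3))
        for i, a in enumerate(corr.columns)
        for b in corr.columns[i + 1:]
        if corr.loc[a, b] > corr_threshold
    ]
    return {"near_constant": near_constant,
            "highly_correlated_pairs": correlated_pairs}

model_inputs = pd.read_parquet("model_input_features.parquet")
print(redundancy_candidates(model_inputs.select_dtypes("number")))
```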

A well-executed audit does more than surface issues. It provides clear evidence that the data behind the model is stable, governed, and aligned to its purpose. That’s the level of assurance regulators expect. And it’s what enables model owners to stand behind their decisions with confidence.

Spotlight: Regulatory expectations for data

Statistical performance is no longer the only benchmark for model approval. Supervisors increasingly want evidence that the data feeding the model is reliable, governed, and fully explainable, particularly for models used in IFRS 9, ICAAP, and stress testing.

We’ve seen these expectations reflected across multiple supervisory reviews, with common findings including:

  • “Insufficient traceability of input variables”
  • “Inadequate documentation of variable construction”
  • “Limited governance of third-party data dependencies”
  • “Unclear alignment between development and production data pipelines”

These issues aren’t just technicalities. Even well-performing models may be challenged or delayed if they rely on undocumented or opaque data structures.

The PRA continues to flag data quality and governance concerns in IRB and IFRS 9 reviews. The ECB and EBA have issued similar warnings. In fact, the ECB’s TRIM review found that model validation generated more findings than any other topic, including the largest number of severe findings.

This focus is only intensifying, particularly as firms scale up AI-driven models and incorporate increasingly granular behavioural data. For risk leaders and model owners, the message is clear: if you can’t explain the data, you can’t defend the model.

The benefits of starting with data (before you even touch the model)

Beginning any model development with a structured data audit gives early visibility into risks that can otherwise derail the process — or worse, go undetected until post-implementation. For model owners, this shift improves assurance, reduces rework, and strengthens regulatory positioning.

Here’s what that means in practice:

Early identification of systemic risk

Data audits often reveal structural flaws that affect not just the model under review, but others built using the same logic, variables, or pipelines. Addressing these issues early prevents duplicated risk and avoids spreading the same weakness across your portfolio.

Reduced rework and remediation effort

When data issues surface late in validation, the fix is rarely simple. It usually means redevelopment, overlays, or additional governance documentation. By tackling data problems first, firms avoid unnecessary rework and reduce the risk of late-stage blockers.

Stronger regulatory evidence

A documented audit trail of data inputs, ownership, and transformation logic helps demonstrate that the model is built on sound foundations. This builds trust with supervisors and can significantly speed up approval cycles.

Improved model defensibility

If a model is challenged, internally or externally, firms need to justify both the outputs and the inputs. A comprehensive audit shows that assumptions are backed by governed, reliable data. That’s increasingly what regulators and audit committees want to see.

Better business alignment

A data-first approach ensures that models reflect current customer behaviour, policy design, and regulatory context. That alignment reduces the risk of unintended outcomes and strengthens the link between model outputs and real-world decision-making.

Reduced cost of data management

The cost of managing data falls as trust in its use increases: the effort of assuring data quality through the lineage and traceability process is significantly reduced, and model components are reused more often.

The bottom line

Taking a data-first approach gives firms a clearer view of structural risks, improves internal confidence in the model, and provides stronger evidence during regulatory or audit scrutiny.

This is what enables model owners, developers, and senior risk leaders to stand behind the models they approve.

How Jaywing approaches data-first validation

Jaywing’s model validation methodology begins with a structured audit of the data architecture, tracing variables back to source, assessing reproducibility, and testing for governance gaps and consistency across environments.

We don’t stop at identification. Where issues are found, such as undocumented transformations, lineage gaps, or unmonitored variables, we provide clear, actionable remediation plans to help clients close governance gaps and prepare for supervisory review.

Our data audit process draws on 25+ years of data management experience and is underpinned by our broader capabilities in data strategy, compliance, and infrastructure design. We support clients across the full lifecycle of data assurance, including:

  • Auditing and data scoping – Understand and document your data sources and flows
  • Normalisation and cleansing – Ensure data consistency, accuracy, and reliability
  • Merging and connectivity – Integrate fragmented sources into a scalable environment
  • Architecture design – Build efficient, modern, and compliant data infrastructure, be that technology, processes or roles and responsibilities
  • Data build and maintenance – Create and optimise structured, model-ready repositories
  • Data governance – Establish clear roles, responsibilities, and controls
  • Regulatory readiness – Align to BCBS 239, DCAM, GDPR and internal frameworks
  • Documentation and transparency – Describe the data structure to support auditability
  • Strategic alignment – Help C-suite leaders use data to drive better decisions

Our approach is rooted in banking-strength data practices and shaped by deep experience in regulatory reporting, AI optimisation, and risk modelling. We understand the demands of IFRS 9, ICAAP, and regulatory redress programmes, and we tailor our support to meet each organisation’s risk appetite, infrastructure maturity, and governance needs.

Find out more about our data management capabilities.