The author

Carl Ireland

Head of Regulatory Risk

20 April 2026

Modernising credit-risk models without drifting outside risk appetite

Many lenders are relying on models built for very different market conditions. These models were calibrated on borrower profiles, employment structures and economic environments that have shifted fundamentally in recent years. A clear example is the challenge of estimating forward-looking default rates in an inflationary environment, where models built on more benign periods have shown clear performance deficiencies.

As conditions change, models built on historical assumptions can drift out of line with the risks firms believe they are taking. Lending decisions may still follow the scorecard, but the portfolio being originated no longer reflects the risk appetite the board believes it has approved.

Scorecards rely on fixed weightings derived from historical data and do not adjust as borrower behaviour evolves. Monitoring frameworks often detect this slowly, particularly where validation cycles are periodic, meaning deterioration can build for months before it becomes visible. By that point, the firm may already be originating business outside its intended risk profile.

Many firms are now looking to modernise their credit models, often using explainable machine learning. Multi-year performance tracking of gradient boosting machines (GBMs) and deep neural networks (DNNs) shows that concerns about the instability of these techniques are often overstated when models are built with appropriate discipline. The challenge is ensuring that governance and monitoring frameworks evolve alongside the models, so that performance remains controlled and predictable.

This blog covers:

  • Why legacy scorecards can drift out of line with actual borrower risk as economic conditions, income structures and borrower behaviours change
  • How weakening risk ranking and calibration can push lending outside board-approved risk appetite, with implications for pricing, provisioning and capital planning
  • Why many modernisation programmes fail, often because the issue is misdiagnosed as a modelling problem rather than data, governance or decision architecture
  • What a hybrid approach looks like, with scorecards providing a stable decision backbone and machine learning adding value where traditional data is weak
  • The governance and monitoring capabilities required to introduce these techniques without losing control of model performance or auditability

Why legacy scorecards drift out of alignment with borrower risk

Traditional credit scorecards rely on fixed weightings and binning calibrated to historical data. This structure provides stability and transparency, but it also means the model reflects the borrower population and economic conditions that existed at the point of development.

As those conditions change, the assumptions embedded in the model begin to age. Borrower behaviour evolves, income structures become more varied, and economic cycles introduce different stress patterns across the portfolio. Because the scorecard cannot adjust automatically, its ability to rank risk accurately can deteriorate.

We recently looked into how interconnected model systems can increase model risk exposure, often exacerbated by API logic failures or semantic drift. Here, the focus is on how the individual models within that system behave as those underlying assumptions change.

Development samples that no longer reflect today’s borrowers

Representativeness of the development sample is often a core issue. Many bureau-based models were built on populations where employment structures and income profiles were relatively stable. The current applicant base is more diverse, with a wider range of income sources and employment arrangements.

When the population the model was trained on no longer resembles the borrowers it is applied to, deterioration in risk ranking and calibration becomes difficult to avoid.

Portfolio segmentation hidden behind a single score

Risk can also be obscured by the way portfolios are summarised. A single score applied across a broad customer base can mask differences between borrowers.

Segments that respond differently to economic stress may still appear similar within the overall score distribution. This makes it harder to identify emerging concentrations of risk until performance has already begun to deteriorate.
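To make this concrete, here is a minimal sketch in Python on simulated data (the band and segment names are invented for the example) showing how default rates can diverge by segment within the same score band:

```python
import numpy as np
import pandas as pd

# Simulated portfolio: two segments share the same score bands but
# default at different rates (bands and segments are illustrative).
rng = np.random.default_rng(1)
n = 20_000
df = pd.DataFrame({
    "score_band": rng.integers(1, 6, n),  # 1 = highest-risk band
    "segment": rng.choice(["salaried", "self_employed"], n),
})
base_rate = 0.08 / df["score_band"]
uplift = np.where(df["segment"] == "self_employed", 1.8, 1.0)
df["defaulted"] = (rng.random(n) < base_rate * uplift).astype(int)

# Default rate by band and segment: divergence within a band is risk
# that the overall score distribution alone would mask.
print(df.pivot_table(index="score_band", columns="segment",
                     values="defaulted", aggfunc="mean"))
```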

Weakening rank-order performance and PD calibration

As rank-order performance weakens, probability of default (PD) estimates become less reliable. The portfolio being originated can begin to diverge from the assumptions used in pricing, provisioning and capital planning.

This has direct implications for IFRS 9. Significant Increase in Credit Risk (SICR) triggers, which determine whether exposures move from Stage 1 to Stage 2, are often scorecard-derived. Where rank-order performance has deteriorated, these triggers become less reliable, and the model may fail to identify deteriorating borrowers early enough to prompt staging changes.

This can lead to understated provisions. At the same time, a miscalibrated PD feeding into lifetime ECL calculations compounds the issue at portfolio level. If monitoring frameworks do not track SICR trigger reliability alongside standard performance indicators such as the Kolmogorov–Smirnov (KS) statistic and the Population Stability Index (PSI), these risks may only become visible during audit or supervisory review.
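To illustrate one common shape for a PD-based staging trigger (a relative lifetime PD deterioration test with an absolute floor; the thresholds and names below are illustrative assumptions, not a prescribed calibration):

```python
def sicr_flag(pd_at_origination: float, pd_now: float,
              relative_trigger: float = 3.0,
              absolute_floor: float = 0.0005) -> bool:
    """Flag a move from Stage 1 to Stage 2 when lifetime PD has deteriorated
    materially in relative terms, with an absolute floor so that negligible
    PDs do not trigger staging on the ratio alone."""
    relative = pd_now / pd_at_origination >= relative_trigger
    absolute = (pd_now - pd_at_origination) >= absolute_floor
    return relative and absolute

# A loan originated at 1% lifetime PD, now re-estimated at 3.5%:
print(sicr_flag(0.01, 0.035))  # True -> Stage 2
```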

Governance gaps and monitoring blind spots

Governance practices can reinforce these issues. Scores developed for originations are often extended into areas such as limit management or pricing without formal use-extension review. For firms operating under IRB approval, this creates a direct obligation under SS1/23. For others, similar expectations exist under SM&CR and broader supervisory principles.

Monitoring frameworks also tend to focus on rank-order metrics such as Gini, with less attention given to calibration or early indicators of population drift, including measures such as PSI.
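For reference, a minimal sketch of how these rank-order metrics could be computed on a monitoring sample, assuming scikit-learn and SciPy are available and that higher scores indicate higher risk:

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score

def rank_order_metrics(defaulted: np.ndarray, scores: np.ndarray) -> dict:
    """Gini and KS on a monitoring sample. `defaulted` holds 0/1 outcomes;
    `scores` are model outputs where higher means riskier."""
    gini = 2 * roc_auc_score(defaulted, scores) - 1
    ks = ks_2samp(scores[defaulted == 1], scores[defaulted == 0]).statistic
    return {"gini": gini, "ks": ks}
```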

In these situations, the scorecard continues to produce outputs and may appear operational. However, the relationship between the score and underlying borrower risk is no longer as reliable as the firm assumes.

Why credit model modernisation is often misdiagnosed

When firms set out to modernise their credit models, the focus typically turns to modelling techniques. Attention quickly moves to whether more advanced approaches, such as GBMs or DNNs, can improve predictive performance. In practice, however, the model itself is rarely the primary constraint: performance gains from these techniques are only realised when the supporting data and governance frameworks are fit for purpose. Without those foundations, the risks introduced by more complex methods can outweigh any improvement in predictive accuracy.

Our analysis of GBMs and DNNs in live deployment shows broadly similar long-term stability properties when models are built with appropriate discipline. Stability is determined more by development approach than by model class. Where this discipline is absent, performance issues tend to reflect weaknesses in data, validation or monitoring rather than the modelling technique itself.

Modernisation programmes often prioritise improvements in standard performance indicators. Jaywing’s own analysis shows that machine learning models can deliver sustained uplifts in relative Gini of 5–10% over multi-year periods. However, improvements in standard performance indicators do not necessarily translate into better lending decisions. If cut-offs, affordability assessments or manual overrides remain unchanged, the overall decisioning framework may not improve. A model can deliver a measurable statistical lift while still producing poorer outcomes at the policy boundary. Expected loss at the decision threshold is often a more meaningful indicator of whether a model is improving credit decisioning.
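A hedged sketch of that decision-threshold view: expected loss on the accepted book at a given PD cut-off, using the standard EL = PD × LGD × EAD decomposition (the array names are illustrative):

```python
import numpy as np

def expected_loss_at_cutoff(pd_est: np.ndarray, ead: np.ndarray,
                            lgd: float, cutoff: float) -> float:
    """Expected loss on the accepted book when applications with a modelled
    PD below the cut-off are accepted (EL = PD x LGD x EAD per account)."""
    accepted = pd_est < cutoff
    return float(np.sum(pd_est[accepted] * ead[accepted] * lgd))
```

Comparing a champion and challenger through this lens, at the same cut-off, isolates the decision-level impact from any headline Gini difference.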

Performance issues are frequently attributed to the model when the underlying problem sits in the data. Development samples may not reflect the current applicant population, and variables may behave differently across time or customer segments. Variables that appear predictive during development can be outcome-correlated in ways that do not persist in production. This can create an inflated view of model performance that deteriorates once the model is deployed.

Model validation is often treated as a one-off exercise prior to deployment. This approach does not align with current supervisory expectations, particularly under SS1/23 for IRB firms, where validation is expected to operate as an ongoing control. Where validation is concentrated at the end of development, issues such as population drift, feature instability or sensitivity at decision boundaries may not become visible until the model is in production.

Another common issue is how models are embedded within decision systems. Risk estimation, policy rules and manual interventions are often tightly coupled. When these elements change at the same time, it becomes difficult to determine whether changes in portfolio performance are driven by the model itself or by adjustments to policy or operational processes. Separating risk estimation from policy design allows firms to attribute performance changes more accurately and supports more effective challenger testing.


Designing a hybrid approach to credit modelling

For lenders at an early stage of adopting machine learning, modernisation does not require the immediate replacement of traditional scorecards. A hybrid approach allows firms to introduce more advanced techniques without destabilising existing decision frameworks.

In this structure, scorecards provide a stable and transparent decision backbone, while machine learning models are used to generate additional insight or support specific use cases. As firms build capability in managing more complex models, reliance on machine learning can increase over time.

Evidence from multi-year model tracking shows that well-built GBMs and DNNs can maintain performance advantages over linear models across extended outcome periods. The role of the hybrid approach is to facilitate controlled experimentation rather than to define a static end state.

1. Retain the scorecard as the decision backbone

In a hybrid architecture, the traditional points-based scorecard remains the primary measure of credit risk. This provides stability in decisioning and supports consistent adverse action reasoning for declined applicants.

Maintaining the scorecard at the core of the process also aligns with established governance expectations around validation, documentation and regulatory review.
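As an illustration of why the points-based structure supports adverse action reasoning, here is a minimal, hypothetical scorecard in which the lowest-contributing characteristics become the decline reasons (the bins, points and characteristic names are invented for the example):

```python
# Illustrative points-based scorecard: binned characteristics map to points,
# and the most penalising characteristics supply adverse action reasons.
SCORECARD = {
    "months_at_bank":  lambda v: 30 if v >= 36 else (10 if v >= 12 else -15),
    "utilisation_pct": lambda v: 20 if v < 30 else (0 if v < 75 else -25),
    "recent_searches": lambda v: 10 if v <= 1 else (-5 if v <= 3 else -20),
}
BASE_POINTS = 600

def score_applicant(applicant: dict, n_reasons: int = 2):
    contributions = {k: fn(applicant[k]) for k, fn in SCORECARD.items()}
    total = BASE_POINTS + sum(contributions.values())
    # Adverse action reasons: characteristics contributing the fewest points.
    reasons = sorted(contributions, key=contributions.get)[:n_reasons]
    return total, reasons

print(score_applicant({"months_at_bank": 8, "utilisation_pct": 82,
                       "recent_searches": 4}))
# -> (540, ['utilisation_pct', 'recent_searches'])
```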

2. Use machine learning where traditional data provides limited signal

Machine learning tends to add most value in segments where bureau data is less informative.

Thin-file applicants are a common example. In these cases, alternative data sources such as open banking can provide behavioural signals that improve risk differentiation without replacing the core scorecard.

Model selection should reflect the data available. In lower-volume segments with limited observed defaults, deep neural networks may be more appropriate, as GBMs are more susceptible to overfitting when outcome density is low. As portfolios scale and more outcomes are observed, GBMs typically become more reliable. The choice should be evidence-led and justified.

3. Development discipline is the primary control for model stability

The empirical case for machine learning does not rely on accepting a higher risk of instability. Long-term performance is driven by development discipline rather than model type.

Multi-year outcome data show that well-built GBMs and DNNs can maintain stable and, in some cases, widening performance advantages over traditional scorecards. Where models degrade, this is typically linked to issues such as overfitting or feature instability rather than the modelling approach itself.

Development standards — including out-of-time validation, conservative configuration and feature constraints — should therefore be clearly defined, documented and consistently applied.
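A sketch of what those standards can look like in practice, assuming LightGBM is used; the dataset, column names and constraint signs below are hypothetical, and the configuration is deliberately conservative rather than tuned:

```python
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_parquet("originations.parquet")  # hypothetical dataset
features = ["bureau_score", "debt_to_income", "months_in_role"]

# Out-of-time validation: develop on older vintages, hold out recent ones.
train = df[df["app_month"] < "2023-01"]
oot = df[df["app_month"] >= "2023-01"]

# Conservative configuration with feature (monotonicity) constraints:
# risk falls as bureau score rises (-1), rises with debt-to-income (+1),
# and falls with months in role (-1).
model = lgb.LGBMClassifier(
    n_estimators=300, learning_rate=0.03, max_depth=3, num_leaves=7,
    monotone_constraints=[-1, 1, -1],
)
model.fit(train[features], train["default_12m"])

oot_auc = roc_auc_score(oot["default_12m"],
                        model.predict_proba(oot[features])[:, 1])
print(f"Out-of-time Gini: {2 * oot_auc - 1:.3f}")
```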

4. Separate risk estimation from lending policy

Risk estimation should remain distinct from policy decisions such as eligibility rules, affordability thresholds and manual overrides.

Where these elements are tightly coupled, it becomes difficult to determine whether changes in portfolio performance are driven by the model or by policy adjustments. Separating these components allows for clearer attribution of performance changes and supports more robust challenger testing.

5. Define clear fallback criteria

Hybrid systems require predefined recovery paths if model performance deteriorates.

Quantitative thresholds should be agreed in advance to trigger a reversion to the baseline scorecard. For example, a PSI exceeding a defined threshold, such as 0.25, can serve as the trigger to revert to the baseline decision logic while the issue is investigated.
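A minimal sketch of the PSI calculation and the 0.25 trigger described above, with bins set on the development distribution (the simulated scores are purely illustrative):

```python
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between the development score distribution
    and the current scored population, with bins set on the baseline."""
    edges = np.percentile(baseline, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    b = np.histogram(baseline, edges)[0] / len(baseline)
    c = np.histogram(current, edges)[0] / len(current)
    b, c = np.clip(b, 1e-6, None), np.clip(c, 1e-6, None)
    return float(np.sum((c - b) * np.log(c / b)))

# Simulated example: the live population has shifted relative to development.
rng = np.random.default_rng(0)
dev_scores = rng.normal(0.0, 1.0, 50_000)
live_scores = rng.normal(0.6, 1.0, 50_000)
value = psi(dev_scores, live_scores)
print(value, "-> revert to baseline scorecard" if value > 0.25 else "-> ok")
```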

Clear ownership of these thresholds and escalation processes is essential to ensure consistent and timely intervention.

6. Embed explainability within the decision pipeline

Explainability should be built into the modelling and decision pipeline so that internal stakeholders can interrogate model behaviour and challenge outputs.

This is distinct from consumer-facing explanations. Internal explainability supports governance, validation and model oversight, while customer explanations are designed to help applicants understand the reasons behind a decision.

Governance readiness for explainable machine learning

Traditional scorecard governance frameworks were designed for relatively simple, linear models with stable and transparent inputs. Machine learning models introduce different challenges, particularly in how they behave across different customer profiles and how sensitive they are to changes in underlying data.

For that reason, introducing these techniques into high-stakes credit decisioning requires governance capable of challenging, monitoring and evidencing model behaviour at a greater level of depth.

Several capabilities are particularly important.

  • Independent Model Validation (IMV) capability: Validation teams need the technical expertise to challenge non-linear model logic and understand how features interact within machine learning models. This includes assessing whether post-hoc explainability techniques behave consistently across customer profiles and whether they provide an accurate representation of how the model operates.
  • Bias and proxy discrimination testing: As datasets become larger and more complex, the risk increases that apparently neutral variables act as proxies for protected characteristics. Governance frameworks therefore need statistically robust bias testing capable of identifying proxy relationships within high-dimensional data, and of assessing whether these relationships remain stable across different population segments.
  • Ongoing monitoring of feature drift: Machine learning models can respond differently from linear models when the underlying data changes, particularly where there has been overfitting to the training period. Monitoring frameworks need to track feature behaviour on an ongoing basis to verify that development assumptions remain valid, and to detect emerging population changes before they affect decision quality (see the sketch after this list).
  • Empirical testing of consumer explanations: Consumer Duty expectations require that explanations provided to customers genuinely support understanding of credit decisions. Governance processes should therefore test whether explanations are meaningful in practice, rather than simply demonstrating that explanation methods exist.
  • Assessment of data vintage risk: Models trained on a narrow set of economic conditions can behave unpredictably when those conditions change. Validation processes should assess data vintage risk to ensure training data is not overly concentrated in specific economic periods, such as the pre-2022 low inflation and low interest rate environment.
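As one possible shape for the feature drift monitoring described above, here is a sketch using a two-sample KS statistic per numeric feature (the flag threshold is illustrative and would need calibrating to each portfolio):

```python
import pandas as pd
from scipy.stats import ks_2samp

def feature_drift_report(dev: pd.DataFrame, live: pd.DataFrame,
                         threshold: float = 0.1) -> pd.DataFrame:
    """Two-sample KS statistic per numeric feature between the development
    sample and live applications, ranked by how far each has moved."""
    rows = [{"feature": col,
             "ks": ks_2samp(dev[col].dropna(), live[col].dropna()).statistic}
            for col in dev.columns]
    report = pd.DataFrame(rows).sort_values("ks", ascending=False)
    report["investigate"] = report["ks"] > threshold
    return report
```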

If these governance capabilities are not in place, machine learning models introduce risks that traditional scorecard frameworks were not designed to manage. The issue is not the modelling technique itself, but whether the surrounding controls are capable of challenging and monitoring it effectively.


Capabilities required to operate modern credit models

Modernising credit models is not just a modelling exercise. It requires coordination across data, development, validation and monitoring to ensure models remain stable, explainable, and aligned with the intended risk and decision outcomes.

In our experience, a small number of capabilities consistently distinguish firms that can modernise safely from those that remain constrained by legacy decision-making.

◻️ Centralised feature stores and documented data lineage. Model features need to be defined and managed consistently across development, deployment and monitoring. Centralised feature stores help ensure that the same inputs are used throughout the model lifecycle, while well-documented data lineage allows firms to trace how data moves from source systems through to model inputs. This supports regulatory expectations around data governance, including BCBS 239.

◻️ Independent Model Validation equipped with machine learning expertise. Independent Model Validation (IMV) teams must be equipped to challenge models built using advanced techniques. This includes understanding non-linear model behaviour, interrogating feature interactions and assessing whether performance remains stable once models are deployed.

◻️ Accountability for threshold adjustments and overrides. Changes to decision thresholds and overrides should be clearly logged and attributed to a named senior manager. This creates a clear audit trail and aligns with SM&CR expectations around individual accountability for model and policy decisions.

◻️ Automated monitoring of population drift and decision errors. Monitoring frameworks should track population drift and decision outcomes in near real time. This allows emerging issues to be identified early and reduces the risk that deterioration in model behaviour translates into credit or conduct risk.

◻️ Reconstruction of the full data flow and decision path. Firms need to be able to reconstruct how a decision was reached, from raw data inputs through to the final customer outcome. This provides the evidence required to demonstrate compliance with Consumer Duty and to respond effectively to regulatory or audit challenge (see the sketch after this list).

◻️ Documented fallback criteria with named ownership. Fallback triggers and performance thresholds should be defined in advance, with clear ownership assigned to a named senior manager. This ensures that decisions to override automated monitoring or revert to baseline models are controlled, consistent and auditable.
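To illustrate the reconstruction point above, a minimal sketch of an auditable decision record (the field names and hashing choice are assumptions, not a prescribed schema):

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib
import json

@dataclass
class DecisionRecord:
    """One auditable row per decision: enough to replay the data flow
    from raw inputs through model output to the final customer outcome."""
    application_id: str
    model_version: str
    input_hash: str          # digest of raw inputs, for later reconstruction
    score: float
    policy_outcomes: dict    # each policy rule and its result
    final_decision: str
    decided_at: str

def build_record(application_id, raw_inputs, model_version,
                 score, policy_outcomes, final_decision) -> dict:
    digest = hashlib.sha256(
        json.dumps(raw_inputs, sort_keys=True).encode()).hexdigest()
    return asdict(DecisionRecord(
        application_id, model_version, digest, score,
        policy_outcomes, final_decision,
        datetime.now(timezone.utc).isoformat()))
```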

With these capabilities in place, credit models can be treated as actively managed systems rather than static tools. Models may evolve as data and economic conditions change, but the surrounding framework ensures lending decisions remain stable, explainable and aligned with risk appetite.

Operating modern credit models with confidence

Credit models will continue to evolve as borrower behaviour, data availability and economic conditions change. The challenge for lenders is ensuring those models remain under continuous oversight as conditions change.

Models built on historical assumptions will drift if they are not actively monitored, challenged and recalibrated. Introducing more advanced techniques does not remove that risk. In many cases, it exposes weaknesses in data, validation and decision frameworks more quickly.

The firms that modernise successfully are not those adopting new modelling approaches fastest, but those that can operate them within a controlled framework. That means clear separation between risk estimation and policy, ongoing validation as a control rather than a checkpoint, and monitoring that detects changes in model behaviour before they affect decision quality.

With these elements in place, credit models can be treated as managed systems rather than static tools. Performance can improve, new data can be incorporated, and modelling approaches can evolve, without weakening control or increasing unintended risk.

Much of our work at Jaywing focuses on exactly this: helping lenders design those frameworks so that modern credit models can be introduced without weakening control.