Machine learning models are now a routine part of credit risk decisioning for many lenders, and the performance benefits over traditional approaches, typically a 5-10% increase in relative Gini, are too significant to ignore. However, models based on advanced techniques are, rightly, subject to regulatory scrutiny and greater caution from many risk managers.
GBMs and DNNs are still relatively new in heavily regulated environments like banking and insurance. They’re less transparent than linear models, involve many more parameters, and haven’t yet been through multiple full economic cycles in live use.
In fact, it’s a concern we shared initially.
As Nick Sime puts it:
“Back in 2018, when we first started building these models, I expected models to degrade faster as a natural consequence of the complexity. It soon became apparent that my concerns were misplaced.”
Now, with the passage of time, we are in an excellent position to understand the longer-term performance stability of machine learning models compared to traditional linear approaches.
We tracked performance with the benefit of multi-year outcome periods that extend well beyond typical redevelopment cycles, comparing GBMs and DNNs with traditional linear benchmarks.
This article sets out what we’ve observed. It’s the first in a short series exploring how modern ML models behave after deployment and what that means for managing model risk in practice.
Much of the analysis referenced here was carried out using Archetype, which allows us to develop linear models, GBMs, and DNNs within a single framework.
About the author: Nick Sime, Jaywing's director of risk modelling, has been at the forefront of risk modelling since heading up the discipline at HSBC 30 years ago. He is an advocate for using explainable AI safely to control risk, helping Jaywing clients such as HSBC, Virgin Money, Nationwide, and many challenger banks capture the predictive advantages these technologies offer, in terms of increased acceptance levels and reduced financial losses, whilst ensuring appropriate safeguards are in place to protect the integrity of customer decisions. Jaywing have been implementing controllable and explainable AI models since 2018 and have long-run data demonstrating their robustness and stability.

Why credit models degrade over time
All risk models degrade over time. The question is why, and how quickly.
There are a few common reasons.
- Economic and market conditions change: Changes in interest rates, competitor activity, or employment affect borrowers’ ability to repay. Patterns that were predictive at build become less reliable as conditions move on.
- The applicant population changes: New customer segments appear, others disappear, and behaviours evolve. Models trained on an earlier population can struggle if those changes are material.
- Risk drivers themselves change: Regulatory updates, product changes, and new customer behaviour can alter how individual characteristics relate to default. Even if the population looks similar, the risk dynamics can drift.
- Development choices matter: Models that are overfitted, poorly validated, or built on fragile assumptions are likely to degrade faster once they are live. Weaknesses introduced at build often only become visible over time.
These factors affect all credit models, regardless of technique.
Why machine learning raises additional concerns
Machine learning models introduce the same risks as any other credit model, but they also add a few of their own.
They are very different to established logistic regression approaches that regulators and risk teams have decades of experience with. GBMs and DNNs are newer in highly regulated environments and have not yet been through multiple full economic cycles in live production.
Whilst steps can be taken to improve the explainability of ML models, they are, inevitably, harder to interpret. Compared to linear models, they contain far more parameters, which makes it less obvious how individual characteristics influence outcomes.
Taken together, this leads to a natural concern that the performance of ML models will degrade more rapidly over time.

Why complexity ≠ instability
Modern machine learning models have several innate properties that support stability.
Most ML approaches aggregate many simple components: GBMs combine many shallow decision trees, and DNNs combine many non-linear functions across layers. Rather than relying on a single equation, they aggregate many weak decisions, which tends to smooth out errors and reduce sensitivity to noise.
ML models can also use a broader set of characteristics than linear models. Linear approaches often hit a ceiling where adding more inputs brings little benefit and can introduce instability. When a small number of characteristics drive most of the prediction, any loss of signal can have a large impact. ML models tend to spread risk across many inputs, reducing reliance on any single factor.
They are also less dependent on fixed correlation structures. Linear models assume stable relationships between characteristics, even though those relationships often change over time. Tree-based and other non-linear models are generally less exposed to this kind of correlation drift.
Finally, most ML techniques include regularisation as a core feature. These mechanisms limit overreaction to noise and help models generalise better when data becomes less stable.
None of this guarantees long-term stability. But it helps explain why complexity alone doesn’t automatically lead to faster degradation.
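To make these properties concrete, here is a minimal sketch, using XGBoost as a representative GBM library, of how ensembling of weak learners and regularisation appear as explicit configuration choices. The parameter values are illustrative assumptions rather than a recommended production setup.

```python
# Minimal, illustrative XGBoost configuration (not a tuned production setup)
# showing where the stabilising properties discussed above live as parameters.
import xgboost as xgb

params = {
    "objective": "binary:logistic",   # default/non-default as a binary outcome
    "eta": 0.05,                      # small learning rate: each tree contributes a little
    "max_depth": 3,                   # shallow trees act as many weak decisions
    "subsample": 0.8,                 # row subsampling reduces sensitivity to noise
    "colsample_bytree": 0.8,          # spread reliance across many characteristics
    "lambda": 1.0,                    # L2 regularisation on leaf weights
    "alpha": 0.1,                     # L1 regularisation encourages sparser trees
    "min_child_weight": 50,           # ignore splits supported by too few cases
}

# dtrain would be an xgb.DMatrix built from the development sample:
# model = xgb.train(params, dtrain, num_boost_round=500)
```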
The following lessons reflect what we see most often when machine learning models are taken from development into live use.
Lesson #1: Overfitting is where stability breaks
If ML models are technically sound and empirically robust, what is the problem?
This comes back to how the model was fitted at development.
On many occasions, we see that the strongest-looking test performance is achieved by models with a significant performance gap between build and validation samples. This should be a concern.
As Nick puts it:
“Very often, the choice of model configuration is made based on hyper-parameter search. It can be wise to trade off a little test performance for a model with a narrower gap between build and validation performance. That’s far more likely to result in stable long-term behaviour.”
We see this in the data.
More conservatively fitted models usually age better. They may give up a small amount of initial uplift, but are more resilient as conditions change.
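As a minimal illustration of that selection principle, the sketch below rules out candidates with a wide build-validation Gini gap before comparing performance. The candidate figures and the gap threshold are hypothetical, not taken from our analysis.

```python
# Hypothetical hyper-parameter search results: (name, build Gini, validation Gini).
candidates = [
    ("gbm_deep",    0.82, 0.71),   # strongest build figure, but a wide gap
    ("gbm_shallow", 0.76, 0.73),
    ("dnn_small",   0.75, 0.72),
]

MAX_GAP = 0.03  # tolerated build-validation gap (assumed threshold)

# Keep only acceptably stable candidates, then take the best validation Gini.
stable = [c for c in candidates if c[1] - c[2] <= MAX_GAP]
chosen = max(stable, key=lambda c: c[2]) if stable else None
print(chosen)  # ('gbm_shallow', 0.76, 0.73)
```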

💡 The takeaway: Performance on an independent test sample should not be your only selection criterion. Look carefully at the gap between build and validation performance and make prudent choices, favouring simpler configurations for long-term stability.
Lesson #2: Machine learning models should be highly resilient
For many reasons, some of which we have already covered, models (irrespective of type) are likely to degrade over time.
The complexity and limited transparency of machine learning models understandably raise concerns. But when you actually track performance over multiple years, the story isn't that simple.
Across our analysis, both GBMs and DNNs maintain a clear uplift over the linear benchmark throughout a 5-year outcome window. Over time, that performance gap widens rather than closes.

If more rapid degradation were an inherent feature of machine learning, you’d expect performance to converge over such a lengthy period. That isn’t what we observe. Instead, performance uplifts are resilient when models are built and selected with appropriate care.
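For teams wanting to reproduce this kind of tracking on their own portfolios, a minimal sketch is shown below: Gini (2 x AUC - 1) computed per outcome year for each model's score. The column names and the layout of the scored DataFrame are assumptions.

```python
# Minimal sketch of multi-year performance tracking. Assumes one row per account
# with columns: outcome_year, defaulted (0/1), and a score column per model.
import pandas as pd
from sklearn.metrics import roc_auc_score

def gini_by_year(df: pd.DataFrame, score_col: str) -> pd.Series:
    """Gini (2 * AUC - 1) of a score against observed default, per outcome year."""
    return df.groupby("outcome_year").apply(
        lambda g: 2 * roc_auc_score(g["defaulted"], g[score_col]) - 1
    )

# Uplift of a GBM score over a linear benchmark, year by year (assumed column names):
# uplift = gini_by_year(scored, "gbm_score") - gini_by_year(scored, "linear_score")
```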
💡 The takeaway: Long-term performance stability depends less on model class alone and more on how models are built and selected. Given their underlying properties, well-designed GBMs and DNNs can deliver stronger and more stable performance over time than traditional linear approaches.
Lesson #3: Broaden your horizon
Typically, we observe developers of machine learning models following thorough, disciplined workflows.
Almost without exception, model builders will run a hyperparameter search and land on the model with the best performance. Just as commonly, the chosen model will then be applied to an out-of-time test sample, with a check to ensure the Gini has not fallen and the uplift versus the incumbent model is maintained.
But this misses a trick, because it ignores the potential for machine learning models, particularly heavily overfitted ones, to capture noise that is specific to the development period.
Performing model selection on out-of-time samples gives a more realistic view of performance, removes temporal noise, and so provides a more rigorous test of the model's ability to generalise when deployed.
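Below is a minimal sketch of what selecting on an out-of-time window might look like; the column names and cut-off date are illustrative assumptions rather than a prescribed workflow.

```python
# Minimal sketch of out-of-time model selection. Assumes application-level data
# with columns: application_date, defaulted (0/1), and one score column per candidate.
import pandas as pd
from sklearn.metrics import roc_auc_score

def gini(frame: pd.DataFrame, score_col: str) -> float:
    return 2 * roc_auc_score(frame["defaulted"], frame[score_col]) - 1

def select_on_oot(df: pd.DataFrame, score_cols: list, cutoff: str) -> str:
    """Pick the candidate with the best Gini on applications after the cut-off,
    rather than on the in-time test split the search was tuned against."""
    oot = df[df["application_date"] >= pd.Timestamp(cutoff)]
    return max(score_cols, key=lambda col: gini(oot, col))

# e.g. select_on_oot(scored_apps, ["gbm_a", "gbm_b", "dnn_a"], "2021-01-01")
```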
💡 The takeaway: If you select models primarily on in-time performance, projected benefits may be overstated and, more crucially, longer-term stability may suffer. Being able to run out-of-time validation alongside in-time testing is critical here. In Archetype, this kind of comparison is built into the workflow, which makes it easier to spot overfitting before models go live.
Lesson #4: Add an expert layer
Machine learning models can be developed in highly automated processes: thousands of features can be manipulated, transformed, and reduced to create a model with just a few clicks.
Within the model development process, a manual review has an important role to play and provides additional control to support longer-term model performance.
Logical constraints control the marginal impact of features on the final prediction. They support explainability, reduce noise, and prevent models from memorising patterns that don’t persist. In practice, features should be constrained by default, with exceptions only permissible where clearly justified.
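As a minimal illustration of constraining features by default, the sketch below uses XGBoost's monotone_constraints parameter; the feature names and directions are assumptions chosen purely for illustration.

```python
# Minimal sketch of monotonic (logical) constraints in XGBoost. The i-th constraint
# applies to the i-th feature column; +1 means predicted risk may only rise with the
# feature, -1 means it may only fall. Feature names here are illustrative.
import xgboost as xgb

features = ["utilisation", "months_since_last_delinquency", "income"]

params = {
    "objective": "binary:logistic",
    "monotone_constraints": "(1,-1,-1)",  # constrain every feature by default
}

# dtrain = xgb.DMatrix(X[features], label=y)   # development sample (assumed names)
# model = xgb.train(params, dtrain, num_boost_round=500)
```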
A manual expert overlay is equally important. Reviewing model components for clear, sensible directional impact helps identify weak or noisy variables that may not remain predictive over time.
In credit risk, where stability matters, these controls often matter more than the choice between GBMs and DNNs. While we see broadly similar long-term behaviour from both, there are nuances we’ll return to in a later article.
💡 The takeaway: Stability comes from discipline and judgement, not from the algorithm alone.
Final thought: Time is the real test
For lenders, the debate around machine learning has moved on.
What we see consistently is that long-term performance comes down to development discipline. Simpler configurations, sensible constraints, proper validation, and a final expert review all make a difference to how models behave over time.
There’s also a trade-off to be aware of. Being overly cautious with machine learning, or stepping back from it based on assumptions about instability, can mean leaving performance on the table year after year.
Archetype is built to support this way of working. It makes it easier to apply these controls consistently and to compare different modelling approaches side by side, so teams can move from debate to decisions they’re comfortable standing behind.
The next question (and the focus of the next article in this series) is what drives differences in resilience in practice, particularly around sample size, population change, and stress.
That’s where we’ll go next.
Did you know? Archetype allows teams to develop and assess linear models, GBMs (including XGBoost), and DNNs side by side, using a consistent data pipeline and validation framework. That makes it easier to see where uplift comes from, where it holds up, and where risk starts to creep in.
Talk to us to find out more. We’d genuinely love to help.