The author

Nick Sime

Director of Fraud & Credit Risk Modelling

12 March 2026

Sample size and model choice: When GBMs outperform DNNs in credit risk

In a previous article, we looked at how machine learning models behave over time. One of the main conclusions was that well-built GBMs and DNNs can maintain strong performance for longer than many people expect.

But if both approaches are available, how should you decide which to use?

A commonly held assumption is that Deep Neural Networks require very large datasets, while Gradient Boosting Machines tend to perform well on the smaller, structured datasets typical in credit risk modelling. 

Our own research, drawing on nearly 10 years of experience applying these techniques in credit risk modelling, suggests that this assumption is not correct.

When we analysed model performance across different sample sizes, we observed a clear crossover effect in the opposite direction. With smaller samples, DNNs often delivered stronger performance than GBMs. As sample sizes increased, GBMs began to pull ahead.

This has implications for lenders working with smaller portfolios, new products, or limited default histories. Model choice should not be based on general assumptions about which technique “usually works best”. It needs to reflect the structure and size of the available data.

In this article, we look at what drives this behaviour and what it means for model development in practice.

Why sample size matters in credit risk modelling

When people talk about sample size in credit risk modelling, they are usually referring to the number of observed defaults rather than the total number of applications.

Defaults contain most of the information a model needs to learn how risk behaves. A dataset with hundreds of thousands of records can still be relatively small from a modelling perspective if the number of bad outcomes is limited.

This is why many modelling guidelines focus on the number of bads when assessing whether a dataset is sufficient for development.
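As a concrete illustration of this point, the sketch below (using pandas and hypothetical column names such as `default_flag`) shows how a dataset that looks large by row count can be small by bad count:

```python
# Illustrative only: the "effective" sample size for model development is the
# number of bads, not the number of rows. Column names are hypothetical.
import pandas as pd

apps = pd.DataFrame({
    "app_id": range(200_000),
    # Roughly a 1% bad rate: one default per 100 applications
    "default_flag": [1 if i % 100 == 0 else 0 for i in range(200_000)],
})

n_rows = len(apps)                           # 200,000 applications
n_bads = int(apps["default_flag"].sum())     # but only 2,000 bads

print(f"{n_rows} applications, {n_bads} bads")  # → 200000 applications, 2000 bads
```

Despite 200,000 records, a development sample like this would sit at the smaller end of the range discussed below.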

Smaller portfolios, new products, or niche lending segments often have relatively few observed defaults. In those situations, the model has less evidence to learn from, which makes development more sensitive to how the model is configured and selected.

That is where differences between modelling techniques start to appear.

Some approaches are better at extracting stable patterns when data is limited. Others tend to perform better once larger samples are available and more complex relationships can be estimated reliably.

This is exactly the pattern we observed when comparing GBMs and DNNs across different sample sizes.

What we observed when comparing GBMs and DNNs

When we compared GBMs, DNNs, and linear models across different sample sizes using a consistent methodology (including an extensive hyperparameter search), we saw a consistent trend.

With smaller samples, particularly where the number of bad outcomes was below roughly 1,000, DNNs tended to deliver stronger predictive performance than GBMs.

As the number of bads increased, the picture changed. GBMs began to perform more strongly and eventually overtook the DNN models.

This crossover is illustrated in the chart above.
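The shape of this kind of comparison can be sketched with open-source tools. The example below uses scikit-learn and synthetic data; the model settings, sample sizes, and feature counts are illustrative stand-ins, not the configuration or data used in our research, and a real study would add the hyperparameter search and out-of-time testing described above.

```python
# Sketch of the comparison protocol: train a linear model, a GBM, and a small
# DNN on development samples of increasing size, scoring each on the same
# held-out test set. Synthetic data and illustrative settings only.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=40_000, n_features=20, n_informative=8,
                           weights=[0.95], random_state=0)  # ~5% bad rate
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.5,
                                                stratify=y, random_state=0)

results = {}
for n in (1_000, 5_000, 20_000):  # growing development samples
    Xs, ys = X_dev[:n], y_dev[:n]
    models = {
        "linear": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "gbm": GradientBoostingClassifier(random_state=0),
        "dnn": make_pipeline(StandardScaler(),
                             MLPClassifier(hidden_layer_sizes=(32, 16), alpha=1e-3,
                                           early_stopping=True, random_state=0)),
    }
    for name, model in models.items():
        model.fit(Xs, ys)
        results[(n, name)] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f"n={n:>6} bads={int(ys.sum()):>4} {name:<6} AUC={results[(n, name)]:.3f}")
```

Because every model is scored on the same held-out set, the relative movement of the three AUC curves as the sample grows is directly comparable.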

One interesting detail from the analysis was how GBMs behaved on the smallest samples. In some cases, their performance fell below that of a simple linear benchmark. This is usually a sign that the model has fitted too closely to the development data and has started to capture noise rather than genuine predictive structure.

DNNs showed a different pattern. While their performance also improved as sample size increased, they tended to remain more resilient when the number of bads was limited.

Taken together, the results suggest that the common assumption (that neural networks only become useful with very large datasets) does not hold in credit risk modelling. In smaller portfolios, the opposite appears to be true.

Why this happens

The crossover largely comes down to how the two model types learn from data.

GBMs build models sequentially. Each tree is designed to correct the errors made by the previous one. When there is plenty of data, this process works very well. The model can gradually identify increasingly subtle patterns in the dataset.

When the sample is small, the same behaviour can become a weakness. With fewer bad outcomes to learn from, the model may start fitting patterns that only exist in the development period. In other words, it begins to capture noise rather than genuine predictive structure.
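This failure mode can be made visible with scikit-learn's `staged_predict_proba`, which scores the GBM after each boosting stage. In the hedged sketch below (synthetic data, illustrative settings), out-of-sample AUC on a small, noisy sample will typically rise and then flatten or degrade as later trees fit noise:

```python
# Sketch: watch out-of-sample AUC across boosting stages on a small sample.
# Later trees correct ever-smaller "errors", which on limited data are often
# just noise. Synthetic data and illustrative settings only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=30, n_informative=5,
                           weights=[0.9], flip_y=0.1, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          stratify=y, random_state=1)

gbm = GradientBoostingClassifier(n_estimators=400, learning_rate=0.1,
                                 max_depth=3, random_state=1).fit(X_tr, y_tr)

# One test AUC per boosting stage
aucs = [roc_auc_score(y_te, p[:, 1]) for p in gbm.staged_predict_proba(X_te)]
best = int(np.argmax(aucs)) + 1
print(f"best test AUC {max(aucs):.3f} at {best} trees; "
      f"AUC after all 400 trees {aucs[-1]:.3f}")
```

In practice, early stopping or careful tuning of tree count and depth mitigates this, but with very few bads there is simply less signal for the tuning process to lock onto.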

DNNs appear better protected against this failure mode. One likely reason is that they can be heavily regularised, which helps them generalise more smoothly when the number of bads is limited. Techniques such as dropout prevent the model from relying too heavily on any single part of the network, forcing it to learn broader patterns that hold across the data rather than highly specific relationships.

When the dataset is limited, this tends to produce more stable behaviour. Performance may not improve as quickly as sample size grows, but it is less likely to deteriorate sharply when the number of bads is small.
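The dropout mechanism itself is simple. The sketch below shows the standard "inverted dropout" formulation in plain NumPy; it is a generic illustration of the technique, not the regularisation scheme used in any particular framework or in our models:

```python
# Inverted dropout: during training, randomly zero a fraction p of activations
# and scale the survivors by 1/(1-p) so the expected activation is unchanged.
# At inference time the layer is the identity.
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p, training):
    """Apply inverted dropout with drop probability p."""
    if not training or p == 0.0:
        return activations
    mask = rng.random(activations.shape) >= p   # keep each unit with prob 1-p
    return activations * mask / (1.0 - p)

a = np.ones((4, 8))
train_out = dropout(a, p=0.5, training=True)    # roughly half the units zeroed
eval_out = dropout(a, p=0.5, training=False)    # unchanged at inference
```

Because a different random subset of units is silenced on every training pass, no single unit can memorise a handful of bad outcomes on its own, which is the intuition behind the stability described above.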

This difference in learning behaviour helps explain the pattern we observed in the research. GBMs often excel once larger samples are available, while DNNs can hold up better when data is more limited.

What matters most in practice, however, is recognising that neither approach is universally better. Their relative performance depends heavily on the data available to train them.

GBMs vs DNNs: What this means in practice

For many lenders, sample size is not something that can easily be changed. New products, specialist portfolios, and challenger banks often operate with relatively limited default histories. In those situations, the choice of modelling technique becomes more important.

The results suggest that DNNs should not be ruled out in credit risk settings where the number of bad outcomes is limited. In some cases, they may provide stronger discrimination than GBMs on these smaller samples.

Equally, once portfolios grow and more outcomes are observed, GBMs often become very strong performers. Their ability to model complex relationships tends to show through as more data becomes available.

In short, model choice should be driven by evidence from the data rather than assumptions about which technique is “supposed” to work best.

Running GBMs and DNNs side by side makes those trade-offs much easier to see. When both models are developed using the same data and validation framework, the relative strengths usually become clear quite quickly.

This is one of the reasons we built Archetype to support multiple modelling approaches within a single workflow. It allows teams to compare linear models, GBMs and DNNs directly, using consistent validation and out-of-time testing to assess which approach performs best for the portfolio in question.

Final thought: model choice should follow the data

The comparison between GBMs and DNNs is often framed as a debate about which technique is better. In practice, the answer is usually simpler. The model that performs best is the one that fits the structure and size of the available data.

Our analysis suggests that sample size plays a bigger role than many people assume. When the number of bad outcomes is limited, DNNs can perform surprisingly well. As the dataset grows, GBMs often begin to show stronger performance.

Neither outcome should be taken as a rule. The only reliable way to understand how these techniques behave is to test them on the same data and compare the results carefully.

That is exactly the approach we follow when developing models in Archetype. By building GBMs, DNNs and linear models within the same framework, it becomes much easier to see where performance differences come from and how they hold up beyond development.

If you’re exploring machine learning models for credit risk, the most reliable way to understand how different approaches behave is to test them side by side.

Archetype allows modelling teams to build and compare linear models, GBMs and DNNs within the same development framework. That makes it easier to see how sample size, population behaviour and model configuration affect performance before a model goes live.

If you’d like to see how this works in practice, we’d be happy to talk.