AI-based scoring is largely replacing the traditional scorecard, with lenders taking advantage of explainable machine learning techniques to maximise the uplift they get from their data. But AI has done more than just transform the way lenders construct their application models – it also calls for an evolution of the traditional scorecard monitoring approach. In this blog post Jaywing’s Head of Modelling, Nick Sime, reflects on his experience of developing these models for over 40 different lenders and gives some best practice advice on how to advance your monitoring activity.
At the time Jaywing launched Archetype, our multi-award-winning AI-based modelling platform, almost all lenders were using tried and tested linear-based approaches for the creation of application scores.
Since then, the industry has changed rapidly, led in the main by challenger banks and followed by their traditional counterparts, with new models now built almost exclusively using AI techniques. Having overcome the ‘black box’ problem, explainable approaches like the one we’ve deployed within Archetype have thrown open the doors, enabling robust, stable models to transform the way lending decisions are made.
Indeed, this has been a positive move for lenders, with uplifts in discrimination being particularly strong for multi-bureau models and for fraud-based outcomes, where evidence of unusual patterns or discrepancies can be exploited.
The benefits of efficient credit scoring models are well understood, enabling higher accept rates and/or lower levels of arrears. However, in the increasingly dominant online, broker-based markets, the pricing strategies that convert profitable business and provide an additional competitive edge over other lenders are still largely set using weaker traditional models.
Although linear and AI-based models are very different in nature, the requirements for monitoring performance and determining when a rebuild is required are similar – albeit with some nuances.
Linear models are, of course, simpler. Typically, optimal discrimination is achieved using 15-20 input characteristics, meaning that the ongoing task of reviewing the stability and performance of each input is not unduly onerous.
While much depends on the specifics of the development data, our experience is that optimal discrimination within AI models is achieved with many more features than within a linear approach. This is not surprising given the exponential growth in the number of possible interactions as inputs are added. The law of diminishing returns applies, so a manageable number of features can be retained whilst leaving relatively little predictive power untapped, but the number of inputs within an AI-based model will still exceed that of a linear model. This presents a complex challenge for characteristic-level reporting: the use of dashboards to flag potential issues has never been more important.
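To illustrate the diminishing-returns point (a generic sketch, not the Archetype methodology), one could rank features by importance from a gradient-boosted model and track hold-out Gini as progressively more of them are retained; the data objects and feature counts below are hypothetical:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# X_train, y_train, X_test, y_test are hypothetical numpy arrays for the
# development and hold-out samples
full_model = GradientBoostingClassifier().fit(X_train, y_train)
ranking = np.argsort(full_model.feature_importances_)[::-1]  # most important features first

for k in (10, 20, 40, 80, X_train.shape[1]):
    cols = ranking[:k]
    m = GradientBoostingClassifier().fit(X_train[:, cols], y_train)
    gini = 2 * roc_auc_score(y_test, m.predict_proba(X_test[:, cols])[:, 1]) - 1
    print(f"{k:>3} features: hold-out Gini = {gini:.3f}")  # uplift typically flattens as k grows
```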
Much of the characteristic stability reporting performed by lenders is designed to detect a shift in the population from the baselines observed in the development samples. Although characteristic instability does not in itself necessitate a model rebuild, it does alert the model manager to the risk that degradation may become evident once the recently acquired business reaches maturity.
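By way of illustration, a characteristic-level stability check of this kind is often based on the population stability index (PSI). The sketch below assumes hypothetical `dev` and `recent` DataFrames holding the development and recent populations, a `model_features` list, and a common rule-of-thumb alert threshold:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population stability index between a development (expected) and recent (actual)
    sample of a single characteristic."""
    # Bin edges are taken from the development sample so both samples share a baseline
    edges = np.unique(np.percentile(expected, np.linspace(0, 100, bins + 1)))
    edges[0], edges[-1] = -np.inf, np.inf
    exp_pct = np.histogram(expected, edges)[0] / len(expected)
    act_pct = np.histogram(actual, edges)[0] / len(actual)
    # Avoid division by zero / log of zero for empty bins
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Flag characteristics whose distribution has drifted from the development baseline
scores = {name: psi(dev[name], recent[name]) for name in model_features}
drifted = {name: s for name, s in scores.items() if s > 0.1}  # 0.1 / 0.25 are common thresholds
```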
Linear models are quite specific to the population upon which they are developed – hence the keen interest in population stability. A highly desirable feature of AI models is that they can provide optimal performance at a segment level, and it can be argued that a shift in the mix of business might therefore be less impactful.
There are also counter-arguments: AI-based models are predicated on interactions, and those interactions could in principle prove unstable over time. The evidence we have seen so far is encouraging, and Jaywing’s belief is that our new generation of models is at least as stable as the simpler models of the past.
A common, and sometimes concerning, feature of AI-based models is the degree of overfitting observed during the development process. Any marked overfitting will be evident within retrospective scorecard monitoring, with higher levels of discrimination and steeper alignment patterns coinciding with the sample window used in the model development. This is something that can be controlled with suitable tuning of hyper-parameters; models developed within Archetype are subject to far lower levels of overfitting than many of those developed elsewhere, whilst still retaining strong uplifts on independent test data.
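One way to surface this in retrospective monitoring is to compare discrimination on cases falling inside the development sample window with cases scored afterwards. A minimal sketch, assuming a hypothetical `portfolio` DataFrame of matured accounts and an illustrative development window:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def gini(outcome, score):
    # Gini coefficient = 2 * AUC - 1, the usual discrimination measure for scorecards;
    # assumes higher scores indicate higher risk of bad (invert the score if not)
    return 2 * roc_auc_score(outcome, score) - 1

# 'portfolio' holds application_date, bad_flag and the model score for matured accounts
in_window = portfolio["application_date"].between("2021-01-01", "2022-06-30")  # development window (illustrative)
print("Gini inside development window :",
      round(gini(portfolio.loc[in_window, "bad_flag"], portfolio.loc[in_window, "score"]), 3))
print("Gini on later (out-of-time) data:",
      round(gini(portfolio.loc[~in_window, "bad_flag"], portfolio.loc[~in_window, "score"]), 3))
# A markedly higher in-window Gini suggests the model has memorised its development sample
```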
The primary function of scorecard monitoring is to indicate when a model should be redeveloped. When the population is stable, discrimination is in line with expectations, and the observed and expected odds are aligned at characteristic attribute level, it is generally accepted that linear model performance is near-optimal.
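For completeness, an attribute-level observed-versus-expected odds check of the kind described above might look like the following sketch; the column and characteristic names are hypothetical:

```python
import pandas as pd

# 'portfolio' is a hypothetical DataFrame of matured accounts with the model's predicted
# bad rate ('expected_pd'), the observed outcome ('bad_flag') and the characteristic's attribute
alignment = portfolio.groupby("residential_status").agg(
    accounts=("bad_flag", "size"),
    observed_bad_rate=("bad_flag", "mean"),
    expected_bad_rate=("expected_pd", "mean"),
)
alignment["observed_odds"] = (1 - alignment["observed_bad_rate"]) / alignment["observed_bad_rate"]
alignment["expected_odds"] = (1 - alignment["expected_bad_rate"]) / alignment["expected_bad_rate"]
# Ratios close to 1 indicate the model remains well aligned for each attribute
alignment["odds_ratio"] = alignment["observed_odds"] / alignment["expected_odds"]
print(alignment.round(3))
```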
Whilst the above observations would all be highly encouraging in the context of monitoring an AI-based model, the level of certainty that the model is optimal can never be as high. The uplift achieved by the new generation of models comes from complex patterns and interactions, and it is not easy to isolate, assess or measure their predictive stability in a prescriptive, report-based format.
Consequently, we believe the best way to assess the optimality of an AI model is to build a challenger model. While some investment in the development process is required, this can be made easier by re-using the metadata and other inputs from the original development, including equivalent explainability constraints. Archetype models make this easy, enabling new models to be constructed so that they follow exactly the same rules as ones that have already been agreed upon by the business.
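Outside a dedicated platform, the same champion-versus-challenger idea can be sketched with a generic gradient-boosting library that supports monotonicity constraints, so the challenger respects the directional rules already agreed with the business. The feature names, constraint directions and data objects below are all hypothetical:

```python
import xgboost as xgb
from sklearn.metrics import roc_auc_score

# Re-use the feature list and agreed explainability rules from the original development:
# +1 forces risk to increase monotonically in a feature, -1 to decrease, 0 leaves it unconstrained
features = ["utilisation", "months_since_last_default", "num_recent_searches"]  # hypothetical
monotone = (1, -1, 1)

challenger = xgb.XGBClassifier(
    n_estimators=400,
    max_depth=3,
    learning_rate=0.05,
    monotone_constraints=monotone,
    eval_metric="auc",
)
challenger.fit(X_dev[features], y_dev)

# Compare discrimination with the incumbent model on the same out-of-time sample
champion_gini = 2 * roc_auc_score(y_oot, champion_scores) - 1
challenger_gini = 2 * roc_auc_score(y_oot, challenger.predict_proba(X_oot[features])[:, 1]) - 1
print(f"Champion Gini {champion_gini:.3f} vs challenger Gini {challenger_gini:.3f}")
```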
So, whilst traditional monitoring processes remain of value and may highlight areas of instability or under-performance, a re-training process will be needed to gain a closer understanding of the uplifts that might be achieved from a rebuild of an AI-based model. Using tools like Archetype, your monitoring activity can effectively generate the ‘best case’ alternative model, meaning you can go straight to implementation of an improved version should you wish to.
For more information about how Archetype can optimise your model development, or to discuss how we can help you advance your model monitoring capability to support non-linear models, contact us.