We’ve worked extensively with lenders on the development of machine learning models, ensuring they are fair, stable, controlled and explainable. These models drive significant business benefit, but monitoring needs to be fit for a non-linear world. In this blog, Head of Modelling, Nick Sime, offers advice on the key points to consider when monitoring Machine Learning (ML) models.
As lenders advance from the use of traditional linear modelling techniques to an ML-based approach, they will inevitably consider the suitability of their current scorecard monitoring and how this might be adapted. The good news is that many aspects of model monitoring for linear and non-linear models are shared; however, there are also material differences that should be considered.
Linear model performance is normally subject to a ‘flat maximum effect’ whereby the inclusion of additional characteristics quickly reaches a point of diminishing returns. Scorecards based on linear models will typically be based on 10-20 terms, theoretically enabling results at a characteristic level to be quite closely examined.
ML techniques enable more characteristics to be added to improve performance; consequently, the models will use a greater number of inputs. Many factors will influence this, with the size of the training sample being a significant consideration. Typically, an ML model will have 3-4 times as many characteristics, multiplying the outputs and increasing the potential for key messages to be lost in the detail. The discipline of creating clear dashboards that neatly signpost the most material findings becomes even more crucial.
Monitoring overall performance
Not all aspects of monitoring need significant modification. Analysis that focuses on the score, such as (log odds) calibration and Gini coefficient, remains fundamentally important in understanding the predictive qualities of the model. Lenders may wish to benchmark recently developed ML models against their linear predecessors. While it’s clear that more complex ML models are capable of generating uplift on the development data, a natural question is whether those benefits persist out-of-time and whether the models remain stable. Certainly, it remains important to monitor this; in our experience, the uplifts observed at the point of model development are sustained in subsequent data – and degradation of performance shouldn’t occur more quickly than for traditional models.
The ability of a single ML model to perform well across multiple sub-populations is well accepted, removing the need for model segmentation often used within a linear approach. Logically, therefore, ML models are more resilient to shifts in applicant profile. Put differently, linear models are ill-suited to capturing interactions between features and so average over them, meaning that subsequent shifts in distribution drive misalignment that wouldn’t occur if the interaction were included in the model. Nevertheless, the performance of ML models isn’t immune to changes in applicant profile - particularly if there has been an influx of business from a newly introduced source that is not well represented in historic data. Our advice therefore is to continue to examine characteristic level stability.
While many lenders are well-versed in the tracking of characteristic stability, characteristic-level performance assessment is often less rigorous. Generally, there should be a very close alignment between observed and expected bad rates at the characteristic level. If not, it’s a clear sign that the model is sub-optimal and could be improved. Significant deviations should be flagged clearly within the reporting suite, and a consistent pattern of extensive misalignment should trigger a scorecard redevelopment. Where practical, characteristic alignment analysis should be extended to variables not included in the models, as any performance gaps indicate the potential to improve model performance.
Traditional misalignment approaches, using Delta scores or Marginal Information Value, look at performance at an overall ‘main effect’ level for each characteristic. This assessment doesn’t take account of the potential for a characteristic to add predictive power to the model via interactions. A characteristic might appear to be well aligned when analysed in isolation, yet might add predictive power to an ML model, meaning that such monitoring may yield some “false negatives” - features that would contribute positively if added to the model, but which are not flagged for attention by the reports. Assessment of this is complex, and perhaps beyond the scope of most regular monitoring reports, but solutions are possible.
A final area where ML models differ from linear models is their explainability. Linear models have very simple specifications that can be carefully examined at the point of model development to ensure a sensible dependence between the output and each of its inputs. This long-standing practice is key to ensuring that the resulting score assignments are defensible. The same scrutiny must also be applied in the case of ML models, but simply looking at average behaviours (e.g. Partial Dependence Plots) is absolutely not sufficient in this regard. Instead, careful examination of Individual Conditional Expectation plots is required.
For linear models, the desired relationships that are achieved at the point of development necessarily persist into the future and need not be continually re-assessed. The same is not true of ML models - their complexity means that the behaviour of the model on new data cannot be known with certainty in advance (unless “interpretability by design” is achieved by constraining the model behaviour at the point of development – a feature of our Archetype software) and so must also be continually re-evaluated over the life of the model. Doing so requires a quantitative measure of the level of adherence to desired behaviour. We use statistics derived from Individual Conditional Expectation data for this purpose, though extensions of LIME and SHAP methodologies may also be suitable. In any event, detecting undesirable behaviour only after the event through monitoring is far from ideal; better to eliminate it at the outset via “interpretability by design”.
With appropriate data management processes in place, and the ability to retain the metadata structures used in the creation of models, we would typically recommend assessing the uplift associated with a model refresh on a biannual basis, or more frequently for rapidly evolving or new portfolios.
Although well-designed scorecard monitoring can help identify areas of potential model weakness, a periodic retrain of the model is the most reliable way of assessing whether your ML model is optimally leveraging the complex patterns within your data.