Predicting when machine learning models fail in production - Naver Labs Europe

Reducing risks and advancing the decision-making process for Natural Language Processing models in production

This blog post accompanies our published EMNLP 2019 paper:
To Annotate or Not? Predicting Performance Drop under Domain Shift.
Hady Elsahar and Matthias Gallé


The moment you put a model in production, it starts degrading. Building machine learning models that perform well in the wild at production time is still an open and challenging problem. It is well known that modern machine learning models can be brittle: even when they achieve impressive performance on the evaluation set, their performance can degrade significantly when they are exposed to new examples that differ in vocabulary and writing style. This discrepancy between the examples seen during training and those seen at inference can cause a drop in the performance of ML models in production. That drop can pose unaffordable risks for critical applications whose data evolves quickly, such as malware detection or sentiment analysis for automatic trading. It has become common knowledge among ML practitioners that no solution will work perfectly forever.

Knowing all this, continuous monitoring and maintenance of ML models in production becomes necessary. Practitioners continuously sample and annotate evaluation examples from the stream of production input data and evaluate their models against them using several methods. This maintenance process is inevitably time- and resource-consuming. Continuous manual labelling of evaluation datasets not only costs time and money but also prevents anticipating risks before they occur.

continuous manual labelling of evaluation datasets image

More crucially, this expensive maintenance process continues for as long as one wants decent performance from ML models deployed in production. Motivated by the literature on domain shift and out-of-distribution detection, we propose a method that predicts the performance drop of a model when it is evaluated on a new target domain, without the need for any labelled examples from that target domain. When done accurately and in real time, this estimation can have an important impact on decisions about debugging and maintaining machine learning models in production. For instance, such insights can drive the decision to annotate more data for retraining, or to adjust models accordingly (e.g. by performing unsupervised domain adaptation).

Our proposal has two parts:

  1. A set of domain-shift detection metrics, along with our adaptations of them. The values of those metrics should correlate well with the performance drop of ML models in production.
  2. A regression method that directly learns the value of the performance drop of an ML model when it is exposed to domain shift.

1. Domain-Shift Detection Metrics 

In the machine learning literature, the discrepancy between training and inference is generally referred to as domain shift or dataset shift. Domain shift occurs when the joint distribution of inputs and outputs differs between the source and target domains (more formally, Pₛ(x, y) ≠ Pₜ(x, y)). This can be due to covariate shift, where the model in production starts seeing examples from an input distribution different from the one seen in training (Pₛ(x) ≠ Pₜ(x)); to concept shift, where similar examples are expected to be classified differently because the domain context has changed (Pₛ(y|x) ≠ Pₜ(y|x)); or to label shift, where the label distribution changes (Pₛ(y) ≠ Pₜ(y)).
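One way to see how these three cases relate is through the two factorisations of the joint distribution. The block below follows the usual textbook formulation rather than anything specific to our paper: each named shift changes one factor, and the standard definitions of covariate and label shift additionally assume that the complementary conditional stays fixed. That extra assumption is stated here for illustration only.

```latex
\begin{align*}
P(x, y) &= P(y \mid x)\, P(x) = P(x \mid y)\, P(y) \\
\text{covariate shift:} \quad & P_S(x) \neq P_T(x), \quad P_S(y \mid x) = P_T(y \mid x) \\
\text{concept shift:} \quad & P_S(y \mid x) \neq P_T(y \mid x) \\
\text{label shift:} \quad & P_S(y) \neq P_T(y), \quad P_S(x \mid y) = P_T(x \mid y)
\end{align*}
```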

Many domain-shift detection metrics have been proposed, each concerned with one or several of those shift types. We experiment with three metrics from three different families, along with our adaptations of them:

  • Metrics motivated by the H-divergence literature [2] such as proxy A-distance. 
  • Confidence estimation and calibration for out-of-distribution detection [3].
  • Reverse testing [4]. 
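To give a feel for the confidence-based family, here is a minimal sketch of one possible shift score: the difference in average predicted-class confidence between source and target examples. The function names and the logits are made up for illustration, and this is not necessarily the exact metric or calibration procedure used in the paper.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_confidence(batch_logits):
    """Average probability assigned to the predicted (arg-max) class."""
    return sum(max(softmax(row)) for row in batch_logits) / len(batch_logits)


def confidence_drop(source_logits, target_logits):
    """Hypothetical shift score: positive when the model is less sure on target."""
    return mean_confidence(source_logits) - mean_confidence(target_logits)


# Made-up logits: the model is confident on source and hesitant on target.
source_logits = [[4.0, 0.0], [0.0, 4.0], [3.5, -0.5]]
target_logits = [[0.3, 0.0], [0.0, 0.5], [1.0, 0.2]]
print(round(confidence_drop(source_logits, target_logits), 3))
```

Note that no target labels are needed: the score only uses the model's own output probabilities.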

While those metrics come from well-studied lines of work, our proposed adaptations try to mitigate some of their problems in practice. For example, the family of H-divergence-based measures is task- and model-agnostic, which makes these measures prone to failure when there is a severe change in the marginal distribution that is irrelevant to the task. For more details about the selected metrics and our proposed modifications, we invite you to read the paper.

Proxy A-distance image

A figure showing how the Proxy A-distance, an example of a domain-shift detection metric for covariate shift, is calculated in practice using a domain classifier.
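The computation in the figure can be sketched as follows. The Proxy A-distance is commonly defined as 2(1 − 2ε), where ε is the held-out error of a classifier trained to tell source examples from target examples. To keep the sketch dependency-free, it uses a nearest-centroid domain classifier on toy 2-D feature vectors. Both the classifier and the synthetic data are assumptions made for illustration, not the setup from the paper.

```python
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def proxy_a_distance(source, target):
    """PAD = 2 * (1 - 2 * err), err being the held-out domain-classifier error."""
    # Even-indexed examples train the domain classifier, odd-indexed ones test it.
    src_train, src_test = source[0::2], source[1::2]
    tgt_train, tgt_test = target[0::2], target[1::2]
    c_src, c_tgt = centroid(src_train), centroid(tgt_train)

    errors = 0
    for v in src_test:  # source example wrongly labelled as target
        if sq_dist(v, c_tgt) < sq_dist(v, c_src):
            errors += 1
    for v in tgt_test:  # target example wrongly labelled as source (ties count)
        if sq_dist(v, c_src) <= sq_dist(v, c_tgt):
            errors += 1
    err = errors / (len(src_test) + len(tgt_test))
    return 2.0 * (1.0 - 2.0 * err)


def sample(rng, mean, n=200):
    """Toy stand-in for feature vectors of n examples from one domain."""
    return [[rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)] for _ in range(n)]


rng = random.Random(0)
source = sample(rng, 0.0)
print("same distribution:", proxy_a_distance(source, sample(rng, 0.0)))
print("shifted inputs:   ", proxy_a_distance(source, sample(rng, 10.0)))
```

When the two domains are indistinguishable the classifier error is close to 0.5 and the distance is close to 0; when they are trivially separable the error approaches 0 and the distance approaches 2. In a real NLP setting one would train a stronger classifier on text features instead of the nearest-centroid rule used here.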


adaptation classic Proxy A-distance image

Our proposed adaptation of the classic Proxy A-distance metric, which incorporates task-specific features.

Domain shift → Drop in performance? 

One general problem with such metrics is that their absolute values cannot be directly translated into the performance drop of a model on a specific task. To investigate this further, we conduct large-scale experiments that measure the correlation between the values of domain-shift detection metrics and the actual performance drop. For that, we rely on large-scale datasets containing 5 million sentences and more than 500 simulated domain-shift scenarios for sentiment analysis and part-of-speech tagging. The figure below shows that, while there is a general correlation, the linear correlations within single models are more pronounced. We therefore conclude that different models react differently to domain shifts of the same magnitude, and that the absolute values of domain-shift detection metrics are model-dependent.

values of a domain shift detection image

A scatter plot showing the correlation between the values of a domain-shift detection metric and the actual drop in the performance of the trained models (each model is represented by a single colour).
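The difference between pooled and per-model correlation can be reproduced on toy numbers. In the sketch below, two hypothetical models each lose accuracy linearly with the shift metric but at different rates. The numbers are invented to illustrate the effect and are not taken from our experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented values of a shift metric measured on five target domains.
shift = [0.2, 0.6, 1.0, 1.4, 1.8]

# Two hypothetical models with different sensitivity to the same shift.
drops = {
    "model_a": [0.1 * s for s in shift],
    "model_b": [0.5 * s for s in shift],
}

per_model = {name: pearson(shift, d) for name, d in drops.items()}
pooled = pearson(
    shift * len(drops),
    [d for model_drops in drops.values() for d in model_drops],
)

print("per-model:", per_model)
print("pooled:   ", pooled)
```

Each model on its own is perfectly linear in the metric, yet pooling the two weakens the correlation, which is why a separate relationship is fitted per model.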

2. Direct prediction of performance drop

The solution we propose for directly predicting the performance drop is to learn the model's fragility to domain shifts of different amplitudes.

Given a small, fixed number of labelled evaluation datasets from different source domains, one can fit a regression line between the drop in model accuracy and a domain-shift detection metric of choice. We can then use this regression line to predict the performance drop of the model when it is evaluated on the target domain. Requiring a small, fixed number of evaluation datasets is a non-negligible cost. However, our proposed method does not require any labels from the target domain, which allows this check to be performed at runtime and can therefore have a large impact on decisions made at that point. More importantly, this overhead is paid only once, unlike the usual maintenance overheads of ML models in production, which recur indefinitely.
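A minimal sketch of this regression step, assuming ordinary least squares on a single metric: the helper names and all numbers below are hypothetical and stand in for metric values and accuracy drops measured on the labelled evaluation datasets.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict_drop(slope, intercept, metric_value):
    """Predicted performance drop for an unlabelled domain."""
    return slope * metric_value + intercept


# Hypothetical calibration data: one point per labelled evaluation dataset.
metric_on_labelled = [0.1, 0.5, 0.9, 1.3]    # shift metric vs. training domain
observed_drop = [0.01, 0.04, 0.08, 0.11]     # measured accuracy drop

slope, intercept = fit_line(metric_on_labelled, observed_drop)

# At runtime only the metric is computed on the target domain; no labels needed.
target_metric = 1.0
print(predict_drop(slope, intercept, target_metric))
```

The line is fitted once per model from the labelled datasets. After that, predicting the drop on a new domain only requires computing the shift metric on unlabelled target examples.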

to_annotate_or_not animated gif image

This figure shows our method which employs a small set of evaluation datasets and best practices of domain shift detection methods to predict the actual drop in performance of machine learning models in production.

Our results on classification and sequence labelling show that our method is able to predict performance drops with an error rate as low as 2.15% and 0.89% for sentiment analysis and POS tagging respectively using one of our proposed modified metrics.

Mean absolute error (MAE) and max error (Max) image

Mean absolute error (MAE) and max error (Max) of the performance drop prediction (Lower is better)
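For reference, the two error measures in the figure can be computed as below. This is a generic sketch with invented numbers, assuming predicted and true drops are given as fractions of accuracy.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual performance drops."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def max_error(predicted, actual):
    """Worst-case absolute gap over all evaluated domain-shift scenarios."""
    return max(abs(p - a) for p, a in zip(predicted, actual))


# Invented predicted vs. actual drops for three target domains.
predicted = [0.05, 0.10, 0.20]
actual = [0.04, 0.13, 0.18]
print(mean_absolute_error(predicted, actual), max_error(predicted, actual))
```

The mean summarises typical accuracy of the prediction, while the max shows how badly it can miss on a single scenario.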

We also show that one can achieve adequate results with only a few annotated evaluation datasets.

annotated evaluation datasets image


To conclude, we propose a method that can directly predict, with high accuracy, the drop in performance that a model in production will suffer when exposed to unseen examples. Our method is cheap and can be applied at runtime, which gives it large potential to advance the decision-making process for machine learning in production and to dramatically reduce risks and maintenance costs.


1- Joaquin Quiñonero-Candela et al. 2009: Dataset shift in machine learning, The MIT Press.

2- Shai Ben-David et al. 2010: A theory of learning from different domains, Machine Learning.

3- Kimin Lee et al. 2018: Training Confidence-calibrated Classifiers for Detecting Out-of-Distribution Samples, ICLR.

4- Wei Fan et al. 2006: Reverse testing: an efficient framework to select amongst classifiers under sample selection bias, SIGKDD.

To Annotate or Not? Predicting Performance Drop under Domain Shift | EMNLP-IJCNLP2019

This article was first published on the 30th October 2019.