At least since the publication of the European AI Act proposal, you should be considering how to monitor high-risk AI systems running in production. In this article, we’ll introduce you to the concept, detection setups, and (un)supervised approaches for monitoring machine learning models. We’ll also talk about existing tools for concept drift detection.
Why Should You Care About Concept Drift?
Legislators and scholars have called for more transparency and explainability in AI for a while now. The result: a growing number of regulatory bodies are taking steps toward setting up certification standards and legislation to ensure safe AI applications. The International Organization for Standardization (ISO), the German standardization organization (DIN), and, most importantly, the European Commission with its European AI Act have begun to introduce new regulations. A vital requirement of these upcoming regulations will be mandatory model monitoring for AI systems running in production.
What Is Concept Drift?
Monitoring models in production is imperative since the performance of machine learning models tends to degrade over time, a phenomenon commonly referred to as model degradation. Models usually deteriorate in production for two possible reasons. First, there could be a mismatch between the data distribution of the training dataset and the real-world data. Second, the environment the model is deployed in is dynamic, which can cause the data distribution to change over time. A good example is the impact that COVID-19 had on supply chains and consumer behavior: models trained before the pandemic no longer fit due to the underlying change in the distribution of the input data.
The concept of the changing input is widely called concept drift and is the main factor contributing to model deterioration. Multiple influences can cause concept drift, such as degradation of measuring equipment, seasonal changes, changing user preferences, etc.
Different Types of Concept Drift
There is a substantial body of literature on concept drift, but the terms used vary widely. To give an overview, Bayram et al. extracted the following taxonomy of concept drift:
Concept drift can occur in multiple locations. It can affect …
… the underlying data distribution.
This is called a covariate shift. It occurs, for example, when the quality of the data samples gathered from a measuring device (e.g., a camera or sensor) degrades over time, causing the input data to shift accordingly.
… the label distribution.
This is called a prior probability shift. For example, a production line that decreases its fault rate will cause a quality assurance model to see fewer faulty parts over time.
… the posterior distribution.
This indicates a principal change in the underlying target concept. For example, as language is dynamic, the sentiment and even definition of words and phrases can change over time. Dialects are formed and forgotten over the years due to changes in the environment and socioeconomic shifts in society.
Effect of Concept Drift
Additionally, we can distinguish between the effects that different types of concept drifts have on the performance of a machine learning model. Concept drift can have …
… no effect on the predictive power of a machine learning model. This type of concept drift is known as virtual concept drift. If you consider a classification problem with a decision boundary between two classes, virtual concept drift is any dataset shift that happens where no sample crosses from one side of the decision boundary to the other.
… local effect, if the concept drift mechanism affects the model’s performance only on parts of the data distribution. For example, the data distribution of only one class in a dataset could change. Furthermore, some samples might completely change their underlying class, which is considered a severe concept drift. Consequently, the performance of a model on the affected parts of the data might worsen.
… global effect, if the whole data, label, or posterior distribution is affected at once. For example, this could happen in cases where the underlying measurement device is degrading, like in the case of a camera lens getting scratched and fogged over time. In this case, the performance of the ML model on all samples might be reduced.
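To make the virtual-drift case above concrete, here is a small sketch (a toy setup I am assuming for illustration: a one-dimensional classifier with a decision boundary at zero). Both class clusters shift away from the boundary, so the input distribution clearly changes while accuracy is untouched:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy classifier with a decision boundary at x = 0: class 1 iff x > 0.
predict = lambda x: (x > 0).astype(int)

# Original data: class 0 centered at -2, class 1 at +2.
x_before = np.concatenate([rng.normal(-2, 0.3, 500), rng.normal(2, 0.3, 500)])
y = np.concatenate([np.zeros(500), np.ones(500)]).astype(int)

# Virtual drift: both clusters move further away from the boundary,
# so the input distribution changes but no sample changes sides.
x_after = np.where(y == 0, x_before - 1.5, x_before + 1.5)

acc_before = (predict(x_before) == y).mean()
acc_after = (predict(x_after) == y).mean()
```

A data-distribution-based detector would flag this shift, even though model performance is unaffected.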
Occurrence of Concept Drift
So, you can differentiate between the underlying mechanisms of concept drift and its effect on model performance. But concept drift can also be classified by how it occurs in the temporal domain. Concept drift can occur …
… suddenly: Here the concept drift happens abruptly and persistently. For example, a change in manufacturing material can cause a quality assurance model to suddenly degrade in performance.
… recurrently: The concept changes continuously but always returns to its initial state. For example, seasonal effects might occur every year and influence the data distribution.
… gradually: The concept starts switching between an initial and a second state until it fully adapts to the second state. For example, one product is getting phased out and a new one is being introduced.
… incrementally: The concept drift increases continuously, with no sudden shifts. For example, a camera fogging up.
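These four temporal patterns are easy to simulate. The following sketch (a toy setup of my own: a one-dimensional Gaussian stream whose mean moves between two concepts) generates a stream for each pattern:

```python
import numpy as np

def drift_stream(kind, n=1000, mu_a=0.0, mu_b=3.0, seed=0):
    """Toy 1-D stream whose mean moves from concept mu_a to concept mu_b
    according to the given temporal drift pattern."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    if kind == "sudden":         # abrupt, persistent switch halfway through
        mean = np.where(t < n // 2, mu_a, mu_b)
    elif kind == "recurrent":    # drifts away and returns to the initial state
        mean = mu_a + (mu_b - mu_a) * 0.5 * (1 - np.cos(2 * np.pi * t / n))
    elif kind == "gradual":      # samples increasingly drawn from the new concept
        mean = np.where(rng.random(n) < t / n, mu_b, mu_a)
    elif kind == "incremental":  # continuous, monotone shift with no jumps
        mean = mu_a + (mu_b - mu_a) * t / n
    else:
        raise ValueError(f"unknown drift kind: {kind}")
    return mean + rng.normal(0, 0.5, n)
```

Streams like these are handy for sanity-checking a drift detector before pointing it at production data.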
How to Detect Concept Drift?
Now that we’ve laid the groundwork, let’s think about how to detect concept drift during model monitoring.
There is a big body of work on the detection of concept drift, and most detectors have a very similar structure. Research in this domain offers multiple approaches for the core building blocks: gathering data, modeling the data, determining a suitable score to compare distributions, and performing significance tests. The overall literature indicates that drift detectors are usually tailored to specific use cases, the most prominent being tabular data streams. Also, many methods are performance-based and require ground truth labels to work. However, unsupervised concept drift detection has also been receiving some attention in the past few years.
Concept Drift Detector Setup for Model Monitoring
For monitoring concept drift in your model, you need a proper drift detection setup. Let’s start with the first part of any concept drift detector: data accumulation.
1) Data accumulation: While some methods for concept drift detection in the supervised domain work on single samples, most methods use a windowing approach to accumulate the relevant data. For this we look at a window of data from a stream and compare its representation with the representation of a reference window.
The reference window…
…can be either fixed or moving. In the fixed reference frame setup, we compare our current window against a fixed data distribution of previously collected data. The reference frame can also be moving. For example, we can compare our current data distribution to that of a week ago.
The detection window…
…can be either batched or online. If the detection window is batched, we look at a batch of data at once and compare its distribution with the distribution of our reference frame. We can also consider an online detection window, where a stream of samples is continuously evaluated. In this case, we can implement the window as a first-in-first-out queue. So for each new sample coming in, we compare the distribution of the last n samples with the distribution of our reference frame. While online calculation can provide a more granular picture of the concept drift, the batched approach uses less computation.
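As a rough sketch, the fixed-reference plus online first-in-first-out setup described above might look like this (class and parameter names are illustrative, not taken from any specific library):

```python
from collections import deque
import numpy as np

class WindowedDetector:
    """Sketch of the data accumulation stage: a fixed reference window
    plus an online first-in-first-out detection window."""

    def __init__(self, reference, window_size=200):
        self.reference = np.asarray(reference)   # fixed reference distribution
        self.window = deque(maxlen=window_size)  # FIFO detection window

    def add_sample(self, x):
        # Appending to a full deque automatically drops the oldest sample.
        self.window.append(x)

    def current_windows(self):
        return self.reference, np.asarray(self.window)
```

A moving reference frame could be implemented the same way, with a second (longer) deque holding, say, last week’s data.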
2) Data modeling: Once you’ve accumulated data, it’s time to model it. In this step, the concept drift detector abstracts the retrieved data and extracts the key features that impact the system if they drift. This stage is often considered optional, as it mainly performs dimensionality reduction or sample size reduction to meet storage and online latency requirements.
3) Distribution comparison: After modeling the data, the third stage is used to measure the dissimilarity, or distance, between the reference window and the detection window. This is considered the most challenging part of concept drift detection, as defining an accurate and robust dissimilarity measurement is still an open question.
4) Significance testing: The last stage of your concept drift detection setup focuses on hypothesis tests. This way you can figure out whether the observed change is significant and whether an alarm should be triggered. This stage is needed to understand whether a change in the results of stage 3 is triggered by random sample bias or an actual concept drift.
Concept Drift Detector Approaches
Existing approaches for detecting concept drift can be classified into two different categories depending on their reliance on ground truth labels. In other words, they can be classified into supervised and unsupervised methods. Both categories have potential drawbacks for monitoring your model. Performance-based and error-based methods need ground truth labels to work and are therefore not always applicable in real-life scenarios, as it is not always possible to gather the ground truth data automatically. On the other hand, data-distribution-based methods cannot directly infer the performance of a model, since they might also flag virtual concept drifts that leave performance unaffected.
Supervised methods for monitoring concept drift include the largest class of concept drift detectors: performance-based methods. These methods usually trace the predictive sequential error to detect changes. The idea behind these methods is that the error reflects a change in the underlying distribution, as the learned relationship is no longer valid, resulting in concept drift. The biggest benefit of these methods is that they only handle concept drift when the performance of the machine learning model is affected. But keep in mind: They rely on quick feedback about their performance which is often not available in real-world use cases.
Wares et al. provide a categorization of supervised methods. The approaches are assigned to one of three groups: statistical methods, window-based methods, and ensemble-based methods.
The first set of methods describes approaches that use statistical tests, such as the Cumulative Sum and the Page–Hinckley Test. This group includes DDM, EDDM, and the McDiarmid drift detection method.
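To illustrate the statistical group, here is a minimal, self-contained version of the Page–Hinckley test (parameter values are illustrative defaults, not canonical ones). It watches a stream of error values and raises an alarm when their mean increases persistently:

```python
class PageHinkley:
    """Minimal Page-Hinckley test for detecting an increase in the mean
    of a stream, e.g. a model's running error rate."""

    def __init__(self, delta=0.005, threshold=50.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm threshold (lambda)
        self.mean = 0.0
        self.cumsum = 0.0
        self.min_cumsum = 0.0
        self.n = 0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n      # running mean
        self.cumsum += x - self.mean - self.delta  # cumulative deviation
        self.min_cumsum = min(self.min_cumsum, self.cumsum)
        # Alarm when the cumulative sum rises far above its historical minimum.
        return self.cumsum - self.min_cumsum > self.threshold
```

Feeding it a stream whose error rate jumps from 0.1 to 0.9 makes `update` start returning `True` shortly after the jump.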
The second class includes window-based methods. These techniques divide a data stream into sliding windows based on data size or time interval. They monitor the performance of the most recent data points and compare it to a reference. In this category, ADWIN is a widely used option.
The last category includes ensemble-based methods. These methods train an ensemble of base learners and monitor overall model performance by considering the accuracy of the ensemble members individually or on average (not to be confused with ensembles of drift detectors). Here, SEA (Streaming Ensemble Algorithm) is a widely used option that builds on the WMA (Weighted Majority Algorithm).
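As a toy illustration of the ensemble idea, the following sketch tracks per-member weights in the spirit of the Weighted Majority Algorithm; a member whose weight collapses signals degrading performance on recent data. This is a simplified sketch, not the actual SEA implementation:

```python
import numpy as np

class WeightedMajorityEnsemble:
    """WMA-style ensemble sketch: each base learner keeps a weight that is
    discounted whenever it errs, so a drop in a member's weight signals
    degrading performance, a possible symptom of concept drift."""

    def __init__(self, learners, beta=0.9):
        self.learners = learners  # callables: x -> 0/1 prediction
        self.weights = np.ones(len(learners))
        self.beta = beta          # penalty factor for mistakes

    def predict(self, x):
        votes = np.array([h(x) for h in self.learners])
        # Weighted majority vote between the two classes.
        return int(self.weights[votes == 1].sum() >= self.weights[votes == 0].sum())

    def update(self, x, y_true):
        votes = np.array([h(x) for h in self.learners])
        self.weights[votes != y_true] *= self.beta  # discount wrong members
```

Note that, like all supervised methods, this update step needs the ground truth label `y_true`.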
The second class of methods is unsupervised methods. Unsupervised concept drift monitoring methods make up only a fraction of the work in the field: while 95% of methods use some kind of supervision, only 5% of the approaches are able to work without it. Gemaque et al. provide an overview and classification of these methods based on how they do the windowing. Similar to the detection window topic above, they classify approaches into either batch or online techniques. In general, however, research into unsupervised methods seems to take a similar approach: the underlying data distribution is monitored to identify the points at which it experiences a significant change – hence these methods fall into the category of data-distribution-based methods.
Two recent unsupervised methods are NN-DVI and FAAD. NN-DVI uses a kNN approach to model the data, a distance function to accumulate density discrepancies, and finally a significance test. FAAD uses a feature selection method to better cope with multi-dimensional sequences, followed by an anomaly detection algorithm using random feature sampling. The anomaly scores are then compared to a user-defined threshold to determine when a sequence is an anomaly.
More recent work deals with the challenge of more complex data and models. Baier et al. borrow ideas from the active learning literature and propose to use the uncertainty of a neural network to detect concept drift. The idea is that as the concept drifts, the model becomes less confident in its predictions, which can be measured, for example, using Monte Carlo dropout. The very recent method STUDD takes a novel approach to unsupervised concept drift detection based on a student–teacher learning paradigm, where the reproduction error of the student network is used as a proxy for how far the current samples are out of distribution. It can be considered a simpler version of the Uninformed Students approach from the anomaly detection literature.
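The uncertainty-based idea can be sketched as follows. I assume a `stochastic_predict` function that keeps dropout active at inference time and returns class probabilities; the name and setup are illustrative, not from the cited papers:

```python
import numpy as np

def mc_dropout_uncertainty(stochastic_predict, x, n_passes=30):
    """Approximate predictive uncertainty in the Monte Carlo dropout style:
    run several stochastic forward passes and measure the entropy of the
    averaged class probabilities. `stochastic_predict` is assumed to keep
    dropout active at inference time."""
    probs = np.stack([stochastic_predict(x) for _ in range(n_passes)])
    mean_probs = probs.mean(axis=0)
    # Predictive entropy of the averaged distribution (higher = less confident).
    entropy = -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=-1)
    return mean_probs, entropy
```

For monitoring, one would track the average entropy over a detection window; a sustained rise suggests the model is seeing data it is less confident about, a possible sign of drift.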
Existing Drift Detection Tools
When dealing with concept drift in real-world scenarios, it is helpful to rely on existing tooling that implements some of the methods described in the literature. These two open-source projects can prove useful in this regard:
“Evidently helps analyze and track data and ML model quality throughout the model lifecycle. You can think of it as an evaluation layer that fits into the existing ML stack.”
Evidently has a modular approach with 3 interfaces on top of the shared analyzer functionality:
1. Interactive visual reports
2. Data and machine learning model profiling
3. Real-time machine learning model monitoring (under development)
Limitations: Only available for tabular data
“NannyML is an open-source python library that allows you to estimate post-deployment model performance (without access to targets), detect data drift, and intelligently link data drift alerts back to changes in model performance. Built for data scientists, NannyML has an easy-to-use interface, and interactive visualizations, is completely model-agnostic, and currently supports all tabular classification use cases.”
Limitations: tabular classification
When looking at existing concept drift research and tooling, one can see a couple of open questions and challenges:
- Missing real-world dataset benchmarks: There is a lack of real-world data stream benchmarks to validate and compare the methods proposed in the literature.
- Labels are usually not available: The majority of existing drift detection algorithms assume ground truth labels to be present after inference. However, very little research has been conducted to address unsupervised or semi-supervised drift detection and adaptation.
- What about complex models and data streams: There is a severe lack of methods to detect concept drift in more complex settings. Novel approaches such as STUDD and uncertainty-based concept drift detection could be utilized in these settings, however further research is still necessary.
In summary: There is a significant lack of methods to detect concept drift in complex data streams like images. On top of this, only a fraction of the literature looks at the problem of detecting concept drift without existing ground truth labels. Existing tooling focuses on simple tabular data streams and mostly on classification tasks. The reality is that in many real-life use cases, you don’t have access to ground truth labels and deal with complex data, e.g., in visual quality assurance. In these scenarios, there are only limited options when it comes to monitoring concept drift, which leads to models often just being retrained and redeployed after a certain amount of time.
This leads us to the question: Are there other more generic options when it comes to models without supervision? We’ll have a look at this question in our next article. Stay tuned for part two of this concept drift deep dive!