blog bg left
Back to Blog

Detecting and Fixing Data Drift in Computer Vision


If you have been working in Data Science and ML for a while you know that the models you have trained can perform very unexpectedly in a production environment. This is because the production data tends to be different from the closed train dataset you used to create the model. Additionally, the production data keeps changing over time so even models that initially perform well will degrade over time.

Image by the Author

What has been described above is known as data drift and it is very common in ML.

There have been plenty of articles explaining the concept in-depth but in this tutorial, we will focus on the practical part of data drift detection and fixing it on a computer vision example.

We will give you the overall explanation of the data drift monitoring system we have created here and for more details and full code, you can check this colab notebook.

Problem statement

Cat or Dog? — Image by the Author

In this case study, we will monitor a computer vision classifier that has been trained to distinguish between cats and dogs. The actual model that we have used has been trained using Keras and is just a simple CNN.

The data that will monitor will simulate one-week production performance so we have divided images into day batches (day_1, day_2, …, day7).

Additionally, we simulated the data drift in later days, so in the first batches, you will see just normal images of cats and dogs…

Whereas later days had some camera problems, some pixels got corrupted and images look more like this…

Camera problems — Image by the Author

Let’s see if we are able to detect it with our drift monitoring system.

How to detect data drift and get early alerts

We’ll use the open-source library, whylogs to profile our data and send the profiles to WhyLabs for data drift and ML performance monitoring.

Profiles created with whylogs are efficient, mergeable, and only contain essential summary statistics. This allows them to be used in almost any application, whether it requires processing data in batches or streams, or in sensitive industries like healthcare and finance. Read more about whylogs on GitHub.

WhyLabs makes it easy to store and monitor profiles created with whylogs to detect anomalies in machine learning pipelines such as data drift, data quality issues, and model bias.

We covered details of monitoring our computer vision data and model in our Colab notebook for this post but below you can see a preview of the WhyLabs dashboard.

Detecting Data Drift in WhyLabs — Image by Author

On the left-hand side of it, you can see different features that are being monitored (brightness, pixel height, pixel width, saturation, and more). Several of them have a red exclamation mark next to them. This suggests that our data has drifted in those dimensions. You can explore each feature in detail (the dashboard's center). The screenshot above shows values for brightness and you can see that values for this started to drift on the 4th day of drift monitoring and continued to do so on the fifth, sixth, and seventh days.

That would make sense, remember that we have applied some pixelation to images sent for later days.

Model drift detection with human annotation

We already have been informed that some of the data on later days look a bit different from the data on which our model was trained. But is the model’s performance affected?

In order to check it will need human annotation. We could do it ourselves but it is highly impractical and not scalable so we will use Toloka crowdsourcing platform.

In order to do it we first need to set up a labeling project where we specify instructions for labelers and design an interface. The latter can be done using the Template Builder facility in Toloka as shown below.

Task interface for classification project — Image by the Author

Once we have set up a project we can upload all photos that need annotations. We can do it programmatically using python and Toloka-Kit library. Remember that you have the full code needed to set up this system in this notebook.

Once your day batches are uploaded you should have them organized in a daily manner as seen below.

Image by the Author

Before the labeling starts we need to choose performers that will take part in the project. Because it is a simple task we do not need to do complex filtering techniques. We will allow annotators that speak English (as this is the language in which the instructions are written) to participate.

Additionally, we have set up some quality control rules such as CAPTCHAs and control tasks. This is important as crowdsourcing to be an effective tool needs rigid quality control rules. We also sent the same image to be labeled by three different annotators to have more confidence about each label.

When all annotators finish the work we can send this data (our gold standards) together with model predictions and analyze them further with WhyLabs.

Comparing model predictions with human annotations

We can now use WhyLabs to monitor machine learning performance metrics by comparing model prediction values and ground truth. Monitors can also be configured to detect if these metrics change.

Below you can see a dashboard with accuracy, precision, recall, f-score, and confusion matrix for our case study. Note that the accuracy of the model has fallen from around 80% to 40% on later days. It looks like the initial alerts we had from input features at the beginning of this case study were right and are now confirmed by the ground truth. Our model has drifted!

Monitoring Machine Learning Performance Metrics in WhyLabs — Image by Authors

Once we have discovered models drift the usual procedure would be to retrain the model on new examples to account for the changes in the environment. As we have already annotated the new data with Toloka we could now use it to retrain the model. Or if we think that the new sample is too small we would trigger the larger data labeling pipeline to gather more training examples.


This tutorial taught you how to set up an ML model drift monitoring system for a computer vision project. We used a simple classification example of cats and dogs but this case study can be easily extended to more complex projects.

Remember that you can check the full code in this notebook. Also, we had run a live workshop where we explained step by step what is happening at each stage of this pipeline. It is a long recording but worth a look if you are interested in a more detailed explanation.

I also would like to thank Sage Elliott who is the co-author of this article and Daniil Fedulov who coauthored with us the initial notebook.

Detecting and fixing data drift in Computer Vision was originally published by Towards Data Science.

Other posts

Glassdoor Decreases Latency Overhead and Improves Data Monitoring with WhyLabs

The Glassdoor team describes their integration latency challenges and how they were able to decrease latency overhead and improve data monitoring with WhyLabs.

Understanding and Monitoring Embeddings in Amazon SageMaker with WhyLabs

WhyLabs and Amazon Web Services (AWS) explore the various ways embeddings are used, issues that can impact your ML models, how to identify those issues and set up monitors to prevent them in the future!

Data Drift Monitoring and Its Importance in MLOps

It's important to continuously monitor and manage ML models to ensure ML model performance. We explore the role of data drift management and why it's crucial in your MLOps pipeline.

Ensuring AI Success in Healthcare: The Vital Role of ML Monitoring

Discover how ML monitoring plays a crucial role in the Healthcare industry to ensure the reliability, compliance, and overall safety of AI-driven systems.

WhyLabs Recognized by CB Insights GenAI 50 among the Most Innovative Generative AI Startups

WhyLabs has been named on CB Insights’ first annual GenAI 50 list, named as one of the world’s top 50 most innovative companies developing generative AI applications and infrastructure across industries.

Hugging Face and LangKit: Your Solution for LLM Observability

See how easy it is to generate out-of-the-box text metrics for Hugging Face LLMs and monitor them in WhyLabs to identify how model performance and user interaction are changing over time.

7 Ways to Monitor Large Language Model Behavior

Discover seven ways to track and monitor Large Language Model behavior using metrics for ChatGPT’s responses for a fixed set of 200 prompts across 35 days.

Safeguarding and Monitoring Large Language Model (LLM) Applications

We explore the concept of observability and validation in the context of language models, and demonstrate how to effectively safeguard them using guardrails.

Robust & Responsible AI Newsletter - Issue #6

A quarterly roundup of the hottest LLM, ML and Data-Centric AI news, including industry highlights, what’s brewing at WhyLabs, and more.
pre footer decoration
pre footer decoration
pre footer decoration

Run AI With Certainty

Book a demo