A Solution for Monitoring Image Data

As machine learning ecosystems become increasingly complex and data volumes grow at a dizzying rate, maintaining observability of your data and model health is more critical than ever. Data drift, concept drift, and data quality degradation can cause models to fail. In the worst cases, these are silent failures and can go undetected for months.

When working with tabular datasets, data quality can be monitored by capturing telemetry around missing values, cardinality, and data types for each of a dataset’s features. Data drift can be monitored by capturing descriptive statistics and approximate distributions of your data and computing statistical distances such as the Hellinger distance or KL divergence. However, it’s not obvious how a similar approach can be applied to image data. As we will see, monitoring unstructured data such as images can be achieved by capturing telemetry that is itself structured and therefore compatible with common statistical approaches.
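To make the statistical-distance idea concrete, here is a minimal sketch in Python with NumPy. It assumes drift is measured by histogramming a reference sample and a current sample on shared bin edges and comparing the two histograms with the Hellinger distance. The function names, the bin count, and the synthetic data are illustrative choices, not part of any particular library.

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize counts to probabilities
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def feature_drift(reference, current, bins=20):
    """Histogram both samples on shared bin edges, then compare the histograms."""
    edges = np.histogram_bin_edges(np.concatenate([reference, current]), bins=bins)
    ref_hist, _ = np.histogram(reference, bins=edges)
    cur_hist, _ = np.histogram(current, bins=edges)
    return hellinger_distance(ref_hist, cur_hist)

# Synthetic feature values: one batch from the same distribution, one shifted.
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)
same = rng.normal(loc=0.0, scale=1.0, size=5_000)
shifted = rng.normal(loc=1.5, scale=1.0, size=5_000)

print("same distribution:   ", feature_drift(reference, same))
print("shifted distribution:", feature_drift(reference, shifted))
```

The shifted batch should score noticeably higher than the batch drawn from the reference distribution. Where to set an alerting threshold depends on the feature and on how much variation is normal for it.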

Before moving forward, we must consider the types of problems that threaten computer vision systems. Once we’ve established this much, we can design a solution around these challenges.

Physical Factors

There are a variety of physical factors which can impact the consistency and quality of image data such as…

  • Device used
  • Device settings
  • Changes in environment
  • Changes in the object(s) being detected

Hardware is an obvious example. Different devices (or device settings) can impact the size and resolution of images, along with many other properties. Suppose a healthcare company upgrades to a medical imaging device with a higher resolution. While these images may make diagnoses easier for human doctors, computer vision systems trained on images generated by the original device may actually perform worse.

Physical changes in the environment pose another challenge. For example, consider a computer vision system responsible for performing inspections of products on an assembly line. If the factory begins using a different type of light bulb, this could impact the images in a way that the machine learning model was not prepared to handle. While this is not likely to throw a hard error, the model’s performance can plummet, unbeknownst to those responsible for the model.

Aside from hardware and lighting, countless other challenges can present themselves directly in the scene being imaged. There can be changes in the image background, changes in the size or number of target objects (in object detection), or target objects that are significantly different from anything encountered in the training set.

Images often have inconsistent lighting, backgrounds, image quality, object sizes, object counts, etc.

Data Pipeline Factors

Even if we can ensure that the raw images are consistently representative of our training data, these images often travel through a complex pipeline, which introduces many possible points of failure such as…

  • Swapped color channels
  • Inconsistent color spaces
  • Inconsistent scaling

Different image processing frameworks may read in color channels in a different order, causing a channel swap when migrating to a new tool. A newly introduced bug may result in inconsistencies in whether grayscale or color images are passed to our model or in the way that pixel values are scaled. Suppose we train a model that expects pixel values ranging from 0 to 255. If the pipeline begins scaling these values to the range 0.0 to 1.0, every pixel sits at the very bottom of the range the model expects, so the model is effectively being fed black, featureless images.

Data pipeline issues may cause inconsistencies in pixel value scaling, channel ordering, or the number of channels
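The pipeline failures described above can be simulated on a synthetic image to see what they do to simple statistics. This is a minimal sketch using NumPy only. The helper names and the "floats that never exceed 1.0" heuristic are assumptions made for illustration, not an established check.

```python
import numpy as np

def looks_unit_scaled(image):
    """Heuristic: float pixels that never exceed 1.0 were probably scaled to [0, 1]."""
    image = np.asarray(image)
    return bool(np.issubdtype(image.dtype, np.floating) and image.max() <= 1.0)

def channel_means(image):
    """Mean of each channel of a (height, width, channels) array."""
    return np.asarray(image, dtype=float).mean(axis=(0, 1))

# Synthetic reddish image in RGB order with 0-255 pixel values.
rgb = np.zeros((32, 32, 3), dtype=np.uint8)
rgb[..., 0] = 200  # red
rgb[..., 1] = 50   # green
rgb[..., 2] = 20   # blue

bgr = rgb[..., ::-1]                   # simulated channel swap (RGB read as BGR)
scaled = rgb.astype(np.float32) / 255  # simulated rescaling to 0.0-1.0

print(channel_means(rgb))  # red channel dominates
print(channel_means(bgr))  # the same values in reversed order
print(looks_unit_scaled(rgb), looks_unit_scaled(scaled))
```

The overall mean pixel value is identical before and after the channel swap, so a brightness metric alone would miss it. Per-channel statistics are what expose the problem.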

The Solution

In order to effectively monitor the issues described above, we need to compute metrics that are sensitive to these events. For example, calculating the mean pixel value of an image can serve as a measure of image brightness and can be leveraged to monitor for changes in image lighting. Most image processing tools can compute hue and saturation, which provide information about the color palette of an image. These quantities can be used to monitor for things like changes in image backgrounds or issues such as the swapping of color channels in an image. Image height and width can be monitored trivially by capturing the shape of the tensor representing the raw image data. The number of channels in this tensor can be used to infer the color space of an image (RGB, CMYK, grayscale).
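As a rough illustration of these metrics, the sketch below uses Pillow and NumPy to turn one image into a small, structured record. The `image_metrics` function and its field names are hypothetical. Pillow reports hue and saturation on a 0-255 scale, and because hue is a circular quantity, a plain mean of it is only a coarse summary.

```python
import numpy as np
from PIL import Image

def image_metrics(img):
    """Summarize a PIL image as a flat dict of structured telemetry."""
    rgb_img = img.convert("RGB")
    rgb = np.asarray(rgb_img, dtype=float)
    hsv = np.asarray(rgb_img.convert("HSV"), dtype=float)
    return {
        "width": img.width,
        "height": img.height,
        "channels": len(img.getbands()),   # e.g. 1 for grayscale, 3 for RGB
        "mode": img.mode,                  # Pillow's name for the pixel format
        "brightness_mean": float(rgb.mean()),
        "hue_mean": float(hsv[..., 0].mean()),
        "saturation_mean": float(hsv[..., 1].mean()),
    }

# A synthetic reddish image and a copy with its color channels reversed.
original = Image.new("RGB", (64, 48), (200, 50, 20))
swapped = Image.fromarray(np.ascontiguousarray(np.asarray(original)[..., ::-1]))

for name, img in [("original", original), ("swapped", swapped)]:
    print(name, image_metrics(img))
```

In this example, the brightness is the same for both images while the hue differs, which is why a channel swap is easier to catch with color metrics than with brightness alone. Collected over many images, each field becomes an ordinary numeric or categorical column that can be monitored like tabular data.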

In many cases, valuable metadata is available in the image file in the form of Exif data. Exif data often includes information such as the device make and model, the camera settings used when taking the photo, and the geolocation, date, and time associated with the image. Consider a model trained to identify plant species from images. Capturing geolocation from image Exif data can help ML engineers judge whether new images are likely to contain plant species that weren’t included in the training dataset.

Example of camera settings extracted from Exif data
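Reading Exif tags can be sketched with Pillow as follows, assuming a reasonably recent Pillow release. The `read_exif_tags` helper is hypothetical, and the example writes a tiny in-memory JPEG with made-up make and model values so it can run without a real photo. Only the base tags are read here. Detailed camera settings and GPS coordinates are usually stored in nested Exif directories that need a separate lookup.

```python
import io

from PIL import ExifTags, Image

def read_exif_tags(source):
    """Return the base Exif tags of an image as a {tag name: value} dict.

    `source` may be a file path or a binary file-like object.
    """
    with Image.open(source) as img:
        exif = img.getexif()
        # Map numeric tag ids to readable names; unknown ids are kept as-is.
        return {ExifTags.TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}

# Build a small JPEG in memory with illustrative (made-up) Exif values.
exif = Image.Exif()
exif[0x010F] = "ExampleCam"  # Make
exif[0x0110] = "Model X"     # Model
buffer = io.BytesIO()
Image.new("RGB", (8, 8)).save(buffer, format="JPEG", exif=exif)
buffer.seek(0)

tags = read_exif_tags(buffer)
print(tags.get("Make"), tags.get("Model"))
```

Values such as make and model are categorical, so they can be tracked with the same cardinality and frequency checks used for tabular features.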

Monitoring at Scale

Now that we know what kind of telemetry we wish to capture for images, how can we turn this into a full-fledged monitoring solution? The team at WhyLabs has developed an open source data logging library, whylogs, which was designed to capture valuable telemetry efficiently for any dataset. Customizability is a priority in its design, so whylogs can be integrated with any data pipeline, whether you’re working with image data, tabular data, text, or something else.

Furthermore, users can leverage powerful anomaly detection, informative visualizations, and automated notifications by uploading profiles to the WhyLabs AI Observatory for an end-to-end monitoring solution. Your first model is free!

“A Solution for Monitoring Image Data” was originally published by Open Data Science.
