Case study


We chose WhyLabs as an observability capability for our ML Platform because of the ease of integration and rich capabilities that enable us to meet Model Health Equity Governance guidelines and minimize time-to-insight across model operation tasks.

Engineering Director, Major Healthcare Provider

  • Faster realization of ROI
  • Faster resolution of ML incidents
  • Actionable insights


Monitoring image data for machine learning (ML) is important for any large healthcare company running medical image recognition and classification models. In this case study, we will explore how a large healthcare provider uses WhyLabs to monitor models that predict the likelihood of patients developing certain conditions, diagnose diseases, and identify potential treatments. These models are trained and evaluated on large datasets containing medical images, such as X-rays, CT scans, and MRIs.

The Challenge: Medical data privacy, sensitivity, and image size

From conversations with the provider's ML team, we identified several key challenges to monitoring their image data:

  • Medical images can be very large, making it difficult to store and manage the data effectively.
  • The quality of medical images can vary greatly, affecting the accuracy and performance of the ML model.
  • The privacy and security of medical images is of utmost importance, as they may contain sensitive patient information.
  • In addition to monitoring the model's overall performance, it was important for their team to monitor the model's fairness and bias by evaluating the model's performance across different subgroups of the data, such as different genders and races. If the model exhibits significant disparities in performance across these subgroups, it may be necessary to adjust the model or the data to address these issues.
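The subgroup fairness check described above can be sketched in a few lines of Python. This is a minimal illustration, not the customer's actual implementation; the record layout, function names, and the 10% disparity threshold are assumptions for the example.

```python
from collections import defaultdict

def subgroup_accuracy(records, group_key="gender"):
    """Compute accuracy per subgroup from records containing a group value,
    a model prediction, and a ground-truth label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        g = rec[group_key]
        total[g] += 1
        if rec["prediction"] == rec["label"]:
            correct[g] += 1
    return {g: correct[g] / total[g] for g in total}

def flag_disparity(acc_by_group, max_gap=0.10):
    """Flag if best- and worst-performing subgroups differ by more than max_gap."""
    gap = max(acc_by_group.values()) - min(acc_by_group.values())
    return gap > max_gap

records = [
    {"gender": "F", "prediction": 1, "label": 1},
    {"gender": "F", "prediction": 0, "label": 0},
    {"gender": "M", "prediction": 1, "label": 0},
    {"gender": "M", "prediction": 1, "label": 1},
]
acc = subgroup_accuracy(records)  # {"F": 1.0, "M": 0.5}
print(flag_disparity(acc))        # True: a 0.5 gap exceeds the 0.10 threshold
```

If the check fires, the team can drill into the flagged subgroup's data to decide whether the model or the dataset needs adjustment.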

Overall Setup

To effectively monitor their image data, it was essential to follow a structured approach that included the following:

  • Data collection and preprocessing: The first step in monitoring image data is to collect and preprocess the data. This typically involves acquiring the images from different sources, such as hospitals and clinics, and organizing them into a dataset. The dataset should be carefully curated to ensure that it contains a representative sample of the population and that the images are high-quality. In addition, the data should be preprocessed to extract relevant features and prepare it for training the ML model.
  • Model training and evaluation: The next step is to train and evaluate the ML model on the preprocessed image data. This typically involves selecting an appropriate model architecture and training algorithm and tuning the hyperparameters to achieve the best performance. The trained model should then be evaluated on a separate test dataset to assess its accuracy and performance.
  • Model deployment and monitoring: Once the ML model has been trained and evaluated, it can be deployed in a production environment. However, monitoring the model to ensure it continues to perform well over time is important. This can be done by regularly collecting and preprocessing new image data, and comparing the model's performance on the new data to its performance on the training and test datasets. If the model performance degrades significantly, it may be necessary to retrain the model on the updated data.
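The deployment-monitoring step above boils down to comparing the model's performance on fresh data against its performance at evaluation time. A minimal sketch of that comparison, assuming per-batch accuracy scores and an illustrative 5-point degradation threshold (not the customer's actual criteria):

```python
import statistics

def performance_degraded(reference_scores, new_scores, max_drop=0.05):
    """Return True if mean accuracy on new production data drops more than
    max_drop below the mean accuracy recorded at train/test time."""
    return statistics.mean(reference_scores) - statistics.mean(new_scores) > max_drop

reference = [0.91, 0.93, 0.92]   # per-batch accuracy at evaluation time
production = [0.84, 0.85, 0.83]  # per-batch accuracy on fresh data
print(performance_degraded(reference, production))  # True -> consider retraining
```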

The solution: Monitoring derived image metrics with whylogs and the WhyLabs AI Observability platform

There are two approaches to monitoring unstructured data: monitoring embeddings directly and monitoring derived metrics appropriate for the use case. Generic embeddings are easier to monitor out of the box, but it can be hard to use them to debug problems with models because embeddings are hard to interpret. Conversely, extracted metrics require more thoughtfulness to set up, but the result is that you can better understand what's happening with your data and more easily perform root cause analysis to debug problems with your model.

Image data, like other unstructured data, is commonly represented as high-dimensional embedding vectors, often paired with additional feature and metadata columns. However, images contain important structural information (such as correlations between nearby pixels and the interpretation of color channels) that is hard to discern from a generic embedding. Additionally, images in this case study were as large as 2000 x 2000 pixels across 3 color channels, so a generic per-pixel embedding would have over 12 million dimensions. For these reasons, generic embeddings were infeasible, and the customer elected to monitor derived image metrics.

The metrics are uploaded to the WhyLabs AI Observability platform, where the information is visualized and monitored over time.

Outcome: Meaningful and actionable insights from image data out-of-the-box

Using both whylogs and the WhyLabs AI Observability platform, the customer was able to immediately find valuable insights into their image data without any need for customization. WhyLabs' open source library, whylogs, automatically computes a number of metrics for image data. These include the following:

  • Image dimensions (pixel width, pixel height)
  • Brightness (mean, standard deviation)
  • Hue (mean, standard deviation)
  • Saturation (mean, standard deviation)
  • Colorspace (e.g. RGB, HSV, CMYK)
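To make these metrics concrete, here is a hand-rolled sketch of how they can be derived from raw pixel values using only the Python standard library. whylogs computes these automatically; this example simply illustrates what "brightness", "hue", and "saturation" statistics mean (brightness is taken as the HSV value channel, an assumption for this sketch).

```python
import colorsys
import statistics

def image_metrics(pixels, width, height, colorspace="RGB"):
    """Compute derived metrics from a flat list of (r, g, b) tuples in [0, 255]."""
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    hues = [h for h, s, v in hsv]
    sats = [s for h, s, v in hsv]
    brightness = [v for h, s, v in hsv]  # HSV value channel as brightness

    def mean_std(xs):
        return statistics.mean(xs), statistics.pstdev(xs)

    return {
        "width": width,
        "height": height,
        "colorspace": colorspace,
        "brightness": mean_std(brightness),
        "hue": mean_std(hues),
        "saturation": mean_std(sats),
    }

# A tiny 2x1 "image": one mid-gray pixel and one pure red pixel.
m = image_metrics([(128, 128, 128), (255, 0, 0)], width=2, height=1)
print(m["brightness"])  # (mean, std) of the brightness channel
```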

Outcome: Ability to quickly augment with additional metrics specific to different teams and model use cases

Due to the customer's variety of images and modeling tasks, it was important to have the flexibility to extend and change the metrics captured about the images, features, and additional metadata over time and for each unique modeling task.

The customer can calculate additional derived metrics as needed for specific use cases. These include collecting derived metrics for bounding boxes or image quadrants, EXIF image metadata, edge detection, gradient orientation (HOG) features, and more using our Python, Spark, and Java versions of whylogs.
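As one example of such a custom metric, edge strength can be summarized with a mean Sobel gradient magnitude over a grayscale image. The sketch below is illustrative only (the customer's actual derived metrics are not specified in this detail), and operates on a plain 2D list of pixel values:

```python
import math

def mean_gradient_magnitude(gray):
    """Mean Sobel gradient magnitude over a 2D grayscale grid (list of rows).
    A simple edge-strength metric: higher values mean more edge content."""
    h, w = len(gray), len(gray[0])
    mags = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal and vertical Sobel responses at (x, y)
            gx = (gray[y-1][x+1] + 2*gray[y][x+1] + gray[y+1][x+1]
                  - gray[y-1][x-1] - 2*gray[y][x-1] - gray[y+1][x-1])
            gy = (gray[y+1][x-1] + 2*gray[y+1][x] + gray[y+1][x+1]
                  - gray[y-1][x-1] - 2*gray[y-1][x] - gray[y-1][x+1])
            mags.append(math.hypot(gx, gy))
    return sum(mags) / len(mags)

flat = [[10] * 4 for _ in range(4)]           # uniform image: no edges
edged = [[0, 0, 255, 255] for _ in range(4)]  # sharp vertical edge
print(mean_gradient_magnitude(flat))   # 0.0
print(mean_gradient_magnitude(edged))  # large positive value
```

A metric like this, profiled per batch, lets the team catch blurring or focus problems across incoming images without inspecting them individually.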

Outcome: Discover unknown patterns in image data and cut down resolution time on model performance issues

Before WhyLabs, the customer relied on manual exploration of the images in an attempt to diagnose variations in image quality. However, this process was time-intensive and surfaced only a handful of notable images, and it was infeasible to generalize insights about those individual images to broader trends in data quality and model performance.

With WhyLabs, the customer can monitor semantically meaningful metrics about their image data over time, perform root cause analysis, and identify issues present in the image data, including differences in brightness and hue across capture devices and changes in image dimensions that caused a drop in model performance.
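The capture-device issue above is a typical root cause analysis: group a derived metric such as brightness by capture device and flag devices that deviate from the fleet. A minimal sketch, with illustrative device names and an assumed 0.15 absolute deviation threshold on brightness normalized to [0, 1]:

```python
import statistics

def flag_deviating_devices(brightness_by_device, max_diff=0.15):
    """Flag capture devices whose mean image brightness differs from the
    fleet-wide mean by more than max_diff (brightness in [0, 1])."""
    means = {d: statistics.mean(v) for d, v in brightness_by_device.items()}
    overall = statistics.mean(means.values())
    return sorted(d for d, m in means.items() if abs(m - overall) > max_diff)

samples = {
    "scanner_a": [0.48, 0.52, 0.50],
    "scanner_b": [0.51, 0.49, 0.50],
    "scanner_c": [0.88, 0.92, 0.90],  # much brighter capture device
}
print(flag_deviating_devices(samples))  # ['scanner_c']
```

Once a device is flagged, the team can decide whether to recalibrate the device, normalize its images in preprocessing, or exclude its data from training.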

