
Visually Inspecting Data Profiles for Data Distribution Shifts

The real world is a constant source of ever-changing and non-stationary data. That ultimately means that even the best ML models will eventually go stale. Data distribution shifts, in all of their forms, are a major post-production concern for any ML or data practitioner.

Left unaddressed, distribution shifts can significantly degrade performance over time and even render the model unusable. In a production environment, where there are large volumes of often sensitive data, detecting and diagnosing these issues can become especially challenging.

Agenda

In this tutorial, we will see how to inspect data for distribution shifts by comparing distribution metrics and by applying statistical tests to calculate drift values.

We’ll also learn how to leverage data logging to enable the processing of large volumes of data while addressing privacy concerns in order to inspect for data distribution shift issues in a production environment.

For this tutorial, you’ll only need a Python 3 environment, either on your own device or in the cloud, such as a Google Colab Notebook.

But what is Data Distribution Shift anyway?

Unlike traditional software, a supervised machine learning model tends to lose performance over time, a phenomenon known as model decay or model degradation. One of the most common causes of model decay is a change in the distribution of production data relative to the data you used to test and validate your model. This can happen in different ways, such as changes in the distribution of input data, output data, or even changes in the relationship between the input and output.

Why is this a problem?

This is a problem because, ultimately, distribution shifts can affect the performance of your model, leading to all sorts of negative impacts on your organization. Ideally, your model’s performance should be constantly monitored. Still, performance results are not always readily available, because the required ground truth might be missing or might only arrive after a delay. In those cases, you can use the data you have as a signal or proxy for your model’s performance.

Let’s see a practical example of how we can inspect and detect distribution shifts with a simple case study.

Case Study: Covariate Shift with Wine Quality Dataset

As a case study, let’s use UCI’s Wine Quality Dataset. The goal of this task is to model wine quality based on physicochemical tests. This can be viewed as a classification task, where we predict the wine’s quality based on its features, like pH, density, and percent alcohol content.

To create a distribution shift scenario, we will split the available dataset into two groups: wines with alcohol content (the alcohol feature) below and above 11. The first group is our baseline (or reference) dataset, while the second will be our target dataset. This is an example of covariate shift, one of the possible types of data distribution shift: the input distribution changes between our reference and target datasets, but the relationship between the input and output doesn’t change. Since we’re only concerned with changes in the input data, we’ll skip model training altogether and focus on the input features.
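The split described above can be sketched on a tiny stand-in dataframe. The values below are made up for illustration, the `toy_` names are hypothetical, and placing wines with an alcohol content of exactly 11 in the target group is an assumption, since the text does not say which side the boundary falls on.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wine data (values are made up).
toy_wine_df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.0, 11.2, 12.8],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.30, 3.39],
})

# Threshold taken from the text; treating exactly 11 as "target" is an assumption.
ALCOHOL_THRESHOLD = 11

# Reference (baseline) group: lower-alcohol wines.
toy_reference_df = toy_wine_df[toy_wine_df["alcohol"] < ALCOHOL_THRESHOLD]
# Target group: higher-alcohol wines, whose inputs follow a shifted distribution.
toy_target_df = toy_wine_df[toy_wine_df["alcohol"] >= ALCOHOL_THRESHOLD]

print(toy_reference_df["alcohol"].mean(), toy_target_df["alcohol"].mean())
```

Because the split is made on an input feature only, any difference between the two groups is a difference in the inputs, which is what makes this a covariate shift scenario.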

The example used here was inspired by the article A Primer on Data Drift. If you’re interested in more information on this use case, or the theory behind Data Drift, it’s a great read!

Loading the dataframes

Let’s first load the dataframes. They are already preprocessed and split into target and reference dataframes.

import pandas as pd

target_url = "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/wine_target.csv"
reference_url = "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/wine_reference.csv"

# Read each CSV directly from its URL into a pandas dataframe
target_df = pd.read_csv(target_url)
reference_df = pd.read_csv(reference_url)

Data logging with whylogs

In a production setting, we need ways of monitoring data that are scalable and efficient. For a number of reasons, such as storage requirements or privacy concerns, using raw data for debugging/monitoring purposes might not be feasible.

For this reason, we’ll leverage data logging to generate statistical summaries of our data, which we can then use to track changes in our dataset, ensure data quality and visualize key summary statistics. In whylogs, these statistical summaries are called profiles, which we’ll use to visualize the effect of covariate shift in our data.

First of all, we can install whylogs:

pip install whylogs

Let’s first create a profile of our target dataframe:

import whylogs as why
results = why.log(target_df)
profile = results.profile()

We can keep updating the profile by logging additional data, but for the moment let’s generate a Profile View from it, in order to continue our inspection:

profile_view = profile.view()

The profile_view is a lightweight statistical fingerprint of your dataset, which can be stored for later use or sent over to monitoring platforms. It will provide you with valuable statistics on a column (feature) basis, such as:

  • Counters, such as number of samples and null values
  • Inferred types, such as integral, fractional, and boolean
  • Estimated Cardinality
  • Frequent Items
  • Distribution Metrics: min, max, median, quantile values

target_summary = profile_view.to_pandas()
target_summary

Let’s do the same for our reference dataframe:

result_ref = why.log(pandas=reference_df)
profile_reference = result_ref.profile()
profile_view_reference = profile_reference.view()

whylogs has many other features that are out of scope for this short tutorial but are worth mentioning. One of these is that the generated profiles are mergeable, which means they can be combined with other profiles. In streaming systems, profiles can be captured over a mini-batch and merged into different time granularities of data without losing statistical accuracy. This is made possible with a technique called data sketching, pioneered by Apache DataSketches. If you want to know more about this and other aspects of whylogs, feel free to check out our open-source repository!
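To make the idea of mergeability concrete, here is a minimal toy sketch. It is not how whylogs or Apache DataSketches work internally: the `ToySummary` class is hypothetical and tracks only a count, sum, minimum and maximum, which happen to merge exactly. Real sketches extend the same idea to quantiles, cardinality and frequent items.

```python
from dataclasses import dataclass


@dataclass
class ToySummary:
    """Hypothetical mergeable summary of one numeric column."""
    count: int
    total: float
    minimum: float
    maximum: float

    @classmethod
    def from_values(cls, values):
        # Summarize one non-empty mini-batch of raw values.
        values = list(values)
        return cls(len(values), sum(values), min(values), max(values))

    def merge(self, other):
        # Combine two summaries without needing the raw values again.
        return ToySummary(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def mean(self):
        return self.total / self.count


# Made-up alcohol readings arriving in two mini-batches.
batch_1 = [9.5, 10.0, 10.5]
batch_2 = [11.5, 12.0]

merged = ToySummary.from_values(batch_1).merge(ToySummary.from_values(batch_2))
full = ToySummary.from_values(batch_1 + batch_2)
print(merged == full, merged.mean)
```

The merged summary matches the one computed over all the data at once, which is the property that lets per-batch profiles be rolled up into hourly or daily ones.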

Inspecting and comparing distributions with the Profile Visualizer

There are a number of ways we could inspect our profiles for data shifts and other data quality issues. The first one is to simply compare distribution metrics like mean, median or quantile values, which could be done with the Profile View obtained in the previous section. The downside of this approach is that there might be real cases of data shift that are not perceived by simply inspecting these metrics.
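As a small illustration of that downside, the sketch below uses two made-up samples that share the same mean and median yet are clearly spread differently. The sample values and the `summarize` helper are hypothetical and only use Python's standard library.

```python
import statistics

# Made-up values: identical mean and median, very different spread.
reference_values = [4, 5, 5, 5, 6]
shifted_values = [1, 3, 5, 7, 9]


def summarize(values):
    """Return a few distribution metrics for one sample."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
    }


reference_metrics = summarize(reference_values)
shifted_metrics = summarize(shifted_values)

for name in ("mean", "median", "stdev"):
    print(name, reference_metrics[name], shifted_metrics[name])
```

Comparing only the mean and median would suggest nothing has changed here, while the standard deviation shows the second sample is far more dispersed.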

There are some other approaches we can take. Let’s go through them by using whylogs’ visualization module, the Notebook Profile Visualizer.

To do so, we can start by instantiating a visualizer, and setting the target and reference profiles obtained in the previous section:

from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.set_profiles(
    target_profile_view=profile_view,
    reference_profile_view=profile_view_reference,
)

Applying Statistical Tests

Instead of simply comparing distribution metrics, we can quantitatively measure data shift (or drift) by applying two-sample hypothesis testing. We can use these tests in order to compare two different sets of data to verify if both come from a common underlying distribution.

There are a number of different methods for this purpose. In this tutorial, we will use two of the most popular ones: the Kolmogorov-Smirnov (K-S) test for numerical features and the chi-squared test for categorical features. We can do so by simply generating a Summary Drift Report, which will yield the p-values for all of the features common to both profiles, alongside other useful information, such as overall metrics, side-by-side histograms, and distribution metrics:

visualization.summary_drift_report()

The null hypothesis is that the samples are drawn from the same distribution, which means that a low p-value is indicative of different distributions. In this example, we see that drift was detected for all of our features.
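To show what a two-sample K-S test measures, here is a minimal standard-library sketch that works on raw values. It is an illustration under stated assumptions, not the whylogs implementation, which works from profiles rather than raw data. The sample values are made up, the p-value uses the large-sample (asymptotic) approximation, and the 0.05 significance level is an assumed, commonly used threshold.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (the K-S statistic D)."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the gap can only change at observed values
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_p_value(d, n, m, terms=100):
    """Asymptotic p-value for D; an approximation suited to larger samples."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:  # the series is numerically unstable near zero; p is ~1 here
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2 * total))


# Made-up alcohol readings, chosen so that the two groups do not overlap.
reference_alcohol = [9.0 + 0.04 * i for i in range(50)]
target_alcohol = [11.0 + 0.06 * i for i in range(50)]

ALPHA = 0.05  # assumed significance level

d_stat = ks_statistic(reference_alcohol, target_alcohol)
p_value = ks_p_value(d_stat, len(reference_alcohol), len(target_alcohol))
drift_detected = p_value < ALPHA  # a low p-value rejects "same distribution"
print(d_stat, drift_detected)
```

Since the two made-up samples never overlap, the empirical CDFs are as far apart as they can be, the p-value is tiny, and the null hypothesis of a common distribution is rejected.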

What’s Next

In addition to statistical tests, there are other approaches you can take to tackle distribution shifts, such as visually inspecting histograms and distribution charts for individual features, which can help confirm the disparity between distributions. More generally, rule-based data validation is key to ensuring the quality of your data, including catching distribution changes, whether they come from external factors or from systemic problems such as pipeline errors or missing data.

References

“Visually Inspecting Data Profiles for Data Distribution Shifts” was originally published by Open Data Science.
