
Visually Inspecting Data Profiles for Data Distribution Shifts

The real world is a constant source of ever-changing and non-stationary data. That ultimately means that even the best ML models will eventually go stale. Data distribution shifts, in all of their forms, are a major post-production concern for any ML or data practitioner.

Distribution shift issues, if unaddressed, can mean significant performance degradation over time and even turn the model downright unusable. In a production environment, where there are large volumes of often sensitive data, detecting and diagnosing these issues can become especially challenging.

Agenda

In this tutorial, we will see how to inspect data for distribution shift issues by comparing distribution metrics and applying statistical tests to calculate drift values.

We’ll also learn how to leverage data logging to process large volumes of data while addressing privacy concerns, so that we can inspect for data distribution shift in a production environment.

For this tutorial, you’ll need only a Python 3 environment, either on your own device or on a cloud device, such as a Google Colab Notebook.

But what is Data Distribution Shift anyway?

Unlike in traditional software development, the performance of your supervised machine learning model will degrade with time, which is known as model decay or model degradation. One of the most common causes of model decay is a change in the distribution of the data seen in production compared to the data you used to test and validate your model. This can happen in different ways, such as changes in the distribution of the input data, the output data, or even in the relationship between the input and the output.

Why is this a problem?

This is a problem because, ultimately, distribution shifts can affect the performance of your model, leading to all sorts of negative impacts on your organization. Ideally, your model’s performance should be monitored constantly. Still, performance results are not always readily available, because the required ground truth might not exist or might only arrive with a delay. In those cases, you can use the data itself as a signal, or proxy, for your model’s performance.

Let’s see a practical example of how we can inspect and detect distribution shifts with a simple case study.

Case Study: Covariate Shift with Wine Quality Dataset

As a case study, let’s use UCI’s Wine Quality Dataset. The goal of this task is to model wine quality based on physicochemical tests. This can be viewed as a classification task, where we predict the wine’s quality based on its features, like pH, density, and percent alcohol content.

In order to create a scenario of distribution shift, we will split the available dataset into two groups: wines with alcohol content (the alcohol feature) below 11 and above 11. The first group is considered our baseline (or reference) dataset, while the second will be our target dataset. This is an example of covariate shift, one of the possible types of data distribution shift: the input distribution changes between our reference and target datasets, but the relationship between the input and output doesn’t change. Since we’re only concerned with changes in the input data, we’ll skip model training altogether and focus on the input features.
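For reference, the split itself could be produced with a few lines of pandas. This is just a sketch under assumptions: the raw UCI file location and separator below are not part of this tutorial (the already-split dataframes are downloaded directly in the next section), and the red wine subset is used as an example:

import pandas as pd

# Assumed location of the raw red wine quality data on the UCI repository;
# the file is semicolon-separated.
raw_url = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "wine-quality/winequality-red.csv"
)
wine_df = pd.read_csv(raw_url, sep=";")

# Simulate covariate shift by splitting on the alcohol feature:
# the reference (baseline) group holds wines below 11% alcohol,
# the target group holds the remaining wines.
reference_split = wine_df[wine_df["alcohol"] < 11]
target_split = wine_df[wine_df["alcohol"] >= 11]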

The example used here was inspired by the article A Primer on Data Drift. If you’re interested in more information on this use case, or the theory behind Data Drift, it’s a great read!

Loading the dataframes

Let’s first download the dataframes. They are already preprocessed and split into target and reference dataframes.

import pandas as pd

target_url = "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/wine_target.csv"
reference_url = "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/wine_reference.csv"

target_df = pd.read_csv(target_url)
reference_df = pd.read_csv(reference_url)

Data logging with whylogs

In a production setting, we need ways of monitoring data that are scalable and efficient. For a number of reasons, such as storage requirements or privacy concerns, using raw data for debugging/monitoring purposes might not be feasible.

For this reason, we’ll leverage data logging to generate statistical summaries of our data, which we can then use to track changes in our dataset, ensure data quality and visualize key summary statistics. In whylogs, these statistical summaries are called profiles, which we’ll use to visualize the effect of covariate shift in our data.

First of all, we can install whylogs:

pip install whylogs

Let’s first create a profile of our target dataframe:

import whylogs as why
results = why.log(target_df)
profile = results.profile()

We can keep updating the profile by logging additional data, but for the moment let’s generate a Profile View from it, in order to continue our inspection:

profile_view = profile.view()
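For completeness, updating the profile with additional data before generating the view would look something like this. This is a minimal sketch: new_batch_df is a hypothetical stand-in for a newly arrived batch, and the track(pandas=...) call is an assumption based on the whylogs v1 profile API:

new_batch_df = target_df.sample(100, random_state=0)  # stand-in for a new batch of data
profile.track(pandas=new_batch_df)  # the profile now summarizes both batches
# calling profile.view() again would return a view with the updated statistics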

The profile_view is a lightweight statistical fingerprint of your dataset, which can be stored for later use or sent over to monitoring platforms. It will provide you with valuable statistics on a column (feature) basis, such as:

  • Counters, such as number of samples and null values
  • Inferred types, such as integral, fractional, and boolean
  • Estimated Cardinality
  • Frequent Items
  • Distribution Metrics: min, max, median, quantile values
We can inspect these statistics as a pandas DataFrame:

target_summary = profile_view.to_pandas()
target_summary
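Beyond the full summary table, individual column metrics can also be pulled out of the Profile View. A minimal sketch, assuming the get_column, get_metric, and to_summary_dict accessors behave as in recent whylogs versions:

# Inspect the distribution metric of a single feature from the Profile View
alcohol_column = profile_view.get_column("alcohol")
distribution = alcohol_column.get_metric("distribution")
print(distribution.to_summary_dict())  # mean, stddev, min, max, quantiles, ...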

Let’s do the same for our reference dataframe:

result_ref = why.log(pandas=reference_df)
profile_reference = result_ref.profile()
profile_view_reference = profile_reference.view()

There are a lot of other exciting features of whylogs that are out of scope for this short tutorial but are worth mentioning. One of these is the fact that the generated profiles are mergeable, which means they can be combined with other profiles. In streaming systems, profiles can be captured over a mini-batch and merged into different time granularities of data without losing statistical accuracy. This is made possible with a technique called data sketching, pioneered by Apache DataSketches. If you want to know more about this and other aspects of whylogs, feel free to check out our open-source repository!
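As a quick illustration of mergeability, two Profile Views can be combined into one without going back to the raw data. This is purely illustrative, since merging the target and reference profiles is not something we need for the comparison below:

# Merge two profile views into a single statistical summary.
# The merged view is equivalent to having profiled all the data at once.
merged_view = profile_view.merge(profile_view_reference)
merged_view.to_pandas()  # counts and metrics now reflect both datasets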

Inspecting and comparing distributions with the Profile Visualizer

There are a number of ways we could inspect our profiles for data shifts and other data quality issues. The first one is to simply compare distribution metrics like mean, median, or quantile values, which could be done with the Profile View obtained in the previous section. The downside of this approach is that real cases of data shift can go unnoticed when only these metrics are inspected.
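As a sketch of that first, metric-level comparison, the summary dataframes from both Profile Views can be placed side by side (the distribution/... column names follow the whylogs to_pandas layout and may vary between versions):

# Compare a few distribution metrics between the target and reference profiles
target_summary = profile_view.to_pandas()
reference_summary = profile_view_reference.to_pandas()

metrics = ["distribution/mean", "distribution/stddev", "distribution/median"]
target_summary[metrics].join(
    reference_summary[metrics], lsuffix="_target", rsuffix="_reference"
)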

There are some other approaches we can take. Let’s go through them by using whylogs’ visualization module, the Notebook Profile Visualizer.

To do so, we can start by instantiating a visualizer, and setting the target and reference profiles obtained in the previous section:

from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=profile_view,
                           reference_profile_view=profile_view_reference)

Applying Statistical Tests

Instead of simply comparing distribution metrics, we can quantitatively measure data shift (or drift) by applying two-sample hypothesis tests. These tests compare two sets of data to verify whether both come from a common underlying distribution.

There are a number of different methods for this purpose. In this tutorial, we will use two of the most popular ones: the Kolmogorov-Smirnov (K-S) test for numerical features and the chi-squared test for categorical features. We can do so by simply generating a Summary Drift Report, which will yield the p-values for all of the features shared between the two profiles, alongside other useful information such as overall metrics, side-by-side histograms, and distribution metrics:

visualization.summary_drift_report()

The null hypothesis is that the samples are drawn from the same distribution, which means that a low p-value is indicative of different distributions. In this example, we see that drift was detected for all of our features.
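For intuition about what the report computes, here is a rough illustration of the two-sample K-S test applied directly to one of the raw columns with scipy. Note that this is not how whylogs computes drift internally (it works from the profiles rather than the raw data), and it assumes a pH column is present in both dataframes:

from scipy.stats import ks_2samp

# Two-sample Kolmogorov-Smirnov test on a single numerical feature
statistic, p_value = ks_2samp(target_df["pH"], reference_df["pH"])
print(f"K-S statistic: {statistic:.3f}, p-value: {p_value:.3g}")
# A very small p-value rejects the null hypothesis that both samples
# come from the same underlying distribution.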

What’s Next

In addition to statistical tests, there are other approaches you can take to tackle distribution shifts, such as visually inspecting histograms and distribution charts for individual features, which can be useful to confirm the disparity between distributions. More generally, setting up rule-based data validation is key to ensuring the quality of your data, including catching distribution changes, whether they come from external factors or from systemic issues such as pipeline errors or missing data.
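Since the visualizer has already been set up with both profiles, these per-feature charts can be rendered directly from the profile views. A small sketch, with the caveat that the method names below (double_histogram and distribution_chart) should be checked against your installed whylogs version:

# Overlaid histograms of the target and reference profiles for a numerical feature
visualization.double_histogram(feature_name="pH")

# For categorical features, a distribution chart serves the same purpose:
# visualization.distribution_chart(feature_name="some_categorical_feature")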

References

“Visually Inspecting Data Profiles for Data Distribution Shifts” was originally published by Open Data Science.
