WhyLabs AI Control Center (also known as the WhyLabs Platform) is now an open source project!

Felipe Adachi

Jun 28, 2022

Back to Blog

Visually Inspecting Data Profiles for Data Distribution Shifts

Whylogs
ML Monitoring
Open Source

Felipe Adachi

Jun 28, 2022

The real world is a constant source of ever-changing and non-stationary data. That ultimately means that even the best ML models will eventually go stale. Data distribution shifts, in all of their forms, are a major post-production concern for any ML or data practitioner.

Distribution shift issues, if unaddressed, can mean significant performance degradation over time and even turn the model downright unusable. In a production environment, where there are large volumes of often sensitive data, detecting and diagnosing these issues can become especially challenging.

Agenda

In this tutorial, we will see how we can inspect data for distribution shift issues by comparing distribution metrics and applying statistical tests for drift values calculations.

We’ll also learn how to leverage data logging to enable the processing of large volumes of data while addressing privacy concerns in order to inspect for data distribution shift issues in a production environment.

For this tutorial, you’ll need only a Python 3 environment, either on your own device or on a cloud device, such as a Google Colab Notebook.

But what is Data Distribution Shift anyway?

Unlike in traditional software development, the performance of your supervised machine learning model will degrade with time, which is known as model decay, or model degradation. One of the most common causes of model decay is due to changes in the distribution of data during production when compared to the data you used to test and validate your model. This can happen in different ways, such as changes in the distribution of input data, output data, or even changes in the relationship between the input and output.

Why is this a problem?

This is a problem because, ultimately, distribution shifts can affect the performance of your model, leading to all sorts of negative impacts on your organization. Ideally, your model’s performance should be constantly monitored. Still, it’s not always easy to have performance results readily available, because the required ground truth to do so might not be available or, if it is, it might come in a delayed fashion. In those cases, you can use the data you have as a signal or proxy for your model’s performance.

Let’s see a practical example of how we can inspect and detect distribution shifts with a simple case study.

Case Study: Covariate Shift with Wine Quality Dataset

As a case study, let’s use UCI’s Wine Quality Dataset. The goal of this task is to model wine quality based on physicochemical tests. This can be viewed as a classification task, where we predict the wine’s quality based on its features, like pH, density, and percent alcohol content.

In order to create a scenario of distribution shift, we will split the available dataset into two groups: wines with alcohol content (alcohol feature) below and above 11. The first group is considered our baseline (or reference) dataset, while the second will be our target dataset. This is an example of a covariate shift, one of the possible types of data distribution shifts, and it means that the input distribution changes between our reference and profile datasets, but the relationship between the input and output doesn’t change. Since we’re only concerned about changes in the input data, we’ll skip the model training altogether and focus on the input features.

The example used here was inspired by the article A Primer on Data Drift. If you’re interested in more information on this use case, or the theory behind Data Drift, it’s a great read!

Loading the dataframes

Let’s first download the dataframes. They are already preprocessed and split into target and reference dataframes.

import pandas as pd
target_url =
"https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/wine_target.csv"
reference_url =
"https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/wine_reference.csv"
target_df = pd.read_csv(target_url)
reference_df = pd.read_csv(reference_url)

Data logging with whylogs

In a production setting, we need ways of monitoring data that are scalable and efficient. For a number of reasons, such as storage requirements or privacy concerns, using raw data for debugging/monitoring purposes might not be feasible.

For this reason, we’ll leverage data logging to generate statistical summaries of our data, which we can then use to track changes in our dataset, ensure data quality and visualize key summary statistics. In whylogs, these statistical summaries are called profiles, which we’ll use to visualize the effect of covariate shift in our data.

First of all, we can install whylogs:

pip install whylogs

Let’s first create a profile of our target dataframe:

import whylogs as why
results = why.log(target_df)
profile = results.profile()

We can keep updating the profile by logging additional data, but for the moment let’s generate a Profile View from it, in order to continue our inspection:

profile_view = profile.view()

The profile_view is a lightweight statistical fingerprint of your dataset, which can be stored for later use or sent over to monitoring platforms. It will provide you with valuable statistics on a column (feature) basis, such as:

Counters, such as number of samples and null values
Inferred types, such as integral, fractional, and boolean
Estimated Cardinality
Frequent Items
Distribution Metrics: min, max, median, quantile values

target_summary = profile_view.to_pandas()
target_summary

Let’s do the same for our reference dataframe:

result_ref = why.log(pandas=reference_df)
profile_reference = result_ref.profile()
profile_view_reference = profile_reference.view()

There are a lot of other exciting features of whylogs that are out of scope for this short tutorial but are worth mentioning. One of these is the fact the generated profiles are mergeable, which means that the profiles produced can be combined with other profiles. In streaming systems, profiles can be captured over a mini-batch, and merged into different time granularities of data without losing statistical accuracy. This is made possible with a technique called data sketching, pioneered by Apache DataSketches. If you want to know more about this and other aspects of whylogs, feel free to check out our open-source repository!

Inspecting and comparing distributions with the Profile Visualizer

There are a number of ways we could inspect our profiles for data shifts and other data quality issues. The first one is to simply compare distribution metrics like mean, median or quantile values, which could be done with the Profile View obtained in the previous section. The downside of this approach is that there might be real cases of data shift that are not perceived by simply inspecting these metrics.

There are some other approaches we can take. Let’s go through them by using whylogs’ visualization module, the Notebook Profile Visualizer.

To do so, we can start by instantiating a visualizer, and setting the target and reference profiles obtained in the previous section:

from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=profile_view,
reference_profile_view=profile_view_reference)

Applying Statistical Tests

Instead of simply comparing distribution metrics, we can quantitatively measure data shift (or drift) by applying two-sample hypothesis testing. We can use these tests in order to compare two different sets of data to verify if both come from a common underlying distribution.

There are a number of different methods for this purpose. In this tutorial, we will use two of the most popular ones: K-S test (for numerical features) and chi-squared test (for categorical features). We can do so by simply generating a Summary Drift Report, which will yield the p-values for all of the common features between distributions, alongside other useful information, such as overall metrics, side-by-side histograms, and distribution metrics:

visualization.summary_drift_report()

The null hypothesis is that the samples are drawn from the same distribution, which means that a low p-value is indicative of different distributions. In this example, we see that drift was detected for all of our features.

What’s Next

In addition to statistical tests, there are other approaches you can take to tackle distribution shifts, such as visually inspecting histograms and distribution charts for individual features, which can be useful to confirm the disparity between distributions. In a more general topic, setting rule-based data validation is key in ensuring the quality of your data, which includes distribution changes, be it from external factors or systemic errors such as pipeline errors or missing data.

References

“Visually Inspecting Data Profiles for Data Distribution Shifts” was originally published by Open Data Science.

Felipe Adachi

Understanding and Implementing the NIST AI Risk Management Framework (RMF) with WhyLabs

Rich Young

Dec 10, 2024

Learn how the NIST AI Risk Management Framework (RMF) guides AI security and governance and discover how WhyLabs guardrails can help implement and manage AI risks effectively.

Read post

AI risk management
AI Observability
AI security
NIST RMF implementation
AI compliance
AI risk mitigation

Best Practicies for Monitoring and Securing RAG Systems in Production

Rich Young

Oct 8, 2024

Retrieval-augmented generation (RAG) systems combine advanced retrieval techniques with large language models (LLMs) to improve the responses they generate...

Read post

Retrival-Augmented Generation (RAG)
LLM Security
Generative AI
ML Monitoring
LangKit

How to Evaluate and Improve RAG Applications for Safe Production Deployment

Rich Young

Jul 17, 2024

Learn how to evaluate and improve RAG applications using LangKit and WhyLabs AI Control Center. Develop secure and reliable RAG applications.

Read post

AI Observability
LLMs
LLM Security
LangKit
RAG
Open Source

WhyLabs Integrates with NVIDIA NIM to Deliver GenAI Applications with Security and Control

WhyLabs Team

Jun 2, 2024

With WhyLabs and NVIDIA NIM, enterprises can accelerate GenAI application deployment and help ensure the safety of end-user experiences WhyLabs has been on a mission to empower enterprises with tools that ensure safe and responsible AI adoption. With its integration with NVIDIA NIM inference microservices, WhyLabs is helping make responsible AI adoption more accessible. Customers can now maintain better security and control of GenAI applications with self-hosted deployment of the most powerfu

Read post

AI Observability
Generative AI
Integrations
LLM Security
LLMs
Partnerships

OWASP Top 10 Essential Tips for Securing LLMs: Guide to Improved LLM Safety

Alessya Visnjic

May 21, 2024

Discover strategies for safeguarding your large language models (LLMs). Learn how to protect your AI technologies effectively based on OWASP's top 10 security tips.

Read post

LLMs
LLM Security
Generative AI

7 Ways to Evaluate and Monitor LLMs

WhyLabs Team

May 13, 2024

Learn about 7 techniques for evaluating & monitoring LLMs, including LLM-as-a-Judge, ML-model-as-a-Judge, and embedding-as-a-source. Improve your understanding of LLMs with these strategies.

Read post

LLMs
Generative AI

How to Distinguish User Behavior and Data Drift in LLMs

Bernease Herman

May 7, 2024

Large Language Models (LLMs) rarely provide consistent responses for the same prompts over time. In this blog we’ll demonstrate how identify and monitor data changes using a few common scenarios.

Read post

LLMs
Generative AI

Run AI with Certainty

Book a demo

Visually Inspecting Data Profiles for Data Distribution Shifts

Agenda

But what is Data Distribution Shift anyway?

Why is this a problem?

Case Study: Covariate Shift with Wine Quality Dataset

Loading the dataframes

Data logging with whylogs

Inspecting and comparing distributions with the Profile Visualizer

Applying Statistical Tests

What’s Next

References

Other posts

Understanding and Implementing the NIST AI Risk Management Framework (RMF) with WhyLabs

Best Practicies for Monitoring and Securing RAG Systems in Production

How to Evaluate and Improve RAG Applications for Safe Production Deployment

WhyLabs Integrates with NVIDIA NIM to Deliver GenAI Applications with Security and Control

OWASP Top 10 Essential Tips for Securing LLMs: Guide to Improved LLM Safety

7 Ways to Evaluate and Monitor LLMs

How to Distinguish User Behavior and Data Drift in LLMs

Run AI with Certainty

About

Resources

whylogs

WhyLabs