
How to Validate Data Quality for ML Monitoring

Data quality is one of the most important considerations for machine learning applications—and it's one of the most frequently overlooked. If you're not careful, poor data quality can cause your pipeline and models to fail in production, which is not only undesirable, but also expensive.

In this post, we’ll explore why validating data quality is an essential step in the MLOps process and how to use the open source whylogs library to perform data quality monitoring in a Python environment. If you want to dive right into the code, check out the Colab notebook.

Note: Although this post is focused on checking data quality specifically for ML monitoring, the techniques can be applied to any application that uses a data pipeline.

Sample of using whylogs constraints for data quality validation | Source: Author

Note: Some other crucial steps for a complete model monitoring solution include measuring data drift, concept drift, and model performance. We will be covering all these topics in separate posts!

What is data quality validation?

Data quality validation ensures that data is structured correctly and falls within the ranges expected by our data pipelines or applications. When collecting or using data, it’s important to verify its quality to avoid unwanted machine learning behavior in production, such as errors or faulty prediction results.

For example, we may want to ensure our data doesn’t contain any empty or negative values before moving it along in the pipeline if our model does not expect those values.
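As a sketch of that idea in plain Python, before introducing any library, a minimal pre-flight check might look like the following. The `validate_batch` helper and its messages are hypothetical, purely to illustrate the kind of rule we will later express as whylogs constraints:

```python
def validate_batch(values, name):
    """Collect basic data quality issues for one feature column.
    (Hypothetical helper for illustration only.)"""
    issues = []
    if any(v is None for v in values):
        issues.append(f"{name} contains null values")
    if any(v is not None and v < 0 for v in values):
        issues.append(f"{name} contains negative values")
    return issues

weights = [4.3, 1.8, None, -0.5]
print(validate_batch(weights, "weight"))
# → ['weight contains null values', 'weight contains negative values']
```

Hand-rolled checks like this work for one column, but they don't scale to whole datasets or pipelines, which is where a profiling library earns its keep.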

There are plenty of ways to create bad data, which is why it is good practice to implement validation anywhere data could change in the pipeline. Let’s look at a few issues that could cause data quality problems.

External changes:

  • An input sensor is damaged causing bad data
  • Data was manually entered incorrectly
  • A camera is pointing in the wrong direction

Schema changes:

  • A software library outputs a different format than before

Pipeline bugs:

  • Anywhere!

Note: Data validation can also be used on machine learning training and evaluation datasets before training occurs.

Determining criteria for data quality itself can be a complex topic. A comprehensive post about how to think about data quality can be found here. The rest of this post will focus on how to implement validation on already defined criteria.

How to validate data quality in Python

Now that we’ve covered why validating data quality is a crucial step in the machine learning monitoring process, let's take a look at some ways we can implement it in a Python environment.

These examples use whylogs, an open source data logging library, to create lightweight profiles containing statistical summaries of data for measuring data quality, data drift, and model drift in any Python environment.

Profiles in whylogs are efficient, mergeable, and only contain essential summary statistics. This allows them to be used in almost any application, whether it requires processing data in batches or streams, or in sensitive industries like healthcare and finance. Read more about whylogs on GitHub.


Installing whylogs in a Python environment is easy. Just use `pip install whylogs`. In this case we’ll use `pip install whylogs[viz]` to install the additional visualization dependencies.

Checking data quality with metric constraints

The feature in whylogs for performing data validation is called constraints. Constraints can be set on any default whylogs profile metric, such as counts, types, distribution, and cardinality, or on user-defined custom metrics. If the dataset doesn’t meet the set criteria, it can fail a unit test or CI/CD pipeline.

The quick example below is for everyone thinking “Just give me the code already!”. But we’ll discuss it in more detail in the next sections.

# 1 Import whylogs & whylogs constraints
import whylogs as why
from whylogs.core.constraints import (Constraints,
                                     ConstraintsBuilder,
                                     MetricsSelector,
                                     MetricConstraint)
 
# 2 Create whylogs profile & initialize constraint builder
profile_view = why.log(df).view()
builder = ConstraintsBuilder(profile_view)
 
# 3 Define a constraint for validating data
builder.add_constraint(MetricConstraint(
   name="weight >= 0",
   condition=lambda x: x.min >= 0,
   metric_selector=MetricsSelector(metric_name='distribution',
                                   column_name='weight')
))
 
# 4 Build the constraints and return the report
constraints: Constraints = builder.build()
constraints.report()

Run this code in Google Colab

This example creates a whylogs profile and checks that the min value in the weight column is greater than or equal to 0. Let’s break down what’s going on here:

  1. Import whylogs and constraints
  2. Create a whylogs profile (More on this in the next section)
  3. Build a constraint by assigning values to name, condition, and metric_selector:
    - Name can be any string describing the constraint (here, the weight value should be greater than or equal to 0)
    - Condition is a lambda expression based on metric conditionals (the min value should be greater than or equal to 0)
    - Metric_selector assigns which type of metric is being measured (distribution, counts, types, etc.)
  4. Build the constraints and return the report

The report will return a list with a tuple for each constraint, such as `[('col_name >= 0', 0, 1)]`. Each tuple reads as `('Name', Passes, Failures)`, so this example would indicate the constraint failed.
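Because each report entry is a `(name, passes, failures)` tuple, wiring the report into a unit test or CI/CD gate can be as simple as raising whenever a failure count is non-zero. The helper below is a sketch under that assumption (the function name is ours, not part of the whylogs API), shown against a hand-written report rather than one generated from a real profile:

```python
def raise_on_failed_constraints(report):
    """Raise if any (name, passes, failures) entry records a failure."""
    failed = [name for name, passes, failures in report if failures > 0]
    if failed:
        raise ValueError(f"Data quality constraints failed: {failed}")

# Hand-written report in the same shape constraints.report() returns
report = [("weight >= 0", 1, 0), ("Legs null == 0", 0, 1)]
try:
    raise_on_failed_constraints(report)
except ValueError as err:
    print(err)
# → Data quality constraints failed: ['Legs null == 0']
```

Dropping a check like this into a test suite means a bad batch of data stops the pipeline instead of silently reaching the model.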

Now that we’ve had a rapid introduction to using whylogs constraints, let's dive deeper into all the details!

Creating a data profile with whylogs

First, let's create a data profile with whylogs and inspect it to understand the metrics we can use for data validation.

We’ll create a simple toy dataset to get started with this example.

# import whylogs and pandas
import whylogs as why
import pandas as pd
 
# Set to show all columns in dataframe
pd.set_option("display.max_columns", None)
 
# create object with toy data
data = {
   "animal": ["cat", "hawk", "snake", "cat", "mosquito"],
   "legs": [4, 2, 0, 4, None],
   "weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
}
 
#create dataframe from toy data
df = pd.DataFrame(data)

Create the whylogs profile

# create whylogs profile & view
results = why.log(df)
profile_view = results.view()

View the whylogs profile for summary metrics of the dataset (counts, types, cardinality, distribution, etc.).

# Display profile view in a dataframe
profile_view.to_pandas()

Now that we have a whylogs profile, we can explore distributions, perform data quality validation, or compare with other profiles to measure other model monitoring metrics like data drift.

Read more about the different metrics in the whylogs docs.

Using whylogs constraints to validate data quality

Let's start by creating a single constraint to check that the minimum value in the weight distribution of our dataset is above 0. This is similar to the example above.

# Import constraints from whylogs
from whylogs.core.constraints import (Constraints,
                                     ConstraintsBuilder,
                                     MetricsSelector,
                                     MetricConstraint)
 
builder = ConstraintsBuilder(profile_view)
 
# Define a constraint for validating data
builder.add_constraint(MetricConstraint(
   name="weight > 0",
   condition=lambda x: x.min > 0,
   metric_selector=MetricsSelector(metric_name='distribution',
                                   column_name='weight')
))
# Build the constraints and return the report
constraints: Constraints = builder.build()
constraints.report()

Now let’s add another constraint to make sure no null (empty) values are present in the legs column.

# Create another constraint
builder.add_constraint(MetricConstraint(
   name="Legs null == 0",
   condition=lambda x: x.null.value == 0,
   metric_selector=MetricsSelector(metric_name='counts',
                                   column_name='legs')
))
 
constraints: Constraints = builder.build()
constraints.report()

This will return `[('weight > 0', 1, 0), ('Legs null == 0', 0, 1)]`, showing that the first test passed and the second one failed.

We can also visualize a constraint report by using the whylogs notebook profile visualizer.

from whylogs.viz import NotebookProfileVisualizer
 
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)

This will return an easy-to-read visual report for the validation results.

Reading code is fine, but running code is fun! You can find these examples that are ready to run in a Google colab notebook. Try changing the toy dataset and constraint values.

Seeing it all together

Before moving on, let’s look at what the code looks like all in one place with a third constraint added.

import whylogs as why
import pandas as pd
from whylogs.core.constraints import (Constraints,
                                     ConstraintsBuilder,
                                     MetricsSelector,
                                     MetricConstraint)
 
# create whylog profile & initialize constraint builder
profile_view = why.log(df).view()
builder = ConstraintsBuilder(profile_view)
 
# define constraints for validating data
builder.add_constraint(MetricConstraint(
   name="weight >= 0",
   condition=lambda x: x.min >= 0,
   metric_selector=MetricsSelector(metric_name='distribution',
                                   column_name='weight')
))
 
builder.add_constraint(MetricConstraint(
   name="Legs null == 0",
   condition=lambda x: x.null.value == 0,
   metric_selector=MetricsSelector(metric_name='counts',
                                   column_name='legs')
))
 
 
builder.add_constraint(MetricConstraint(
   name="count >= 1000",
   condition=lambda x: x.n.value >= 1000,
   metric_selector=MetricsSelector(metric_name='counts', column_name='animal')
))
 
# Build the constraints and return the report
constraints: Constraints = builder.build()
constraints.report()

This returns a list of three tuples: `[('weight >= 0', 1, 0), ('Legs null == 0', 0, 1), ('count >= 1000', 0, 1)]`, which shows the first test passes while the second and third fail.

Again, we can also use the visualizer to get a very readable output.

from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)

There you have it! Constraints similar to this can be set up at various spots in a pipeline to check data quality. Custom metrics in whylogs can also be used for virtually any type of data, making it easier to perform ML monitoring on complex unstructured data for computer vision or NLP.

Data quality validation is just one step that can be used for monitoring machine learning models. Other types of failures can happen over time without monitoring for other changes such as data drift and model drift (which can also be implemented in whylogs).

Machine learning monitoring over time

Checking data quality with validation using tools like whylogs is a great start to implementing ML monitoring in the MLOps pipeline, but many projects require monitoring data changes over time. This can be done with an AI observability platform, and whylogs integrates easily with WhyLabs to do just that!

WhyLabs takes a unique privacy-preserving approach to ML monitoring by using statistical profiles generated with whylogs to measure data and model health metrics, such as data drift, data quality, output variation, and model performance. Since whylogs profiles are efficient, mergeable, and only contain essential summary statistics, they can be used to monitor almost any application, whether it requires processing data in batches or streams, or in sensitive industries, such as healthcare and finance.

The data validation conclusion

The key takeaway from this post is that you need to make sure your data is reliable and accurate. Data quality validation is an important step in machine learning monitoring, and thankfully, it's easy to implement with whylogs.

We’ve covered:

  • Why data quality validation is important for ML monitoring
  • How to implement data quality validation with whylogs
  • A brief overview of monitoring ML data over time

Start your ML monitoring journey

Sign up for a WhyLabs account to monitor datasets and ML models for free, no credit card required. Or schedule a demo to learn more.

If you have any questions, reach out to me on the Robust & Responsible AI Slack community!

