How to Validate Data Quality for ML Monitoring
- ML Monitoring
- Data Quality
- WhyLabs
- Whylogs
Jul 27, 2022
Data quality is one of the most important considerations for machine learning applications—and it's one of the most frequently overlooked. If you're not careful, poor data quality can cause your pipeline and models to fail in production, which is not only undesirable, but also expensive.
In this post, we’ll explore why validating data quality is an essential step in the MLOps process and how to use the open source whylogs library to perform data quality monitoring in a Python environment. If you want to dive right into the code, check out the colab notebook.
Note: Although this post is focused on checking data quality specifically for ML monitoring, the techniques can be applied to any application that uses a data pipeline.
Note: Some other crucial steps for a complete model monitoring solution include measuring data drift, concept drift, and model performance. We will be covering all these topics in separate posts!
What is data quality validation?
Data quality validation ensures data is structured as expected and falls within the ranges our data pipelines or applications expect. When collecting or using data, it’s important to verify its quality to avoid unwanted machine learning behavior in production, such as errors or faulty prediction results.
For example, we may want to ensure our data doesn’t contain any empty or negative values before moving it along in the pipeline if our model does not expect those values.
There are plenty of ways to create bad data, which is why it is good practice to implement validation anywhere data could change in the pipeline. Let’s look at a few issues that could cause data quality problems.
External changes:
- An input sensor is damaged, causing bad data
- Data was manually entered incorrectly
- A camera is pointing in the wrong direction
Schema changes:
- A software library outputs a different format than before
Pipeline bugs:
- Anywhere!
Note: Data validation can also be used on machine learning training and evaluation datasets before training occurs.
Determining criteria for data quality itself can be a complex topic. A comprehensive post about how to think about data quality can be found here. The rest of this post will focus on how to implement validation on already defined criteria.
How to validate data quality in Python
Now that we’ve covered why validating data quality is a crucial step in the machine learning monitoring process, let's take a look at some ways we can implement it in a Python environment.
These examples use whylogs, an open source data logging library, to create lightweight profiles that contain statistical summaries of data for measuring data quality, data drift, and model drift in any Python environment.
Profiles in whylogs are efficient, mergeable, and only contain essential summary statistics. This allows them to be used in almost any application, whether it requires processing data in batches or streams, or in sensitive industries like healthcare and finance. Read more about whylogs on github.
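For instance, because profiles are mergeable, summaries logged from separate batches can be combined into a single view. Here’s a minimal sketch of that idea (it assumes whylogs 1.x, where profile views expose a merge method):

import pandas as pd
import whylogs as why

# profile two separate batches of data
batch_1 = pd.DataFrame({"weight": [4.3, 1.8]})
batch_2 = pd.DataFrame({"weight": [1.3, 4.1]})
view_1 = why.log(batch_1).view()
view_2 = why.log(batch_2).view()

# merge the two lightweight profiles into a single statistical summary
merged_view = view_1.merge(view_2)
merged_view.to_pandas()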
Installing whylogs in a Python environment is easy: just pip install whylogs. In this case we’ll use pip install whylogs[viz] to install the additional visualization dependencies.
Checking data quality with metric constraints
The feature in whylogs for performing data validation is called constraints. Constraints can be set on any default whylogs profile metric, such as counts, types, distribution, and cardinality, or on user-defined custom metrics. If the dataset doesn’t meet the set criteria, it can fail a unit test or CI/CD pipeline.
The quick example below is for everyone thinking “Just give me the code already!”. But we’ll discuss it in more detail in the next sections.
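Here’s a minimal sketch of that quick example (it assumes a pandas DataFrame df with a weight column; the same steps are built up in detail in the sections below):

import whylogs as why
from whylogs.core.constraints import ConstraintsBuilder, MetricsSelector, MetricConstraint

# create a whylogs profile from the dataframe
profile_view = why.log(df).view()

# build a constraint: the minimum of the weight column should be greater than 0
builder = ConstraintsBuilder(profile_view)
builder.add_constraint(MetricConstraint(
    name="weight > 0",
    condition=lambda x: x.min > 0,
    metric_selector=MetricsSelector(metric_name='distribution',
                                    column_name='weight')
))

# build the constraints and return the report
constraints = builder.build()
constraints.report()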
This example creates a whylogs profile and checks that the min value in the weight column is above 0. Let’s break down what’s going on here:
- Import whylogs and constraints
- Create a whylogs profile (More on this in the next section)
- Build a constraint by assigning values to name, condition, and metric_selector:
- Name can be any string to describe the constraint (this column value should be greater than 0)
- Condition is a lambda expression based on metric conditionals (the min value should be greater than 0)
- Metric_selector assigns which type of metric is being measured (distribution, counts, types, etc.)
- Build the constraints and return the report
The report will return a list with a tuple for each constraint, such as [('col_name > 0', 0, 1)]. This can be read as [('Name', Pass, Fail)], so this example would indicate the constraint failed.
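Since the report is just a list of (name, passed, failed) tuples, wiring it into a unit test or CI step can be as simple as an assertion. A minimal sketch (the check_constraints helper name is ours):

def check_constraints(constraints):
    # collect the names of any constraints that recorded a failure
    failures = [name for name, passed, failed in constraints.report() if failed > 0]
    # fail the test (or CI step) loudly if anything in the report failed
    assert not failures, f"Data quality constraints failed: {failures}"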
Now that we’ve had a rapid introduction to using whylogs constraints, let's dive deeper into all the details!
Creating a data profile with whylogs
First, let's create a data profile with whylogs and inspect it to understand the metrics we can use for data validation.
We’ll create a simple toy dataset to get started with this example.
# import whylogs and pandas
import whylogs as why
import pandas as pd
# Set to show all columns in dataframe
pd.set_option("display.max_columns", None)
# create object with toy data
data = {
    "animal": ["cat", "hawk", "snake", "cat", "mosquito"],
    "legs": [4, 2, 0, 4, None],
    "weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
}

# create dataframe from toy data
df = pd.DataFrame(data)
Create the whylogs profile
# create whylogs profile & view
results = why.log(df)
profile_view = results.view()
View the whylogs profile for summary metrics of the dataset (counts, types, cardinality, distribution, etc.).
# Display profile view in a dataframe
profile_view.to_pandas()
Now that we have a whylogs profile, we can explore distributions, perform data quality validation, or compare with other profiles to measure other model monitoring metrics like data drift.
Read more about the different metrics in the whylogs docs.
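For example, to pull a single column’s distribution metric out of the profile, something like this should work (a sketch assuming the whylogs 1.x column and metric accessors):

# grab the distribution metric for the weight column and view its summary
weight_distribution = profile_view.get_column("weight").get_metric("distribution")
weight_distribution.to_summary_dict()  # min, max, mean, quantiles, etc.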
Using whylogs constraints to validate data quality
Let's start by creating a single constraint to check that the minimum value in the weight distribution of our dataset is above 0. This is similar to the example above.
# Import constraints from whylogs
from whylogs.core.constraints import (Constraints,
                                      ConstraintsBuilder,
                                      MetricsSelector,
                                      MetricConstraint)
builder = ConstraintsBuilder(profile_view)
# Define a constraint for validating data
builder.add_constraint(MetricConstraint(
    name="weight > 0",
    condition=lambda x: x.min > 0,
    metric_selector=MetricsSelector(metric_name='distribution',
                                    column_name='weight')
))
# Build the constraints and return the report
constraints: Constraints = builder.build()
constraints.report()
Now let’s add another constraint to make sure no null (empty) values are present in the legs column.
# Create another constraint
builder.add_constraint(MetricConstraint(
    name="Legs null == 0",
    condition=lambda x: x.null.value == 0,
    metric_selector=MetricsSelector(metric_name='counts',
                                    column_name='legs')
))
constraints: Constraints = builder.build()
constraints.report()
This will return [('weight > 0', 1, 0), ('Legs null == 0', 0, 1)], showing that the first test passed and the second one failed.
We can also visualize a constraint report by using the whylogs notebook profile visualizer.
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
This will return an easy-to-read visual report for the validation results.
Reading code is fine, but running code is fun! You can find ready-to-run versions of these examples in a Google Colab notebook. Try changing the toy dataset and constraint values.
Seeing it all together
Before moving on, let’s look at what the code looks like all in one place with a third constraint added.
import whylogs as why
import pandas as pd
from whylogs.core.constraints import (Constraints,
                                      ConstraintsBuilder,
                                      MetricsSelector,
                                      MetricConstraint)
# create toy dataframe (same data as above)
df = pd.DataFrame({
    "animal": ["cat", "hawk", "snake", "cat", "mosquito"],
    "legs": [4, 2, 0, 4, None],
    "weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
})

# create whylogs profile & initialize constraint builder
profile_view = why.log(df).view()
builder = ConstraintsBuilder(profile_view)
# define constraints for validating data
builder.add_constraint(MetricConstraint(
    name="weight >= 0",
    condition=lambda x: x.min >= 0,
    metric_selector=MetricsSelector(metric_name='distribution',
                                    column_name='weight')
))
builder.add_constraint(MetricConstraint(
    name="Legs null == 0",
    condition=lambda x: x.null.value == 0,
    metric_selector=MetricsSelector(metric_name='counts',
                                    column_name='legs')
))
builder.add_constraint(MetricConstraint(
    name="count >= 1000",
    condition=lambda x: x.n.value >= 1000,
    metric_selector=MetricsSelector(metric_name='counts', column_name='animal')
))
# Build the constraints and return the report
constraints: Constraints = builder.build()
constraints.report()
This returns a list of three tuples: [('weight >= 0', 1, 0), ('Legs null == 0', 0, 1), ('count >= 1000', 0, 1)], which shows the first test passes while the second and third fail.
Again, we can also use the visualizer to get a very readable output.
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
There you have it! Constraints like these can be set up at various points in a pipeline to check data quality. Custom metrics in whylogs can also be used for virtually any type of data, making it easier to perform ML monitoring on complex unstructured data for computer vision or NLP.
Data quality validation is just one step that can be used for monitoring machine learning models. Other types of failures can happen over time without monitoring for other changes such as data drift and model drift (which can also be implemented in whylogs).
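As a quick sketch of that last point, the whylogs notebook visualizer can compare two profiles and summarize drift between them (df_new here is a hypothetical second DataFrame of newer data):

import whylogs as why
from whylogs.viz import NotebookProfileVisualizer

# profile the original data and a newer batch of data
reference_view = why.log(df).view()
target_view = why.log(df_new).view()

# compare the two profiles and render a drift summary report
visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=target_view,
                           reference_profile_view=reference_view)
visualization.summary_drift_report()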
Machine learning monitoring over time
Checking data quality with validation using tools like whylogs is a great start to implementing ML monitoring in the MLOps pipeline, but many projects require monitoring data changes over time. This can be done by using an AI observability platform, and whylogs integrates easily with WhyLabs to do just that!
WhyLabs takes a unique privacy-preserving approach to ML monitoring by using statistical profiles generated with whylogs to measure data and model health metrics, such as data drift, data quality, output variation, and model performance. Since whylogs profiles are efficient, mergeable, and only contain essential summary statistics, they can be used to monitor almost any application, whether it processes data in batches or streams, or operates in sensitive industries such as healthcare and finance.
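As a rough sketch, writing a whylogs profile to WhyLabs looks something like this (the org ID, dataset ID, and API key below are placeholders you’d replace with values from your WhyLabs account settings):

import os
import whylogs as why

# placeholder credentials from your WhyLabs account settings
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-0"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-1"
os.environ["WHYLABS_API_KEY"] = "YOUR-API-KEY"

# profile the data and upload the profile to WhyLabs for monitoring over time
results = why.log(df)
results.writer("whylabs").write()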
The data validation conclusion
The key takeaway from this post is that you need to make sure your data is reliable and accurate. Data quality validation is an important step in machine learning monitoring, and thankfully, it's easy to implement with whylogs.
We’ve covered:
- Why data quality validation is important for ML monitoring
- How to implement data quality validation with whylogs
- A brief overview of monitoring ML data over time
Start your ML monitoring journey
Sign up for a WhyLabs account to monitor datasets and ML models for free, no credit card required. Or schedule a demo to learn more.
If you have any questions, reach out to me on the Robust & Responsible AI Slack community!
References:
- Data Validation with whylogs Colab Notebook
- A Comprehensive Overview Of Data Quality Monitoring
- Constraint examples with whylogs
- whylogs github
- WhyLabs - free sign-up