Data Validation at Scale – Detecting and Responding to Data Misbehavior
- Whylogs
- Open Source
Jun 6, 2023
In today’s data-driven world, companies rely heavily on data to make informed decisions and gain a competitive edge. This also means that low-quality data can have serious negative consequences for businesses. Incorrect or incomplete data can lead to poor decision-making, missed opportunities, and ultimately, financial losses. Data validation can be particularly challenging as the amount of data involved continues to grow steadily. As businesses generate and collect more data than ever before, the task of ensuring that all of this information is accurate and consistent becomes increasingly complex.
In this tutorial, we’ll introduce the concept of data logging and discuss how to validate data at scale by creating metric constraints and generating reports based on the data’s statistical profiles using the whylogs open-source package.
Case Study: Airbnb Listings in Rio de Janeiro
In this tutorial, we will validate data containing Airbnb's listing activity and metrics for Rio de Janeiro, Brazil. The dataset was adapted from the Inside Airbnb project. Let's download the dataframe with:
import pandas as pd
df_target = pd.read_parquet(
    "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/Listings/airbnb_listings_target.parquet"
)
Let’s simulate a scenario where we want to assert the quality of a batch of production data. We will define a set of data quality checks, or constraints, assuming the existence of previous domain knowledge and experience with the data. These constraints operate on top of statistical summaries of data, rather than on the raw data itself. In whylogs, these statistical summaries are called profiles, so let’s begin with a brief introduction to data logging.
Data logging with whylogs
In a production setting, we need ways of monitoring data that are scalable and efficient. For a number of reasons, such as storage requirements or privacy concerns, using raw data for debugging/monitoring purposes might not be feasible.
For this reason, we’ll leverage data logging to generate statistical summaries of our data, which we can then use to track changes in our dataset, ensure data quality and visualize key summary statistics.
First of all, we can install whylogs (with the viz extra, which we’ll use later):
pip install whylogs[viz]
Let’s first create a profile of our target dataframe:
import whylogs as why
results = why.log(df_target)
profile_view = results.profile().view()
A profile is a lightweight statistical fingerprint of your dataset, which can be stored for later use or sent over to monitoring platforms by generating a profile view. It will provide you with valuable statistics on a column (feature) basis, such as:
- Counters, such as the number of samples and null values
- Inferred types, such as integral, fractional, and boolean
- Estimated Cardinality
- Frequent Items
- Distribution Metrics: min, max, median, quantile values
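To inspect these statistics, we can convert the profile view into a pandas DataFrame with to_pandas(). A minimal sketch (the exact metric column names, such as counts/n or distribution/min, may vary across whylogs versions):
# Each row corresponds to a column of the original dataframe;
# each DataFrame column is one of the profiled metrics.
summary = profile_view.to_pandas()
print(summary.head())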
A profile can be used for several purposes, such as a) data monitoring, b) visualization, c) drift detection, and d) data validation. In the next section, we will see how to perform data validation with Metric Constraints.
Data Validation with Metric Constraints
Constraints are a powerful feature built on top of whylogs profiles that enable you to quickly and easily validate that your data looks the way it should. There are numerous types of constraints that you can set on your data (that numerical data will always fall within a certain range, that text data will always be valid JSON, etc.), and, if your dataset fails to satisfy a constraint, you can fail your unit tests or your CI/CD pipeline.
There are a number of ways to create Metric Constraints. In this example, we will use out-of-the-box helper constraints to facilitate the process.
We will create the constraints with the help of ConstraintsBuilder. That will allow us to progressively add the constraints we wish to build:
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import (
no_missing_values,
is_in_range,
smaller_than_number,
quantile_between_range,
is_non_negative,
frequent_strings_in_reference_set,
column_is_nullable_integral,
)
room_set = {"Private room", "Shared room", "Hotel room", "Entire home/apt"}
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(no_missing_values(column_name="id"))
builder.add_constraint(is_in_range(column_name="latitude", lower=-24, upper=-22))
builder.add_constraint(is_in_range(column_name="longitude", lower=-44, upper=-43))
builder.add_constraint(smaller_than_number(column_name="availability_365", number=366))
builder.add_constraint(quantile_between_range(column_name="price", quantile=0.5, lower=150, upper=437))
builder.add_constraint(is_non_negative(column_name="bedrooms"))
builder.add_constraint(column_is_nullable_integral(column_name="bedrooms"))
builder.add_constraint(frequent_strings_in_reference_set(column_name="room_type", reference_set=room_set))
constraints = builder.build()
constraints.validate()
Calling validate() will return True if all the constraints pass, and False otherwise.
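If we want a per-constraint breakdown rather than a single boolean, the constraints object can also produce a report. A minimal sketch using generate_constraints_report(), which returns one entry per constraint with pass/fail counts:
# Print each constraint's name along with how many of its
# conditions passed or failed for this profile.
for report in constraints.generate_constraints_report():
    print(f"{report.name}: passed={report.passed}, failed={report.failed}")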
We can also visualize the constraints report with the viz module. With it, you can filter the displayed constraints by name or status (pass or fail), and, if you hover over a constraint's status, it will show the additional context that was used to determine that status:
from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints)
It looks like our data meets almost all of our assertions, with the exception of one – we should check the bedrooms column and see why its type is not the one we expect.
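To dig into that failure, we can look at the bedrooms column's inferred type counters directly in the profile. A minimal sketch using the column view's summary dictionary (the key names shown assume whylogs v1 and may differ across versions):
# Inspect the type counters for the bedrooms column; a nonzero
# "types/fractional" count would explain why the
# column_is_nullable_integral constraint failed.
bedrooms_summary = profile_view.get_column("bedrooms").to_summary_dict()
print({k: v for k, v in bedrooms_summary.items() if k.startswith("types/")})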
What’s Next
In this blog post, we have explored some of the capabilities of whylogs for data validation. However, it’s worth noting that there are a number of additional features within whylogs that we haven’t covered here.
For a more in-depth view of this topic, you can sign up for my upcoming workshop at ODSC Europe, "Data Validation at Scale – Detecting and Responding to Data Misbehavior." In the workshop, we will also see how to automatically generate constraints from a reference dataset, perform row-level validation, trigger actions on failed conditions, and debug failed conditions.
“Data Validation at Scale – Detecting and Responding to Data Misbehavior” was originally published by Open Data Science.