
Data Validation at Scale – Detecting and Responding to Data Misbehavior

In today’s data-driven world, companies rely heavily on data to make informed decisions and gain a competitive edge. This also means that low-quality data can have serious negative consequences for businesses. Incorrect or incomplete data can lead to poor decision-making, missed opportunities, and ultimately, financial losses. Data validation can be particularly challenging as the amount of data involved continues to grow steadily. As businesses generate and collect more data than ever before, the task of ensuring that all of this information is accurate and consistent becomes increasingly complex.

In this tutorial, we’ll introduce the concept of data logging and discuss how to validate data at scale by creating metric constraints and generating reports based on the data’s statistical profiles using the whylogs open-source package.

Case Study: Airbnb Listings in Rio de Janeiro

In this tutorial, we will validate data containing Airbnb’s listing activity and metrics for Rio de Janeiro, Brazil. The dataset was adapted from the Inside Airbnb project. Let’s download the DataFrame with:

import pandas as pd

df_target = pd.read_parquet(
    "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/Listings/airbnb_listings_target.parquet"
)

Let’s simulate a scenario where we want to assert the quality of a batch of production data. We will define a set of data quality checks, or constraints, assuming the existence of previous domain knowledge and experience with the data. These constraints operate on top of statistical summaries of data, rather than on the raw data itself. In whylogs, these statistical summaries are called profiles, so let’s begin with a brief introduction to data logging.

Data logging with whylogs

In a production setting, we need ways of monitoring data that are scalable and efficient. For a number of reasons, such as storage requirements or privacy concerns, using raw data for debugging/monitoring purposes might not be feasible.

For this reason, we’ll leverage data logging to generate statistical summaries of our data, which we can then use to track changes in our dataset, ensure data quality and visualize key summary statistics.

First of all, we can install whylogs (with the viz extra, which we’ll use later):

pip install whylogs[viz]

Let’s first create a profile of our target dataframe:

import whylogs as why
results = why.log(df_target)
profile_view = results.profile().view()

A profile is a lightweight statistical fingerprint of your dataset, which can be stored for later use or sent over to monitoring platforms by generating a profile view. It will provide you with valuable statistics on a column (feature) basis, such as:

  • Counters, such as the number of samples and null values
  • Inferred types, such as integral, fractional, and boolean
  • Estimated Cardinality
  • Frequent Items
  • Distribution Metrics: min, max, median, quantile values

A profile can be used for several purposes, such as a) data monitoring, b) visualization, c) drift detection, and d) data validation. In the next section, we will see how to perform data validation with Metric Constraints.

Data Validation with Metric Constraints

Constraints are a powerful feature built on top of whylogs profiles that enable you to quickly and easily validate that your data looks the way it should. There are numerous types of constraints you can set on your data (that numerical data will always fall within a certain range, that text data will always be valid JSON, etc.), and if your dataset fails to satisfy a constraint, you can fail your unit tests or your CI/CD pipeline.

There are a number of ways to create Metric Constraints. In this example, we will use out-of-the-box helper constraints to facilitate the process.

We will create the constraints with the help of ConstraintsBuilder. That will allow us to progressively add the constraints we wish to build:

from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import (
    no_missing_values,
    is_in_range,
    smaller_than_number,
    quantile_between_range,
    is_non_negative,
    frequent_strings_in_reference_set,
    column_is_nullable_integral,
)
# Reference set of expected room types for the categorical check
room_set = {"Private room", "Shared room", "Hotel room", "Entire home/apt"}

builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(no_missing_values(column_name="id"))
builder.add_constraint(is_in_range(column_name="latitude", lower=-24, upper=-22))
builder.add_constraint(is_in_range(column_name="longitude", lower=-44, upper=-43))
builder.add_constraint(smaller_than_number(column_name="availability_365", number=366))
builder.add_constraint(quantile_between_range(column_name="price", quantile=0.5, lower=150, upper=437))
builder.add_constraint(is_non_negative(column_name="bedrooms"))
builder.add_constraint(column_is_nullable_integral(column_name="bedrooms"))
builder.add_constraint(frequent_strings_in_reference_set(column_name="room_type", reference_set=room_set))

# Build the constraint suite and check it against the profile
constraints = builder.build()
constraints.validate()

Calling validate() will return True if all the constraints pass, and False otherwise.
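That boolean makes it easy to wire the checks into a unit test or CI/CD step, and a per-constraint report helps pinpoint which checks failed. Below is a minimal sketch, assuming generate_constraints_report() returns per-constraint results with a name and pass/fail counts:

# Fail the pipeline (or test) if any constraint is violated
if not constraints.validate():
    # Each entry reports a constraint name plus how many times it passed/failed
    for report in constraints.generate_constraints_report():
        if report.failed:
            print(f"FAILED: {report.name}")
    raise ValueError("Data quality constraints failed for this batch")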

We can also visualize the constraints report with the viz module. The report lets you filter the displayed constraints by name or status (pass or fail), and hovering over a constraint’s status shows the additional context used to determine it:

from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints)
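If you want to keep the report as an artifact (for example, in a CI run), the viz module can also write the rendered report to a standalone HTML file. A sketch, assuming NotebookProfileVisualizer.write accepts the rendered report and an output file name:

# Persist the constraints report as a standalone HTML file
# (write() usage assumed from whylogs' viz module)
visualization.write(
    rendered_html=visualization.constraints_report(constraints),
    html_file_name="constraints_report",
)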

It looks like our data meets almost all of our assertions, with the exception of one – we should check the bedrooms column and see why its type is not the one we expect.
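One way to dig into that failure is to look at the metrics the profile recorded for the bedrooms column; the type counters will show whether whylogs inferred it as integral, fractional, or something else. A minimal sketch (the summary keys assume whylogs’ default metric naming):

# Inspect the metrics captured for the "bedrooms" column
bedrooms_summary = profile_view.get_column("bedrooms").to_summary_dict()

# Print the inferred type and count metrics,
# e.g. types/integral, types/fractional, counts/null
for key, value in bedrooms_summary.items():
    if key.startswith("types/") or key.startswith("counts/"):
        print(key, value)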

What’s Next

In this blog post, we have explored some of the capabilities of whylogs for data validation. However, it’s worth noting that there are a number of additional features within whylogs that we haven’t covered here.

For a more in-depth view of this topic, you can sign up for my upcoming workshop at ODSC Europe, “Data Validation at Scale – Detecting and Responding to Data Misbehavior.” In the workshop, we will also cover automatically generating constraints from a reference dataset, row-level validation, triggering actions on failed conditions, and debugging failed conditions.

“Data Validation at Scale – Detecting and Responding to Data Misbehavior” was originally published by Open Data Science.
