5 Ways to Inspect Data & Models with whylogs Profile Visualizer

Spot Data Drift and Ensure Machine Learning Model Quality

This blog was originally published in Towards Data Science by Kathryn Hurchla.

The open-source whylogs profile visualizer enables visual observability of data in machine learning and data engineering with Python. Adding visualization to data observability distills and boosts the signals captured in whylogs data logging profiles, helping you understand what’s happening in your data, identify and correct issues quickly, and maintain the quality and relevance of high-performing data and machine learning models. The profile visualizer runs right in a Jupyter Notebook along with your logging code and extends the interpretative capabilities of the open-source whylogs library, which is quickly becoming the standard for data logging and monitoring of ML models and data pipelines.

As a member of the Robust & Responsible AI Community on Slack, and being pretty vocal about my passion for visual analytics, as usual, I was introduced to the profile visualizer there by members of the WhyLabs team as it was being developed. This further sparked my curiosity about how visualization fits in with closing a gap that exists in data observability once a dataset or model is in production — watching how it’s behaving out in the world.

Felipe Adachi wrote the example code notebook that’s been the basis for me to learn to use the profile visualizer. I’ve highlighted excerpts from it in this post and extended it with additional examples you can try from my own notebook. Both complement this post step by step. As maintainers of whylogs and the profile visualizer, Felipe and Danny Leybzon both generously answered my questions and reviewed this article for accuracy against the developing v1 code base that was just released.

Learn to effectively use the profile visualizer

  • to examine data logged at key input and output junctures in your pipeline
  • to spot differences in the distributions of two data profiles at a glance
  • to quickly pinpoint data drift in your machine learning operations

Get started!

Review of Logging and Statistical Profiles

Before diving into the profile visualizer, it helps to briefly review the value of data logging and how whylogs profiles work. Data pipelines today, with the integration of machine learning and artificial intelligence, flow with unsurpassed acceleration and at a scale that can make keeping an eye on your data difficult. For the same reasons, monitoring your data is more important than ever to maintain its integrity.

The first step is capturing logs close to your data workflow and at the points where data enters or exits the pipeline or goes through major transformations that could introduce or highlight issues, such as missing values or changes in data types upstream that have not been accounted for. In order to use the profile visualizer, this means logging data profiles with Python.

Images by cottonbro via Pexels modified to combine as permitted under license to the original author.

The profile visualizer can be used with any kind of data you capture in whylogs profiles to detect different drift-causing scenarios and other data quality and model performance issues. That includes structured or unstructured data, and batch or streaming data. The whylogs profiles are lightweight summaries of your data that are well-calibrated to the scope and degree of approximation appropriate for data monitoring, no more and no less than necessary. Of particular relevance to monitoring and visualization are the lightweight nature of whylogs profiles and their greater sensitivity to outliers as compared to traditional sampling methods. Rare events that could indicate a quality issue are easy to spot.

Logging data in more detail than necessary, however, could easily become a resource drag on your team and infrastructure and a new form of technical debt. Choosing to log these aggregate statistical profiles instead of storing even small samples of detailed data could also spare you the risk of hefty legal fines as regulations around data ownership and privacy tighten.
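To make the idea of an aggregate profile concrete, here is a minimal sketch in plain Python of a summary that is updated one value at a time and never stores the rows it has seen. It is an illustration only: the `ColumnSummary` class and its fields are hypothetical names, and this is not how whylogs builds its profiles internally.

```python
import math
from dataclasses import dataclass


@dataclass
class ColumnSummary:
    """Hypothetical running summary of one numeric column."""

    count: int = 0          # values seen, including missing ones
    nulls: int = 0          # missing values seen
    total: float = 0.0      # running sum of non-missing values
    minimum: float = math.inf
    maximum: float = -math.inf

    def track(self, value):
        """Update the summary with one value; the value itself is not kept."""
        self.count += 1
        if value is None:
            self.nulls += 1
            return
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self):
        seen = self.count - self.nulls
        return self.total / seen if seen else math.nan


# Made-up readings with one missing value
summary = ColumnSummary()
for reading in [0.5, None, 0.3, 0.1]:
    summary.track(reading)

print(summary.count, summary.nulls, summary.minimum, summary.maximum)
```

However many values pass through `track`, the summary stays the same size, which is the property that keeps a profile lightweight compared with storing samples of the raw data.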

How to Visually Examine Statistics of a Single Profile

First, see how any feature of a single whylogs data profile looks with turnkey feature statistics. Performing this kind of univariate analysis, or analysis of one feature, is a quick way to describe your data during exploratory analysis, anticipate patterns, and understand key aspects like cardinality, central tendency, dispersion, variability, and the occurrence of missing values.

A Jupyter Notebook containing full code for this and other examples in this post can be opened in Colab or downloaded or forked with other examples in the same GitHub repository. I refer to data features throughout this post because I approach the visualizer more from data science than engineering. Whether you call these data variables, features, or columns of data you can use these charts and reports to get eyes on your data and get ahead of issues.

Prepare a Notebook for profile visualizer

First, install the recently released whylogs v1 package along with the viz extra to use the profile visualizer. I use a macOS machine that runs the zsh shell by default in its Terminal, so running the first code cell below does the job for me. If you have previously installed whylogs, be sure to include the --upgrade option because the profile visualizer has been available only since the recent whylogs v1 release.

Install Dependencies

To use the profile visualizer, install whylogs v1 with the extra package viz.

You can install it directly in your notebook by prepending the terminal/shell command with an exclamation point (!).

# in a notebook, if you are using zsh,
# escape the square brackets with backslashes
!pip install -q --upgrade whylogs\[viz\]
# # or directly in a terminal/shell window with zsh
# pip install --upgrade whylogs\[viz\]
# # or in a notebook with bash
# !pip install -q --upgrade whylogs[viz]
# # or directly in a terminal/shell window with bash
# pip install --upgrade whylogs[viz]

This code gist also shows the install command for a few different scenarios. Your operating system and where you choose to run the install command affect which syntax will work for you. With the Z shell (zsh), it’s necessary to escape the square brackets around the viz extra to prevent them from being interpreted as a pattern. To install packages directly in your Jupyter Notebook, prepend the command with an exclamation point; in that case, try adding the -q (quiet) option if you don’t want to see all of the automatic output from the installation process.

Load Data for the Examples

An example of comparing two profile views from a well-known machine learning dataset is where things get really interesting with profile visualizer! Load the data then print a concise summary of the pandas dataframe.

import pandas as pd
pd.options.mode.chained_assignment = None  # Disabling false positive warning

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine = pd.read_csv(url,sep=";")
wine.info()

The output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

Notice that the quality feature contains integer values. You will transform that into a categorical data type during this demo to make quick charts to see your data through a lens of ‘good’ or ‘bad’.

Initialize the profile visualizer

After you have imported the Python libraries you need for this example, loaded the wine quality dataset, and logged your dataframe in a whylogs profile, generate a profile view to use with the profile visualizer. Then, instantiate `NotebookProfileVisualizer` and set your target and reference profiles. The `prof_view` and `prof_view_ref` objects used here are created in the comparison section later in this post. Most of the code featured in this post is also available in a single Jupyter Notebook. Select the Open in Colab button there or download the notebook directly.

from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=prof_view, reference_profile_view=prof_view_ref)

Once that’s done, a single line of code gets you a variety of useful summary statistics, as shown in this example based on the citric acid feature.

visualization.feature_statistics(feature_name="citric acid", profile="target")
Statistics describe your data. Each type of summary is a complement to the others. Image by the original author.

Feature statistics are organized with a top-level band of the most basic but still highly informative data aggregations. The percentage of values that are distinct can indicate a degree of consistency (high %) or inconsistency (low %). Here, you see that only 18% of citric acid values in the Target profile are distinct.

Quantile statistics speak to how your data are distributed across the range of values that occur. The Interquartile range (IQR) of citric acid in this portion of the UCI Wine dataset is 0.3, which is low relative to 1.0. In your mental image of the data, the middle 50% of your values should then be clustered close to each other around the median. Conversely, a higher IQR such as 0.8 would show values spread more widely across the range.

Among the Descriptive statistics, the Coefficient of variation (CV) is the Standard deviation divided by the Mean. As a ratio, it’s independent of the unit scale of its feature and can be compared to the CV of another feature in the data as long as both features have continuous numerical scales that start at zero. For example, a 0.72 or 72% CV of citric acid indicates widely dispersed values. If you compare that to another feature’s CV of 0.30, for example, you learn that citric acid is more widely dispersed than the other feature. However, the CV is sensitive to small differences in the Mean when it nears zero, and in those cases the Standard deviation itself is more informative, so it’s important to consider all of these statistics in context.
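If you want to check your reading of these three statistics by hand, the sketch below computes them with Python's standard library on a small made-up list. The `describe_feature` helper is a hypothetical name, and the numbers will not match the visualizer exactly: the choice of quantile method here is an assumption, and whylogs works from approximate profiles rather than raw values.

```python
import statistics


def describe_feature(values):
    """Return distinct %, IQR, and coefficient of variation for numeric values."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3
    q1, _median, q3 = statistics.quantiles(values, n=4)
    mean = statistics.mean(values)
    return {
        "distinct_pct": 100 * len(set(values)) / len(values),
        "iqr": q3 - q1,
        # sample standard deviation divided by the mean
        "cv": statistics.stdev(values) / mean,
    }


# Made-up measurements, not taken from the wine dataset
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = describe_feature(sample)
print(stats)
```

Because the CV divides by the mean, the helper is only meaningful for features measured on a scale that starts at zero and whose mean is not close to zero, which is the caveat described above.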

How to Compare Two Data Profiles

With the profile visualizer comes exciting improvements to working with two whylogs profiles for easy visual comparisons directly in a notebook. The next steps walk you through running a Summary Drift Report for an overview of statistics about all dataset features and the differences in each attribute that represent the amount of drift between a target and a reference profile. Then, short demos illustrate the chart types that are available with the visualizer to focus on a single numerical or categorical feature across two profiles, and when each chart is most useful.

Monitoring your data against a baseline, and being able to see it, is akin to a crystal ball when you haven’t had that clarity! Data quality can deteriorate once a machine learning model is moved into production, and it’s also critical to monitor the performance of a model to detect drift or a change in the relationships between the model’s input and output data.

One cause of data drift in machine learning that the visualizer can help you uncover is selection bias, which is introduced when a model is trained on data that is not representative of the entire population it’s intended to perform on. The example below illustrates this common scenario by using biased criteria for splitting the dataset into two profiled groups. To understand the different root causes of drift and when each might show up in the pattern of your model data, A Primer on Data Drift is an excellent read.

Initialize Target and Reference Profiles in the profile visualizer

Now, split the dataset in this case study of wine quality into two groups to intentionally create a sample selection bias scenario where the training sample is not representative of the population. Load the wine quality dataset as before. The first group will include wines with alcohol content at or below 11 and will be considered your baseline (or reference) dataset, and the second group will include wines with an alcohol content above 11 as your target dataset.

cond_reference = (wine['alcohol']<=11)
wine_reference = wine.loc[cond_reference]

cond_target = (wine['alcohol']>11)
wine_target = wine.loc[cond_target]

Prepare the two groups, and then pass the whylogs profile for each to the profile visualizer for comparisons between a target and reference profile. The quality feature is a numerical one, representing the wine's quality. Transform it to a categorical feature, where each wine is classified as Good or Bad. Anything above 6.5 is a good wine. Otherwise, it's bad.

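# add some missing values to a feature to see how they
# are reflected in the profile visualizer
# (edit wine_target rather than wine: the split above already copied the rows)
ixs = wine_target.iloc[100:110].index
wine_target.loc[ixs,'citric acid'] = None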

bins = (2, 6.5, 8)
group_names = ['bad', 'good']

wine_reference['quality'] = pd.cut(wine_reference['quality'], bins = bins, labels = group_names)
wine_target['quality'] = pd.cut(wine_target['quality'], bins = bins, labels = group_names)
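To see which scores land in which group, the sketch below reproduces the binning rule in plain Python. It assumes the default behavior of `pd.cut`, where each bin excludes its left edge and includes its right edge, so the bins are (2, 6.5] and (6.5, 8]. The `label_quality` helper is a hypothetical name for illustration, not part of pandas or whylogs.

```python
def label_quality(score, bins=(2, 6.5, 8), labels=("bad", "good")):
    """Return the label of the right-closed bin that contains score, or None."""
    for lower, upper, label in zip(bins[:-1], bins[1:], labels):
        if lower < score <= upper:
            return label
    # A score outside every bin gets no label (pd.cut would give NaN)
    return None


for score in (3, 6, 7, 8):
    print(score, label_quality(score))
```

Quality scores in this dataset are integers, so in practice scores of 6 or lower are labeled ‘bad’ and scores of 7 or 8 are labeled ‘good’.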

Now, profile the dataframes with whylogs, and create profile_views as arguments to feed into the NotebookProfileVisualizer.

import whylogs as why
result = why.log(pandas=wine_target)
prof_view = result.view()

result_ref = why.log(pandas=wine_reference)
prof_view_ref = result_ref.view()

from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=prof_view, reference_profile_view=prof_view_ref)

If this seems like a lot of setting up, then you’re in luck because from here on out all it takes to get a variety of relevant charts and reports are single lines of code.

View a Summary Drift Report of Your Profiles

Compare many features of the target and reference profiles side by side all at once. In a Summary Drift Report, you will see overview statistics, such as the number of observations and missing cells, as well as comparisons of the distributions of each feature. Drift for each specific numerical or categorical feature is calculated for you in the report, and alerts will be displayed related to the drift severity of each feature.

You only need to enter a single line of code to run a Summary Drift Report on the profiles that you already passed to the profile visualizer when you instantiated it in the step above.

visualization.summary_drift_report()

A Summary Drift Report of the wine quality example data profiles showing some of the histograms it contains for individual features, along with the differences, counts, and means for each. Video by the author.
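This post does not spell out how the report scores drift. As an illustration of one common approach for numerical features, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the empirical cumulative distributions of two samples. The `ks_statistic` function is a hypothetical helper written for this example, and whether the report uses this exact test is an assumption.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must contain at least one value")
    largest_gap = 0.0
    # The gap can only change at observed values, so check each one
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Made-up samples: identical, partly overlapping, and fully separated
print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
print(ks_statistic([1, 2, 3], [4, 5, 6]))
```

The statistic runs from 0 for identical samples to 1 for samples that do not overlap at all, so larger values point to a larger shift between target and reference.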

Note that there is a search input box in the upper right corner of the summary drift report, where you can search for a specific data feature, such as ‘quality’, or filter by inferred type. The code gist below shows the code from start to finish to make a summary drift report.

Compare Wine Quality in a Summary Drift Report

# Load the data
import pandas as pd
pd.options.mode.chained_assignment = None  # Disabling false positive warning

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine = pd.read_csv(url,sep=";")

# Split the wines into two groups: 
# with alcohol at or below 11 as the reference sample
cond_reference = (wine['alcohol']<=11)
wine_reference = wine.loc[cond_reference]
# with alcohol above 11 as the target dataset
cond_target = (wine['alcohol']>11)
wine_target = wine.loc[cond_target]

# Add some missing values to `citric acid`, to see how this is reflected in profile visualizer
# (edit wine_target rather than wine: the split above already copied the rows)
ixs = wine_target.iloc[100:110].index
wine_target.loc[ixs,'citric acid'] = None

# Transform the numeric `quality` feature to a categorical feature, 
# where each wine is classified as Good (above 6.5) or Bad
bins = (2, 6.5, 8)
group_names = ['bad', 'good']

wine_reference['quality'] = pd.cut(wine_reference['quality'], bins = bins, labels = group_names)
wine_target['quality'] = pd.cut(wine_target['quality'], bins = bins, labels = group_names)

# Profile the dataframes with whylogs
import whylogs as why
result = why.log(pandas=wine_target)
# Create profile_views as arguments to feed into the NotebookProfileVisualizer
prof_view = result.view()
# Repeat for the reference profile
result_ref = why.log(pandas=wine_reference)
prof_view_ref = result_ref.view()

# Instantiate NotebookProfileVisualizer
from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
# Set the target and reference profiles
visualization.set_profiles(target_profile_view=prof_view, reference_profile_view=prof_view_ref)

# Run a summary_drift_report
visualization.summary_drift_report()

Overlay Profiles in a Double Histogram for Numerical Features

Histograms give an approximate picture of a feature’s distribution and frequency, so differences in a single numerical feature show up quickly at a high level. When working with a critical integer or float data feature, the Double Histogram in the profile visualizer is the chart where you can catch unanticipated variability early, for example while monitoring a machine learning model in production. You can then use that knowledge to reassess your data collection and selection choices all the way back through your pipeline (or as far back as necessary until you find the root cause).

Histograms of your current Target and Reference profiles are displayed in a single graph figure with pre-set color encodings and opacity levels that effectively distinguish the two profiles and result in a blended tone where the data layers overlap. Bin sizes for the histograms were assigned earlier when you initialized the profile visualizer. That dynamic binning ensures a common scale, interval ranges, and the fidelity of the comparisons you make in interpreting the charts.
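To see why shared bins matter, the sketch below counts two samples against a single set of equal-width bin edges derived from their combined range. It is only an illustration of the idea: the post does not describe how the visualizer chooses its bins, and `shared_histograms` is a hypothetical helper.

```python
def shared_histograms(target, reference, n_bins=5):
    """Count two samples against one common set of equal-width bins."""
    lowest = min(min(target), min(reference))
    highest = max(max(target), max(reference))
    # Fall back to a width of 1.0 if every value is identical
    width = (highest - lowest) / n_bins or 1.0
    edges = [lowest + i * width for i in range(n_bins + 1)]

    def count(values):
        counts = [0] * n_bins
        for value in values:
            # The maximum value belongs to the last bin, not a new one
            index = min(int((value - lowest) / width), n_bins - 1)
            counts[index] += 1
        return counts

    return edges, count(target), count(reference)


# Made-up samples that barely overlap
edges, target_counts, reference_counts = shared_histograms(
    [0, 1, 2, 3, 4], [6, 7, 8, 9, 10]
)
print(edges)
print(target_counts)
print(reference_counts)
```

Because both lists of counts refer to the same edges, a bar in one histogram can be compared directly with the bar at the same position in the other, which is what makes the overlaid chart trustworthy.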

Focus on the `density` feature from both data profiles in this example. From this succinct chart, you can assess the similarity in `density` in two profiles, see in what range of values the two datasets intersect, and gauge by how much the mean values differ at a glance. With a single line of code again, a double histogram can be made with your reference data overlaid on your target data.

visualization.double_histogram(feature_name="density")
Histograms show alignment — or differences — among distribution, frequency, mean, and outliers of two profiles all at once. Image by the original author.

Immediately, a large shift emerges between our Target and Reference data for `density`. The change can be seen in a few ways, although the blue Target bars and the orange Reference bars both still skew somewhat to the right.

  • The distribution of the values on the x-axis has shrunk overall, from Target sprawling nearly the span of the axis to Reference clustering closer to its mean, with fewer values more than 1–2 standard deviations away, if you were to calculate that.
  • Frequency is now more normalized in the Reference, whereas more randomness appeared in the Target.
  • Nearly 100 more data points fall near the mean of the Reference than near the Target’s mean.

In our example, this is not surprising because the two profiles were intentionally split to illustrate selection bias, with low-alcohol wines in one data profile and high-alcohol wines in the other. Quite clearly, a model trained on data selected this way would perform poorly on wines outside the alcohol range it was trained on, and it would need to be retrained.

Imagine a real-world scenario in which a model is in production. If the Target in this example is a profile of input data logged when the model was first deployed into production, and the Reference is a profile that was logged at the same input point but at a later time, then your expectation with observability is that the two profiles, and any others logged at that point, will look the same or show only an acceptable degree of difference. In other words, a shift this noticeable indicates a likely issue with the data being fed into your model and, as a result, with its output as well.

See Distribution of Categorical Features

In a distribution chart, the differences in categorical data elements between your two profiles will be clear. Using this chart based on the ‘quality’ feature you created during the data preparation steps, you will see how much of each wine data profile is ‘Good’ and how much is ‘Bad’. You will remember that you categorized wines above the numerical value of 6.5 as ‘Good’.

As with the examples above, a single line of code makes a Distribution Chart for the `quality` feature.

visualization.distribution_chart(feature_name="quality")
Easily compare the `quality` feature from the two data profiles side-by-side. Image by the original author.

The first thing you notice will likely be the large difference in the number of ‘Bad’ wines between the target and reference profiles, with the reference showing many more poor-quality wines. You will recall that the reference profile is the group of wines with alcohol content at or below 11. From the looks of it, this has indeed resulted in a case of sample selection bias as intended for this example, and we see that the bias has resulted in drift between the two profiles for the feature ‘quality’, which was based on the raw numerical quality measures in the original dataset.

See a Differentiated View of the Distribution of a Categorical Feature

Another way to view differences in the distribution of categorical features even faster is to look only at the difference in that feature between the two profiles plotted in a single bar chart. As before, a single line of code produces the chart on your profiles.

visualization.difference_distribution_chart(feature_name="quality")
It’s clear that although the two profiles have nearly the same distribution of ‘Good’ quality wines, there’s a much bigger difference with ‘Bad’ wines. Image by the original author.

With this chart type, your eyes do the least bit of tracking to compare bars back and forth, getting you to the same conclusion as the standard distribution chart but with even less cognitive work. One chart also serves to reinforce your findings from the other, and both illustrate this type of bias you may encounter in your work.
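The arithmetic behind a chart like this is simple enough to sketch in plain Python: count each category in both groups and subtract. The helper below is hypothetical, the labels are made up, and whether the visualizer plots signed or absolute differences is an assumption not confirmed by this post. The sketch returns target minus reference.

```python
from collections import Counter


def category_differences(target, reference):
    """Return target count minus reference count for every category seen."""
    target_counts = Counter(target)
    reference_counts = Counter(reference)
    categories = sorted(set(target_counts) | set(reference_counts))
    # Counter returns 0 for a category that is missing from one group
    return {c: target_counts[c] - reference_counts[c] for c in categories}


# Made-up labels: same number of 'good' wines, more 'bad' ones in the reference
target_labels = ["good"] * 3 + ["bad"] * 5
reference_labels = ["good"] * 3 + ["bad"] * 9
print(category_differences(target_labels, reference_labels))
```

A value near zero means the two groups agree on that category, while a large positive or negative value marks the category where they diverge.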

You can also easily share this chart or any of these visual reports and charts from the profile visualizer with anyone in your organization by downloading and sending them as HTML files.

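# write the Difference Distribution bar chart of the quality feature to file
# in an output subdirectory
import os

output_dir = os.path.join(os.getcwd(), "output")
os.makedirs(output_dir, exist_ok=True)  # create the subdirectory if it is missing
visualization.write(
    # the chart compares both profiles, so no single profile is selected here
    rendered_html=visualization.difference_distribution_chart(feature_name="quality"),
    html_file_name=os.path.join(output_dir, "diff_dist_quality_example"),
)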

Wrap-up

Many other helpful code examples for learning different features of whylogs are available alongside the profile visualizer examples in the whylogs GitHub repository, such as examples of using constraints with the visualizer to monitor quality even more proactively.

The profile visualizer uses visual clues from your data to detect dataset drift and quality issues. It brings visual analysis into the open-source whylogs library in Python and provides evidence of how well a dataset or machine learning model stands up to the scenarios it was designed for. With it, you will not be caught off-guard when your data needs debugging or a full reboot.

Try the open-source whylogs profile visualizer with your own data!

Whether you’re a Data Engineer, Data Scientist, Machine Learning Engineer, or wear all the hats on your team, you can bring the health of your data and models into focus faster with the profile visualizer’s graphs, and limit interruption to your workflow. The profile visualizer gives you back time and brainpower to spend putting your data and models to work scaling your operations, minimizing risk, and delighting your customers.

Acknowledgments

Use of the Wine Quality Data Set from the UCI Machine Learning Repository in this learning resource is much appreciated.

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547–553, 2009.

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
