5 Ways to Inspect Data & Models with whylogs Profile Visualizer
Jul 12, 2022
Spot Data Drift and Ensure Machine Learning Model Quality
The open-source whylogs profile visualizer enables visual observability of data in machine learning and data engineering with Python. Adding visualization to data observability distills and boosts the signals captured in whylogs data logging profiles, helping you understand what’s happening in your data, identify and correct issues quickly, and maintain the quality and relevance of high-performing data and machine learning models. The profile visualizer runs right in a Jupyter Notebook alongside your logging code and extends the interpretive capabilities of the open-source whylogs library, which is quickly becoming the standard for data logging and monitoring of ML models and data pipelines.
As a member of the Robust & Responsible AI Community on Slack, where I am, as usual, pretty vocal about my passion for visual analytics, I was introduced to the profile visualizer by members of the WhyLabs team while it was being developed. This further sparked my curiosity about how visualization helps close a gap that exists in data observability once a dataset or model is in production: watching how it’s behaving out in the world.
Felipe Adachi wrote the example code notebook that has been the basis for me to learn to use the profile visualizer. I’ve highlighted excerpts from it in this post and extended it with additional examples you can try from my own notebook. Both complement this post step by step. Felipe and Danny Leybzon, as maintainers of whylogs and the profile visualizer, generously answered my questions and reviewed this article for accuracy against the developing v1 code base that was just released.
Learn to effectively use the profile visualizer
- to examine data logged at key input and output junctures in your pipeline
- to spot differences in the distributions of two data profiles at a glance
- to quickly pinpoint data drift in your machine learning operations
Review of Logging and Statistical Profiles
Before diving into the profile visualizer, it helps to briefly appreciate the value of data logging and how whylogs profiles work. Data pipelines today, with the integration of machine learning and artificial intelligence, flow with unsurpassed acceleration and at a scale that can make keeping an eye on your data difficult. For the same reasons, monitoring your data is more important than ever to maintain its integrity.
The first step is capturing logs close to your data workflow and at the points where data enters or exits the pipeline or goes through major transformations that could introduce or highlight issues, such as missing values or changes in data types upstream that have not been accounted for. In order to use the profile visualizer, this means logging data profiles with Python.
The profile visualizer can be used with any kind of data you capture in whylogs profiles to detect different drift-causing scenarios and other data quality and model performance issues. That includes structured or unstructured data, and batch or streaming data. The whylogs profiles are lightweight summaries of your data that are well-calibrated to the scope and degree of approximation appropriate for data monitoring, no more and no less than necessary. Of particular relevance to monitoring and visualization are the lightweight nature of whylogs profiles and their greater sensitivity to outliers as compared to traditional sampling methods. Rare events that could indicate a quality issue are easy to spot.
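To make the point about rare events concrete, here is a minimal plain-Python sketch; it is not whylogs itself. It assumes a made-up stream of 10,000 readings with one injected outlier, and compares a 1% systematic sample against a tiny running summary. The names `summarize` and `stream` are hypothetical, and real whylogs profiles track far richer statistics than these four numbers.

```python
def summarize(values):
    """Build a tiny aggregate summary in a single pass (a stand-in for a profile)."""
    count, total = 0, 0.0
    lowest, highest = float("inf"), float("-inf")
    for v in values:
        count += 1
        total += v
        lowest = min(lowest, v)
        highest = max(highest, v)
    return {"count": count, "min": lowest, "max": highest, "mean": total / count}


# Hypothetical data: 10,000 ordinary readings between 10.0 and 10.9 ...
stream = [10.0 + (i % 10) / 10 for i in range(10_000)]
# ... plus one rare event that a monitor should not miss.
stream[4242] = 500.0

sample = stream[::100]        # keep every 100th value (a 1% systematic sample)
summary = summarize(stream)   # four numbers, regardless of stream length

print("outlier in sample:  ", 500.0 in sample)
print("max seen by summary:", summary["max"])
```

The sample never sees the rare value because index 4242 is not one of the sampled positions, while the single-pass summary records it as the maximum.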
Logging data in more detail than necessary, however, could easily become a resource drag on your team and infrastructure and a new form of technical debt. Choosing to log these aggregate statistical profiles instead of storing even small samples of detailed data could also spare you the risk of hefty legal fines under heightened regulations around data ownership and privacy.
How to Visually Examine Statistics of a Single Profile
First, see how any feature of a single whylogs data profile looks with turnkey feature statistics. Performing this kind of univariate analysis, or analysis of one feature, is a quick way to describe your data during exploratory analysis, anticipate patterns, and understand key aspects like cardinality, central tendency, dispersion, variability, and the occurrence of missing values.
A Jupyter Notebook containing full code for this and other examples in this post can be opened in Colab, or downloaded or forked along with other examples in the same GitHub repository. I refer to data features throughout this post because I approach the visualizer more from data science than engineering. Whether you call these data variables, features, or columns of data, you can use these charts and reports to get eyes on your data and get ahead of issues.
Prepare a Notebook for profile visualizer
First, install the recently released whylogs v1 package along with the `viz` extra to use the profile visualizer. I use a macOS machine running zsh as the default shell in its Terminal, so running the first code cell below does the job for me. If you have previously installed whylogs, be sure to include the `--upgrade` option because the profile visualizer is available only since the recent whylogs v1 release. You can install it directly in your notebook by prepending the terminal/shell command with an `!` exclamation point.
    # in a notebook, if you are using zsh,
    # escape the square brackets
    !pip install -q --upgrade whylogs\[viz\]
    #
    # or directly in a terminal/shell window with zsh
    # pip install --upgrade whylogs\[viz\]
    #
    # or in a notebook with bash
    # !pip install -q --upgrade whylogs[viz]
    #
    # or directly in a terminal/shell window with bash
    # pip install --upgrade whylogs[viz]
This code gist also shows the install command for a few different scenarios. Your operating system and where you choose to run the install command affect which syntax will work for you. With the Z shell (zsh), it’s necessary to escape the square brackets around the `viz` extra to prevent them from being interpreted as a pattern. Install packages directly in your Jupyter Notebook by prepending the command with an exclamation point, in which case try adding the `-q` quiet option if you don’t want to see all of the automatic output from the installation process.
Load Data for the Examples
An example of comparing two profile views from a well-known machine learning dataset is where things get really interesting with profile visualizer! Load the data then print a concise summary of the pandas dataframe.
    import pandas as pd
    pd.options.mode.chained_assignment = None  # Disabling false positive warning
    url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
    wine = pd.read_csv(url, sep=";")
    wine.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1599 entries, 0 to 1598
    Data columns (total 12 columns):
     #   Column                Non-Null Count  Dtype
    ---  ------                --------------  -----
     0   fixed acidity         1599 non-null   float64
     1   volatile acidity      1599 non-null   float64
     2   citric acid           1599 non-null   float64
     3   residual sugar        1599 non-null   float64
     4   chlorides             1599 non-null   float64
     5   free sulfur dioxide   1599 non-null   float64
     6   total sulfur dioxide  1599 non-null   float64
     7   density               1599 non-null   float64
     8   pH                    1599 non-null   float64
     9   sulphates             1599 non-null   float64
     10  alcohol               1599 non-null   float64
     11  quality               1599 non-null   int64
    dtypes: float64(11), int64(1)
    memory usage: 150.0 KB
Notice that the quality feature contains integer values. You will transform that into a categorical data type during this demo to make quick charts to see your data through a lens of ‘good’ or ‘bad’.
Initialize the profile visualizer
After you have imported the Python libraries you need for this example, load the wine quality dataset, log your dataframe in a whylogs profile, and generate a profile view to use with the profile visualizer. Then, instantiate `NotebookProfileVisualizer` and set your target profile. Most of the code featured in this post is also available in a single Jupyter Notebook. Select the Open in Colab button there or download the notebook directly.
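The setup steps just described can be sketched as one small helper. This is a minimal sketch that reuses only the whylogs calls appearing in the full listing later in this post; the function name `build_target_visualizer` is hypothetical, and the imports sit inside the function so that it can be defined before `whylogs[viz]` is installed.

```python
def build_target_visualizer(df):
    """Profile a pandas DataFrame and return a visualizer with it set as the target."""
    # Imported lazily: these require `pip install whylogs[viz]`.
    import whylogs as why
    from whylogs.viz import NotebookProfileVisualizer

    # Log the dataframe into a profile, then take a profile view of it.
    profile_view = why.log(pandas=df).view()

    # Hand the view to the visualizer as the target profile (no reference yet).
    visualization = NotebookProfileVisualizer()
    visualization.set_profiles(target_profile_view=profile_view)
    return visualization
```

In a notebook where `wine` is the dataframe loaded above, `visualization = build_target_visualizer(wine)` would give you the object used in the next step.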
Once that’s done, a single line of code gets you a variety of useful summary statistics, as shown in this example based on the citric acid feature.
visualization.feature_statistics(feature_name="citric acid", profile="target")
Feature statistics are organized with a top-level band of the most basic but still highly informative data aggregations. The percentage of values that are distinct can indicate a degree of consistency (high %) or inconsistency (low %). Here, you see that only 18% of citric acid values in the Target profile are distinct.
Quantile statistics speak to how your data are distributed across the range of values that occur. The Interquartile range (IQR) of citric acid in this portion of the UCI Wine dataset is 0.3, which is low relative to 1.0. In your mental image of the data, the middle 50% of your values should then be clustered fairly close to each other around the median. Conversely, a higher IQR such as 0.8 would show values spread more widely across the range.
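As a quick illustration of what the interquartile range measures, here is a minimal sketch using only Python's standard library. The nine values are made up for the example, not read from the wine dataset, and whylogs estimates its quantiles from the profile rather than from raw values like this.

```python
import statistics

# Hypothetical citric-acid-like readings on a 0.0 to 1.0 scale.
values = [0.00, 0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.50, 0.75]

# Three cut points split the data into quarters; "inclusive" treats
# the list as the whole population rather than a sample of it.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(f"Q1={q1:.2f}  median={median:.2f}  Q3={q3:.2f}  IQR={iqr:.2f}")
```

With these values the middle half of the data spans 0.10 to 0.40, so the IQR is 0.30 even though the full range runs from 0.00 to 0.75.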
Among the Descriptive statistics, the Coefficient of variation (CV) is the Standard deviation divided by the Mean. As a ratio, it’s independent of the unit scale of its feature and can be compared to the CV of another feature in the data as long as both features have continuous numerical scales that start at zero. For example, a 0.72 or 72% CV of citric acid indicates widely dispersed values. If you compare that to another feature’s CV of 0.30, for example, you learn that citric acid is more widely dispersed than the other feature. However, the CV equation is sensitive to small differences in the Mean when it nears zero, and in those cases, the Standard deviation itself is more informative, so it’s important to consider all of these statistics in context.
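The behavior of the coefficient of variation, including its sensitivity near a zero mean, can be sketched with the standard library. All three lists are hypothetical, the helper name `coefficient_of_variation` is made up for this example, and using the sample standard deviation is an assumption rather than a statement about how whylogs computes it.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean


# Hypothetical features on scales that start at zero.
widely_spread = [0.0, 0.1, 0.25, 0.4, 0.6]      # mean 0.27
tightly_packed = [9.0, 9.5, 10.0, 10.5, 11.0]   # mean 10.0
near_zero_mean = [0.0, 0.0, 0.0, 0.0, 0.05]     # mean 0.01

for name, data in [
    ("widely_spread", widely_spread),
    ("tightly_packed", tightly_packed),
    ("near_zero_mean", near_zero_mean),
]:
    print(f"{name}: stdev={statistics.stdev(data):.3f}  CV={coefficient_of_variation(data):.2f}")
```

The last list has the smallest standard deviation of the three yet the largest CV, which is the near-zero-mean effect described above.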
How to Compare Two Data Profiles
With the profile visualizer come exciting improvements to working with two whylogs profiles for easy visual comparisons directly in a notebook. The next steps walk you through running a Summary Drift Report for an overview of statistics about all dataset features and the differences in each attribute that represent the amount of drift between a target and a reference profile. Then, short demos illustrate the chart types that are available with the visualizer to focus on a single numerical or categorical feature across two profiles, and when each chart is most useful.
Monitoring your data against a baseline target, and being able to see it, is akin to a crystal ball when you haven’t had that clarity! Data quality can deteriorate once a machine learning model is moved into production, and it’s also critical to monitor the performance of a model to detect drift or a change in the relationships between the model’s input and output data.
One cause of data drift in machine learning that the visualizer can help you uncover is selection bias, which is introduced when a model is trained on data that is not representative of the entire population it’s intended to perform on. The example below illustrates this common scenario by using biased criteria for splitting the dataset into two profiled groups. To understand the different root causes of drift, and when each might show up in the pattern of your model data, A Primer on Data Drift is an excellent read.
Initialize Target and Reference Profiles in the profile visualizer
Now, split the dataset in this case study of wine quality into two groups to intentionally create a sample selection bias scenario where the training sample is not representative of the population. Load the wine quality dataset as before. The first group will include wines with alcohol content at or below 11 and will be considered your baseline (or reference) dataset, and the second group will include wines with an alcohol content above 11 as your target dataset.
Prepare the two groups, and then pass the whylogs profile for each to the profile visualizer for comparisons between a target and reference profile. The `quality` feature is a numerical one, representing the wine's quality. Transform it to a categorical feature, where each wine is classified as Good or Bad. Anything above 6.5 is a good wine. Otherwise, it's bad.
Now, profile the dataframes with whylogs, and create `profile_views` as arguments to feed into the `NotebookProfileVisualizer`.
If this seems like a lot of setting up, then you’re in luck because from here on out all it takes to get each of a variety of relevant charts and reports is a single line of code.
View a Summary Drift Report of Your Profiles
Compare many features of the target and reference profiles side by side all at once. In a Summary Drift Report, you will see overview statistics, such as the number of observations and missing cells, as well as comparisons of the distributions of each feature. Drift for each specific numerical or categorical feature is calculated for you in the report, and alerts will be displayed related to the drift severity of each feature.
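To give a feel for how drift in a numerical feature can be reduced to a single number, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic in plain Python. This post does not say which calculation the report uses, so treat the choice of statistic as an illustrative assumption; the function name and the alcohol-like values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    gaps = []
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)  # share of sample_a at or below x
        cdf_b = bisect_right(b, x) / len(b)  # share of sample_b at or below x
        gaps.append(abs(cdf_a - cdf_b))
    return max(gaps)


# Hypothetical values that mimic the biased split used in this post.
reference_alcohol = [9.4, 9.8, 10.0, 10.5, 11.0]
target_alcohol = [11.2, 11.5, 12.0, 12.8]

print(ks_statistic(reference_alcohol, reference_alcohol))  # → 0.0 (identical samples)
print(ks_statistic(reference_alcohol, target_alcohol))     # → 1.0 (no overlap at all)
```

A value near 0 means the two distributions look alike, and a value near 1 means they barely overlap, which is what a split on the feature itself produces.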
You only need to enter a single line of code to run a Summary Drift Report on the profiles that you already passed to the profile visualizer when you instantiated it in the step above.
A Summary Drift Report of the wine quality example data profiles showing some of the histograms it contains for individual features, along with the differences, counts, and means for each. Video by the author.
Note that there is a search input box in the upper right corner of the summary drift report, where you can also search for a specific data feature, such as ‘quality’ or filter by inferred type. The code gist below shows the code from start to finish to make a summary drift report.
Compare Wine Quality in a Summary Drift Report
    # Load the data
    import pandas as pd
    pd.options.mode.chained_assignment = None  # Disabling false positive warning
    url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
    wine = pd.read_csv(url, sep=";")

    # Split the wines into two groups:
    # alcohol at or below 11 is the reference sample
    cond_reference = (wine['alcohol'] <= 11)
    wine_reference = wine.loc[cond_reference].copy()
    # alcohol above 11 is the target dataset
    cond_target = (wine['alcohol'] > 11)
    wine_target = wine.loc[cond_target].copy()

    # Add some missing values to `citric acid` in the target group,
    # to see how this is reflected in the profile visualizer.
    # (This must be done on the split dataframe that gets profiled;
    # editing `wine` after the split would not reach the profiles.)
    ixs = wine_target.iloc[100:110].index
    wine_target.loc[ixs, 'citric acid'] = None

    # Transform the numeric `quality` feature to a categorical feature,
    # where each wine is classified as Good (above 6.5) or Bad
    bins = (2, 6.5, 8)
    group_names = ['bad', 'good']
    wine_reference['quality'] = pd.cut(wine_reference['quality'], bins=bins, labels=group_names)
    wine_target['quality'] = pd.cut(wine_target['quality'], bins=bins, labels=group_names)

    # Profile the dataframes with whylogs
    import whylogs as why
    result = why.log(pandas=wine_target)
    # Create profile_views as arguments to feed into the NotebookProfileVisualizer
    prof_view = result.view()
    # Repeat for the reference profile
    result_ref = why.log(pandas=wine_reference)
    prof_view_ref = result_ref.view()

    # Instantiate NotebookProfileVisualizer
    from whylogs.viz import NotebookProfileVisualizer
    visualization = NotebookProfileVisualizer()
    # Set the target and reference profiles
    visualization.set_profiles(target_profile_view=prof_view, reference_profile_view=prof_view_ref)

    # Run a summary_drift_report
    visualization.summary_drift_report()
Overlay Profiles in a Double Histogram for Numerical Features
Differences in a single numerical feature can be seen at a high level quickly using histograms, which provide an approximate picture of data distribution and frequency. When working with a critical integer or float data feature, the Double Histogram in the profile visualizer is the chart where you could pre-emptively catch unanticipated variability, for example, while monitoring a machine learning model in production, and use that knowledge to reassess your data collection and selection choices all the way back through your pipeline (or as far back as necessary until you find the root cause).
Histograms of your current Target and Reference profiles are displayed in a single graph figure with pre-set color encodings and opacity levels that effectively distinguish the two profiles and result in a blended tone where the data layers overlap. Bin sizes for the histograms were assigned earlier when you initialized the profile visualizer. That dynamic binning ensures a common scale, interval ranges, and the fidelity of the comparisons you make in interpreting the charts.
Focus on the `density` feature from both data profiles in this example. From this succinct chart, you can assess the similarity in `density` in two profiles, see in what range of values the two datasets intersect, and gauge by how much the mean values differ at a glance. With a single line of code again, a double histogram can be made with your reference data overlaid on your target data.
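Why a common set of bins matters for this comparison can be sketched in plain Python. This is an illustration of the idea only, not how the visualizer bins its histograms; `shared_bin_counts` and the density-like values are hypothetical. The closing comment shows the visualizer call as I understand the whylogs v1 API, which is an assumption because the original one-liner is not reproduced in this post.

```python
def shared_bin_counts(target, reference, n_bins=5):
    """Count both samples over one common set of equal-width bins."""
    low = min(min(target), min(reference))
    high = max(max(target), max(reference))
    width = (high - low) / n_bins
    if width == 0:
        width = 1.0  # every value is identical; put them all in the first bin

    def count(values):
        counts = [0] * n_bins
        for v in values:
            # The top edge of the range joins the last bin.
            index = min(int((v - low) / width), n_bins - 1)
            counts[index] += 1
        return counts

    edges = [low + i * width for i in range(n_bins + 1)]
    return edges, count(target), count(reference)


# Hypothetical density-like values, not taken from the wine dataset.
target_density = [0.9900, 0.9915, 0.9930, 0.9945, 0.9960, 0.9962]
reference_density = [0.9958, 0.9965, 0.9970, 0.9975, 0.9980, 1.0000]

edges, target_counts, reference_counts = shared_bin_counts(target_density, reference_density)
for left, t, r in zip(edges, target_counts, reference_counts):
    print(f"bin starting at {left:.4f}: target={t} reference={r}")

# In the notebook, the chart itself would come from the visualizer, e.g.:
# visualization.double_histogram(feature_name="density")
```

Because both samples are counted against the same edges, the two rows of counts can be overlaid bar for bar, which is what makes the overlap region readable.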
Immediately, a large shift emerges between our Target and Reference data for `density`. The change can be seen in a few ways, although the blue Target bars and the orange Reference bars both still skew somewhat to the right.
- The distribution of the values on the x-axis has shrunk overall, from Target sprawling nearly the span of the axis to Reference clustering closer to its mean, with fewer values falling more than 1–2 standard deviations away, if you were to calculate that.
- Frequency is now more normalized in the Reference; whereas more randomness appeared in the Target.
- Nearly 100 more data points fall in the mean of the Reference than did in the Target’s mean.
In our example, this is not surprising because the two profiles were intentionally split to illustrate a selection bias, by putting wines with low alcohol content in one data profile and wines with high alcohol in the other. Quite clearly, that decision with training data selection would result in poor performance if such a model met with wines outside the alcohol range it was trained on and would need to be re-trained.
Imagine a real-world scenario when a model is in production. If the Target in this example is a profile of input data logged when the model was first deployed into production, and the Reference is a profile that was logged at the same input point but at a later time, then your expectation with observability is that the two profiles, and any others logged at that point, will look the same or show only an acceptable degree of difference. In other words, a shift this noticeable indicates a likelihood of an issue with the data being fed into your model, and as a result any resulting output as well.
See Distribution of Categorical Features
In a distribution chart, the differences in categorical data elements between your two profiles will be clear. Using this chart based on the ‘quality’ feature you created during the data preparation steps, you will see how much of each wine data profile is ‘Good’ and how much is ‘Bad’. You will remember that you categorized wines above the numerical value of 6.5 as ‘Good’.
As with the examples above, a single line of code makes a Distribution Chart for the `quality` feature.
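What the distribution chart compares can be sketched with category counts in plain Python. The labels below are made up rather than taken from the wine profiles, and the visualizer call in the closing comment reflects my understanding of the whylogs v1 API, which is an assumption since the original one-liner is not reproduced here.

```python
from collections import Counter

# Hypothetical labels, not the real wine data.
target_quality = ["good", "bad", "good", "bad", "bad", "good"]
reference_quality = ["bad", "bad", "bad", "bad", "good", "bad", "bad", "bad"]

target_counts = Counter(target_quality)
reference_counts = Counter(reference_quality)

# One row per category, with the two profiles side by side.
for label in sorted(set(target_counts) | set(reference_counts)):
    print(f"{label}: target={target_counts[label]} reference={reference_counts[label]}")

# In the notebook, the chart itself would come from the visualizer, e.g.:
# visualization.distribution_chart(feature_name="quality")
```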
The first thing you notice will likely be the large difference in the distribution of ‘Bad’ wines between the target and reference profiles, with the reference showing many more poor quality wines. You will recall that you grouped wines with alcohol content at or below 11 into the reference profile. From the looks of it, this has indeed resulted in a case of sample selection bias, as intended for this example, and we see that the bias has produced drift between the two profiles for the feature ‘quality’, which was based on the raw numerical quality measures in the original dataset.
See a Differentiated View of the Distribution of a Categorical Feature
Another way to view differences in the distribution of categorical features even faster is to look only at the difference in that feature between the two profiles plotted in a single bar chart. As before, a single line of code produces the chart on your profiles.
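The idea behind a difference view can be sketched as a per-category subtraction. The helper `count_differences` and its labels are hypothetical, the direction of the subtraction (target minus reference) is an assumption, and the visualizer call in the closing comment reflects my understanding of the whylogs v1 API rather than code reproduced from this post.

```python
from collections import Counter


def count_differences(target_labels, reference_labels):
    """Return target count minus reference count for every category seen."""
    target_counts = Counter(target_labels)
    reference_counts = Counter(reference_labels)
    categories = sorted(set(target_counts) | set(reference_counts))
    return {c: target_counts[c] - reference_counts[c] for c in categories}


# Hypothetical labels, not the real wine data.
target_quality = ["good", "bad", "good", "bad", "bad", "good"]
reference_quality = ["bad", "bad", "bad", "bad", "good", "bad", "bad", "bad"]

print(count_differences(target_quality, reference_quality))

# In the notebook, the chart itself would come from the visualizer, e.g.:
# visualization.difference_distribution_chart(feature_name="quality")
```

A single signed number per category replaces two bars, so there is nothing to compare back and forth.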
With this chart type, your eyes do the least bit of tracking to compare bars back and forth, getting you to the same conclusion as the standard distribution chart but with even less cognitive work. One chart also serves to reinforce your findings from the other, and both illustrate this type of bias you may encounter in your work.
You can also easily share this chart or any of these visual reports and charts from the profile visualizer with anyone in your organization by downloading and sending them as HTML files.
Many other helpful code examples for learning different features of whylogs are available alongside the profile visualizer examples in the whylogs GitHub repository, such as using constraints alongside the visualizer to monitor quality even more proactively.
The profile visualizer uses visual clues from your data to detect dataset drift and quality issues. It brings visual analysis into the open-source whylogs library in Python and provides evidence of how well a dataset or machine learning model stands up to the scenarios it was designed for. With it, you will not be caught off-guard when your data needs debugging or a full reboot.
Try the open-source whylogs profile visualizer with your own data!
Whether you’re a Data Engineer, Data Scientist, Machine Learning Engineer, or wear all the hats on your team, you can bring the health of your data and models into focus faster with the profile visualizer’s graphs, and limit interruption to your workflow. The profile visualizer gives you back time and brainpower to spend putting your data and models to work scaling your operations, minimizing risk, and delighting your customers.
Use of the Wine Quality Data Set from the UCI Machine Learning Repository in this learning resource is much appreciated.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547–553, 2009.
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
“5 Ways to Inspect Data & Models with whylogs profile visualizer” was originally published in Towards Data Science by Kathryn Hurchla.