Data Logging: Sampling versus Profiling
Isaac Backus, Bernease Herman
Oct 29, 2020
- Data Science
- Artificial Intelligence
- Machine Learning
by Isaac Backus and Bernease Herman
In traditional software, logging and instrumentation have been adopted as standard practice to create transparency and to make sense of the health of a complex system. When it comes to AI applications, the lack of tools and standardized approaches means that logging is often spotty and incomplete. Here, I compare two approaches to data logging: sampling and profiling.
I have two goals in this post. First, I will demonstrate that profiling is superior to sampling. Profiling provides a lightweight, robust approach to characterizing distributions for all types of data encountered in ML. Next, I want to convince every data scientist to give data logging a shot. To that end, I present whylogs: an open source library developed by the team here at WhyLabs. The whylogs approach is suitable for any ML framework and enables scalable, statistical data logging and profiling in only a few lines of code.
In a post on Towards Data Science I argued that data logging is required for robust, mature ML/AI applications. I also outlined five requirements for logging software. Whylogs, particularly in combination with the WhyLabs AI Observability Platform, aims at hitting those targets. The requirements are:
- Ease of use
- Standardized and portable
- Close to the code
There are two main approaches to data logging: profiling and sampling. Let’s see why profiling (as implemented by whylogs) beats out sampling.
Data sampling is a basic method for trying to monitor data in production environments. The idea is simple: randomly or programmatically select samples of data from a larger data stream and store them for later analysis. Implementing sampling typically requires no special software and can be achieved with little extra up-front design.
However, there are some drawbacks to sampling which profiling attempts to address. For example, sampling can involve large I/O and storage costs. It tends to be noisy, and even though implementation is straightforward, consuming samples generally still requires statistical analysis specific to the dataset. The output data format depends on the input data, making it difficult to create generic tooling for consuming the events.
Sampling also frequently misses rare events and outliers, so distribution metrics such as min/max or the number of unique values cannot be accurately estimated from the sample. These metrics are especially important for logging, since outliers and rare events are often correlated with data issues.
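To make the sampling approach and its blind spot concrete, here is a minimal sketch. It assumes reservoir sampling, one common way to draw a uniform sample from a stream of unknown length; the stream contents, the sample size, and the function name are illustrative choices, not taken from any particular logging setup.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items from a stream (Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


rng = random.Random(0)

# Hypothetical stream: 100,000 ordinary values plus a single extreme outlier.
stream = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
stream[12_345] = 1_000.0

sample = reservoir_sample(stream, k=100, rng=rng)

true_max = max(stream)    # a single streaming pass can track this exactly
sample_max = max(sample)  # only as large as the largest value that was kept
```

A uniform sample of 100 out of 100,000 records contains any one specific record with probability 0.1%, so the maximum computed from the sample will usually be far below the true maximum.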
In contrast, profiling collects statistical measurements of the data. In the case of whylogs, the metrics produced come with mathematically derived uncertainty bounds. These profiles are scalable, lightweight, flexible, and configurable. Rare events and outlier-dependent metrics can be accurately captured. The results are statistical, directly interpretable, and stored in a standard, portable data format. Whylogs packages this all up and includes multi-language support, ease of use, reliability, and flexibility.
Whylogs implements a number of useful statistics for data profiling. All statistics are collected in a streaming fashion, which requires only a single pass over the data with minimal memory overhead. The resulting profiles are all mergeable, allowing statistics from multiple hosts, data partitions, or datasets to be merged post hoc. The approach is therefore trivially parallelizable and map-reducible, making it highly scalable.
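The idea of a streaming, mergeable profile can be illustrated with a small sketch. This is not the whylogs implementation or its API; `NumericProfile` and `profile` are hypothetical names, and only statistics that can be tracked exactly in one pass are included.

```python
import math
from dataclasses import dataclass


@dataclass
class NumericProfile:
    """Constant-memory summary of a numeric column (illustrative only)."""
    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def track(self, value: float) -> None:
        # Update every statistic from one record; nothing else is stored.
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else math.nan

    def merge(self, other: "NumericProfile") -> "NumericProfile":
        # Merging only needs the two summaries, never the raw data.
        return NumericProfile(
            count=self.count + other.count,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )


def profile(values):
    """Profile an iterable in a single pass."""
    p = NumericProfile()
    for v in values:
        p.track(v)
    return p


data = [float(x) for x in range(1, 1001)]
part_a, part_b = data[:300], data[300:]  # e.g. two hosts or partitions

merged = profile(part_a).merge(profile(part_b))
whole = profile(data)
```

Because `merge` is associative and commutative, partitions can be profiled independently and combined in any order, which is what makes the approach fit a map-reduce pattern.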
Certain statistics can be tracked exactly, such as record count, data type counts, null count, min, max, and mean. Others — such as quantiles, histograms, or cardinality — require approximate statistics.
Profiling vs sampling — experimental results
To compare profiling and sampling, I ran a number of experiments. The results demonstrate profiling’s improved accuracy over sampling, especially regarding outlier-dependent metrics, long-tail distributions, and metrics such as cardinality estimates (number of unique values). Here I present two sets of experiments: the first targets distributional metrics, and the second targets unique value counts.
Experiment 1 — distributional metrics
The first set of experiments was run as follows:
- Select a distribution to choose from (outlined below)
- Randomly sample 10⁵ records
- Sample a subset of n_sample records such that the subset is as many bytes as the profile. This is to compare apples to apples. Accuracy can be improved for both sampling and profiling by increasing the data size.
- Compare with exact values
- Repeat steps 2 through 4 for a total of 24 runs and average the results
- Repeat for every distribution
As can be seen in the figures below, across all distribution types and for every metric, profiles outperform samples. This is especially clear for the long-tail Pareto distribution, which produces “outliers.” Outlier-related metrics cannot be reliably captured by random sampling.
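The procedure above can be sketched for a single metric (the maximum) and a single distribution. This is a simplified stand-in for the actual experiment, not a reproduction of it: the Pareto shape parameter, the byte budget, and the assumption of 8 bytes per record are all hypothetical, and the "profile" side is represented only by the exact maximum that a streaming pass can track.

```python
import random

rng = random.Random(42)

N_RECORDS = 100_000     # 10^5 records per run, as in the experiment
N_RUNS = 24             # number of repetitions, as in the experiment
PROFILE_BYTES = 4_000   # hypothetical profile size
BYTES_PER_RECORD = 8    # assuming one 64-bit float per record
N_SAMPLE = PROFILE_BYTES // BYTES_PER_RECORD  # sample with the same byte budget


def relative_error(estimate, exact):
    return abs(estimate - exact) / abs(exact)


max_errors = []
for _ in range(N_RUNS):
    # Long-tail data: Pareto with an assumed shape parameter of 1.5.
    data = [rng.paretovariate(1.5) for _ in range(N_RECORDS)]
    exact_max = max(data)  # what a single streaming pass records

    sample = rng.sample(data, N_SAMPLE)
    max_errors.append(relative_error(max(sample), exact_max))

# Average relative error of the sample-based maximum across runs.
mean_max_error = sum(max_errors) / N_RUNS
```

Since the sample is a subset of the data, its maximum can never exceed the true maximum, and for a long-tail distribution it will typically fall well short of it.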
Experiment 2 — unique value counts
When it comes to estimating the number of unique values, particularly at high cardinality or in unbalanced datasets where certain categories are rare, profiling significantly outperforms sampling.
This experiment proceeds as follows:
- Select a distribution and a number of unique values (
- Randomly sample 10⁶ records
- Sample a subset of n_sample records such that the subset is as many bytes as the profile.
- Estimate number of unique values (for both methods)
- Repeat steps 2 through 4 for a total of 15 runs and average the results
- Repeat steps 1 through 5 with a new choice of unique value count
- Repeat for all distributions
As can be seen in the figure below, only in the case of a uniform distribution and fairly low cardinality does sampling accurately estimate the number of categories.
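To illustrate why a sketch-based estimate can beat a sample here, the following compares two estimators on an unbalanced dataset. It uses a K-Minimum-Values (KMV) sketch, which is one simple cardinality sketch and not necessarily the one whylogs uses; the sample-based estimator is the naive count of distinct values in the sample, since the estimator used in the experiment is not specified. All sizes and names are illustrative.

```python
import hashlib
import heapq
import random


def hash01(value):
    """Map a value to a pseudo-uniform float between 0 and 1."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(stream, k=256):
    """Estimate the number of distinct values with a KMV sketch."""
    kept = set()  # the (at most k) smallest distinct hashes seen so far
    heap = []     # the same hashes, negated, so heap[0] is the largest kept
    for value in stream:
        h = hash01(value)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes: exact count
    return (k - 1) / (-heap[0])  # standard KMV estimator


rng = random.Random(7)

# Unbalanced data: 20,000 categories, but only 100 of them are common.
TRUE_UNIQUE = 20_000
data = list(range(TRUE_UNIQUE)) + [rng.randrange(100) for _ in range(80_000)]
rng.shuffle(data)

exact_unique = len(set(data))
sample_unique = len(set(rng.sample(data, 2_000)))  # naive sample estimate
sketch_unique = kmv_estimate(data, k=256)          # fixed-size sketch estimate
```

The sample-based count can never exceed the sample size, so it cannot recover a cardinality of 20,000 from 2,000 records, while the sketch stores only 256 hashes and has an expected relative error on the order of 1/sqrt(k).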
Experiment 2 distributions
Beyond their statistical accuracy, another motivation for data logging with profiles is how well they can be used for automated monitoring of ML/AI applications and pipelines. There are a number of reasons that make profiles especially well suited for monitoring. Profiles are:
- Lightweight: There is very little cost in terms of person-hours (implementation), storage, or compute. This encourages broad monitoring across many data sources.
- Standardized: whylogs profiles are a standardized, cross-language, cross-platform format. They provide monitoring targets consistent across customers, platforms, databases, etc.
- Simple: Monitoring on arbitrary data can add an additional complex, fragile data layer to an already complex system. The structured, standardized profiles are much simpler than samples of arbitrary data.
- Interpretable: Profiles produce interpretable statistics and signals, which are necessary for debugging data pipelines and understanding model performance.
- Statistical: Statistical monitoring algorithms are more interpretable and robust than black box ML monitoring. One can include statistical knowledge of the profiles when designing monitoring algorithms.
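As a rough illustration of monitoring built directly on profile statistics, the sketch below flags a shift in the mean between two profiles using a two-sample z statistic. It is not how the WhyLabs Platform monitors data; `ProfileSummary`, the threshold, and the example numbers are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ProfileSummary:
    """The few profile statistics this check needs (illustrative only)."""
    count: int
    mean: float
    stddev: float


def mean_shift_z(baseline, current):
    """Two-sample z statistic for a change in mean, from summaries alone."""
    standard_error = math.sqrt(
        baseline.stddev**2 / baseline.count + current.stddev**2 / current.count
    )
    return (current.mean - baseline.mean) / standard_error


def is_drifted(baseline, current, threshold=4.0):
    # The threshold is an assumed tuning knob, not a recommended value.
    return abs(mean_shift_z(baseline, current)) > threshold


baseline = ProfileSummary(count=10_000, mean=50.0, stddev=5.0)
normal_day = ProfileSummary(count=10_000, mean=50.1, stddev=5.0)
bad_day = ProfileSummary(count=10_000, mean=52.0, stddev=5.0)

normal_flag = is_drifted(baseline, normal_day)
bad_flag = is_drifted(baseline, bad_day)
```

The check never touches raw records, and the reason for an alert can be read off directly: which statistic moved, and by how many standard errors.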
To see what monitoring and data logging can look like in production, check out the live sandbox demo of the WhyLabs Platform service with built-in monitoring which consumes statistical profiles generated by whylogs.
Machine learning applications are, by their nature, statistical. Profiling is a broadly applicable approach to characterizing distributions, making it viable for all types of data encountered in ML. Additionally, since the whylogs approach is streaming, trivially parallelizable, and map-reducible, it is naturally suited to all ML frameworks. At WhyLabs, our goal is to make data logging available and easy to implement for all AI practitioners.
The data explored in this blog post has been primarily structured data, but the team here at WhyLabs is already working on implementing profiling in whylogs for images, natural language data, time series data, and more. Current integrations include Python, Java, pandas, NumPy, Spark, MLflow, and more. We are rapidly expanding these to target every ML environment.
Check us out
Open source library
- whylogs Python
- whylogs Java — https://github.com/whylabs/whylogs-java
- WhyLabs Platform Sandbox Demo — https://try.whylabsapp.com
- WhyLabs.ai — https://whylabs.ai/
- Towards Data Science: Sampling isn’t enough, profile your ML data instead — https://towardsdatascience.com/sampling-isnt-enough-profile-your-ml-data-instead-6a28fcfb2bd4
- Join the WhyLabs Slack community to discuss ideas and give feedback on data logging and AI observability.