
Data Logging: Sampling versus Profiling

by Isaac Backus and Bernease Herman

In traditional software, logging and instrumentation have been adopted as standard practice to create transparency and to make sense of the health of a complex system. When it comes to AI applications, the lack of tools and standardized approaches means that logging is often spotty and incomplete. Here, I compare two approaches to data logging: sampling and profiling.

I have two goals in this post. First, I will demonstrate that profiling is superior to sampling. Profiling provides a lightweight, robust approach to characterizing distributions for all types of data encountered in ML. Second, I want to convince every data scientist to give data logging a shot. To that end, I present whylogs: an open source library developed by the team here at WhyLabs. The whylogs approach is suitable for any ML framework and enables scalable, statistical data logging and profiling in only a few lines of code.


In a post on Towards Data Science I argued that data logging is required for robust, mature ML/AI applications. I also outlined five requirements for logging software. Whylogs, particularly in combination with the WhyLabs AI Observability Platform, aims to hit those targets. The requirements are:

  • Ease of use
  • Lightweight
  • Standardized and portable
  • Configurable
  • Close to the code

There are two main approaches to data logging: profiling and sampling. Let’s see why profiling (as implemented by whylogs) beats out sampling.


Data sampling is a basic method for trying to monitor data in production environments. The idea is simple: randomly or programmatically select samples of data from a larger data stream and store them for later analysis. Implementing sampling typically requires no special software and can be achieved with little extra up-front design.
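
A minimal sketch of such a sampler is reservoir sampling, which maintains a fixed-size uniform random sample over a stream of unknown length (illustrative code, not part of whylogs):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing element with probability k / (i + 1)
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(100_000), k=100)
print(len(sample))  # 100
```

Every item in the stream ends up in the final sample with equal probability k/n, which is what makes the samples statistically usable downstream.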

However, there are some drawbacks to sampling which profiling attempts to address. For example, sampling can involve large I/O and storage costs. It tends to be noisy, and even though implementation is straightforward, consuming samples generally still requires statistical analysis specific to the dataset. The output data format depends on the input data, making it difficult to create generic tooling for consuming the events.

Rare events and outliers are frequently missed, and distribution metrics such as min/max or the number of unique values cannot be accurately estimated. These metrics are especially important for logging, since outliers and rare events are often correlated with data issues.


In contrast, profiling collects statistical measurements of the data. In the case of whylogs, the metrics produced come with mathematically derived uncertainty bounds. These profiles are scalable, lightweight, flexible, and configurable. Rare events and outlier-dependent metrics can be accurately captured. The results are statistical, directly interpretable, and stored in a standard, portable data format. Whylogs packages this all up and includes multi-language support, ease of use, reliability, and flexibility.

whylogs profiles

Whylogs implements a number of useful statistics for data profiling. All statistics are collected in a streaming fashion, requiring only a single pass over the data with minimal memory overhead. The resulting profiles are all mergeable, allowing statistics from multiple hosts, data partitions, or datasets to be merged post hoc. The approach is therefore trivially parallelizable and map-reducible, making it highly scalable.
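
To make "mergeable" concrete, here is a toy profile that tracks count, min, max, and mean in a single streaming pass and can be combined exactly across partitions. This is a simplified sketch; a real whylogs profile tracks many more metrics:

```python
from dataclasses import dataclass
import math

@dataclass
class MiniProfile:
    count: int = 0
    total: float = 0.0
    min: float = math.inf
    max: float = -math.inf

    def track(self, x: float) -> None:
        # Single streaming pass: O(1) state per metric
        self.count += 1
        self.total += x
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    def merge(self, other: "MiniProfile") -> "MiniProfile":
        # Profiles from different hosts or partitions combine exactly
        return MiniProfile(
            count=self.count + other.count,
            total=self.total + other.total,
            min=min(self.min, other.min),
            max=max(self.max, other.max),
        )

    @property
    def mean(self) -> float:
        return self.total / self.count

# Profile two partitions independently, then merge post hoc
a, b = MiniProfile(), MiniProfile()
for x in [1.0, 2.0, 3.0]:
    a.track(x)
for x in [10.0, 20.0]:
    b.track(x)
merged = a.merge(b)
print(merged.count, merged.min, merged.max, merged.mean)  # 5 1.0 20.0 7.2
```

Because the merge is exact, the merged profile is identical to the profile you would have gotten from processing all partitions on a single machine, which is what makes the approach map-reducible.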

Certain statistics can be tracked exactly, such as record count, data type counts, null count, min, max, and mean. Others — such as quantiles, histograms, or cardinality — require approximate statistics.
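
As one example of how an approximate statistic can be tracked in constant memory, the sketch below estimates cardinality with a K-minimum-values (KMV) estimator. This is an illustrative stand-in, not the sketch whylogs itself uses; the class name `KMV` and parameter `k` are my own:

```python
import hashlib

class KMV:
    """K-minimum-values sketch: estimates distinct counts in one pass, O(k) memory."""
    def __init__(self, k=256):
        self.k = k
        self.mins = set()  # the k smallest normalized hash values seen so far

    def _hash(self, item) -> float:
        # Map an item to a pseudo-uniform value in [0, 1)
        h = hashlib.md5(str(item).encode()).digest()
        return int.from_bytes(h, "big") / 2**128

    def add(self, item) -> None:
        self.mins.add(self._hash(item))
        if len(self.mins) > self.k:
            self.mins.discard(max(self.mins))

    def estimate(self) -> float:
        if len(self.mins) < self.k:
            return float(len(self.mins))  # exact at low cardinality
        # The kth-smallest of n uniform hashes sits near k / (n + 1)
        return (self.k - 1) / max(self.mins)

sketch = KMV(k=256)
for i in range(10_000):
    sketch.add(i % 5_000)  # stream with 5,000 distinct values
print(round(sketch.estimate()))  # approximately 5,000
```

The relative error of this estimator shrinks like 1/sqrt(k), so memory can be traded for accuracy without ever storing the raw values.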

Profiling vs sampling — experimental results

To compare profiling and sampling, I ran a number of experiments. The results demonstrate profiling’s improved accuracy over sampling, especially regarding outlier-dependent metrics, long tail distributions, and metrics such as cardinality estimates (number of unique values). Here I present two sets of experiments: the first targets distributional metrics, and the second targets unique value counts.

Experiment 1 — distributional metrics

The first set of experiments was run as follows:

  1. Select a distribution to sample from (outlined below)
  2. Randomly sample 10⁵ records
  3. Sample a subset of n_sample records such that the subset occupies as many bytes as the profile, for an apples-to-apples comparison. Accuracy improves for both sampling and profiling as the data size increases.
  4. Compare with exact values
  5. Repeat steps 2 through 4 for a total of 24 runs and average the results
  6. Repeat for every distribution
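
A stripped-down version of this setup can be sketched with NumPy: draw from a long-tailed Pareto distribution, then compare the exact max a streaming profile would record against the max of a small random sample (schematic code, not the original benchmark):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.pareto(1.5, size=100_000)  # long-tailed distribution with outliers

# "Profile": a single streaming pass tracks the max exactly
profile_max = float(np.max(data))  # stands in for a streaming max tracker

# "Sample": keep a small random subset and estimate from it
sample = rng.choice(data, size=1_000, replace=False)
sample_max = float(np.max(sample))

print(f"true max:   {profile_max:.1f}")
print(f"sample max: {sample_max:.1f}")  # typically far below the true max
```

A 1% sample only contains the single largest record 1% of the time, so its max estimate is both biased low and noisy, while the streaming max is exact by construction.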

As can be seen in the figures below, across all distribution types and for every metric, profiles outperform samples. This is especially clear for the long-tail Pareto distribution, which produces “outliers.” Outlier-related metrics cannot be captured by random sampling.

Mean estimates: errors in the estimate of the mean for sampling vs profiling for various distributions. Mean absolute error and mean relative (fractional) absolute error are shown. Profiling errors are too small to be seen.
Median estimates: errors in the estimate of the median for sampling vs profiling for various distributions. Mean absolute error and mean relative (fractional) absolute error are shown.
Edge quantiles: errors in the estimates of the 0.05 and 0.95 quantiles for sampling vs profiling for various distributions. Mean absolute error and mean relative (fractional) absolute error are shown.
Max: errors and bias in the estimate of maximum for sampling vs profiling for various distributions. The mean relative error (left) and mean relative bias (right) are shown. The relative bias is calculated as (bias)/(true value). Profiling errors are too small to be seen.
Min: errors in the estimate of the minimum for sampling vs profiling for various distributions. The mean relative error (right) and mean relative bias (left) are shown. The relative bias is calculated as (bias)/(true value). Profiling errors are too small to be seen.

Experiment 2 — unique value counts

When it comes to estimating the number of unique values, particularly at high cardinality or in unbalanced datasets where certain categories are rare, profiling significantly outperforms sampling.

This experiment proceeds as follows:

  1. Select a distribution and a number of unique values (n_true)
  2. Randomly sample 10⁶ records
  3. Sample a subset of n_sample records such that the subset is as many bytes as the profile.
  4. Estimate number of unique values (for both methods)
  5. Repeat steps 2 through 4 for a total of 15 runs and average the results
  6. Repeat steps 1 through 5 with a new choice of unique value count n_true
  7. Repeat for all distributions
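
The effect is easy to reproduce schematically: with a heavy-tailed (Zipf-like) categorical distribution, a small random sample never sees most of the rare categories, so its unique count falls far short of the truth (sizes here are chosen for speed, not to match the original experiment):

```python
import numpy as np

rng = np.random.default_rng(7)

# Heavy-tailed categorical data: a few frequent values, many rare ones
data = rng.zipf(1.5, size=1_000_000)

true_unique = len(np.unique(data))       # exact count over the full stream
sample = rng.choice(data, size=1_000, replace=False)
sample_unique = len(np.unique(sample))   # what sampling would report

print(true_unique, sample_unique)  # sampling misses most rare categories
```

The sample can report at most 1,000 distinct values no matter how many exist, whereas a cardinality sketch sees every record exactly once as it streams by.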

As can be seen in the figure below, only in the case of a uniform distribution and fairly low cardinality does sampling accurately estimate the number of categories.

Experiment 2 distributions

Unique value counts: Errors in the estimate of number of unique values (y-axis) are displayed for profiling vs sampling across different (discrete) distributions for a range of unique value counts (x-axis). Results are for 1M samples.

Monitoring profiles

Beyond their statistical accuracy, another motivation for data logging with profiles is how well they can be used for automated monitoring of ML/AI applications and pipelines. There are a number of reasons that make profiles especially well suited for monitoring. Profiles are:

  • Lightweight
    This encourages broad monitoring across many data sources. There is very little cost in terms of person hours (implementation), storage, or compute.
  • Controlled
    whylogs profiles are a standardized, cross-language, cross-platform format. They provide monitoring targets consistent across customers, platforms, databases, etc.
  • Simple
    Monitoring on arbitrary data can add an additional complex, fragile data layer to an already complex system. The structured, standardized profiles are much simpler than samples of arbitrary data.
  • Human-centered
    Profiles produce interpretable statistics and signals, which are essential for debugging data pipelines and understanding model performance.
  • Statistical
    Statistical monitoring algorithms are more interpretable and robust than black box ML monitoring. One can include statistical knowledge of the profiles when designing monitoring algorithms.

Monitoring demo

To see what monitoring and data logging can look like in production, check out the live sandbox demo of the WhyLabs Platform service with built-in monitoring which consumes statistical profiles generated by whylogs.


Machine learning applications are, by their nature, statistical. Profiling is a broadly applicable approach to characterizing distributions, making it viable for all types of data encountered in ML. Additionally, since the whylogs approach is streaming, trivially parallelizable, and map-reducible, it is naturally suited to all ML frameworks. At WhyLabs, our goal is to make data logging available and easy to implement for all AI practitioners.

The data explored in this blog post has been primarily structured data, but the team here at WhyLabs is already working on implementing profiling in whylogs for images, natural language data, time series data, and more. Current integrations include Python, Java, pandas, NumPy, Spark, MLflow, and more. We are rapidly expanding these to target every ML environment.
