
Data Logging with whylogs: Profiling for Efficiency and Speed

With whylogs, the open standard for data logging, users can capture different aspects of how data and models look through snapshots, which we call data profiles. These profiles are mergeable and efficient, which we will focus on in this post. If you want to learn more or dive into whylogs before reading on, check out our documentation.

Logging data using profiles

When you log your data with whylogs, you get a data profile, a collection of metrics representing the current state of a dataset. It will carry only the necessary metadata that you need to answer questions like:

  • What is the maximum value of a column?
  • Is this column's value always between 0 and 100?
  • What is my null percentage for the entire dataset?

By creating lightweight profiles, we can answer these typical questions efficiently, without worrying too much about compute costs or storage.

We create data profiles instead of sampling because we look at every data point to decide whether it matters for the metric calculation (such as mean, maximum, etc.). With sampling, you would lose information on the neglected points, which could be outliers or contain something specific you need to run constraint logic on.

If we decided to run a full scan of the data and copy it over somewhere else, it would get too expensive, both in terms of compute and storage costs, and it would also take too long to run our sanity checks regularly. With profiling, we can calculate statistical information on our datasets quickly and create light representations of the data while still maintaining accuracy.

To create a profile, all you have to do is install whylogs and run:

import whylogs as why

# df is the pandas DataFrame you want to profile
results = why.log(df)

Once that's done, you can create a Profile View and inspect how the data you've just profiled looks, and even turn it into a pandas DataFrame for a better look into the details:

profile_view = results.view()
profile_view.to_pandas()
>>>        counts/n  counts/null  types/integral  types/fractional  types/boolean  ...  distribution/q_75  distribution/q_90  distribution/q_95  distribution/q_99                type
column                                                                             ...
col_1           100            0               0               100              0  ...           0.741651           0.862725           0.930534           0.987908  SummaryType.COLUMN
col_2           100            0               0               100              0  ...           0.713022           0.889486           0.911135           0.993780  SummaryType.COLUMN

[2 rows x 25 columns]

And looking at every column of this pandas DataFrame, we get:

>>> ['counts/n', 'counts/null', 'types/integral', 'types/fractional', 'types/boolean', 'types/string', 'types/object', 'cardinality/est', 'cardinality/upper_1', 'cardinality/lower_1', 'distribution/mean', 'distribution/stddev', 'distribution/n', 'distribution/max', 'distribution/min', 'distribution/q_01', 'distribution/q_05', 'distribution/q_10', 'distribution/q_25', 'distribution/median', 'distribution/q_75', 'distribution/q_90', 'distribution/q_95', 'distribution/q_99', 'type']
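
For example, the questions from the beginning of this post can be answered directly from this summary. Here is a minimal sketch that assumes the profiled DataFrame from above, where col_1 is just an illustrative column name:

summary = profile_view.to_pandas()

# What is the maximum value of a column?
print(summary.loc["col_1", "distribution/max"])

# Is this column's value always between 0 and 100?
print(summary.loc["col_1", "distribution/min"] >= 0 and summary.loc["col_1", "distribution/max"] <= 100)

# What is my null percentage for the entire dataset?
print(100 * summary["counts/null"].sum() / summary["counts/n"].sum())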

These data profiles can also be merged with profiles of future versions of your data, whether to capture drift in your streaming ingestion service or to compare static datasets over time. Because profiles can be merged at the end of the calculation steps, you can log data in a distributed architecture.
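
As a rough sketch of what merging can look like, assuming two hypothetical DataFrames df_monday and df_tuesday and the merge method available on profile views (check the whylogs documentation for your version):

import whylogs as why

view_monday = why.log(df_monday).view()
view_tuesday = why.log(df_tuesday).view()

# combine both snapshots into a single profile covering all the data
merged_view = view_monday.merge(view_tuesday)
merged_view.to_pandas()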

But you might be asking yourself: if I keep merging these pieces of information, won't I eventually pay the price of the heavyweight files I create? Actually, not really, and that's one of the main reasons whylogs is different. Because it's incredibly efficient, even if your datasets scale to larger sizes, the profiles won't lead to storage issues and profiling will remain fast. To show off whylogs' performance, in the next section we will look at some benchmarks we ran.

Benchmarking data profiles

To better understand how much a profile grows as the profiled data grows, I ran two analyses that clarify how profile size changes as the data changes. First, we will compare how the profile grows when we increase the number of rows and the number of columns, and then we will see how fast we can profile a typical dataset.

Growing the number of rows

First, let's see what happens with a rather small dataset of only four columns: three of random floats and one of strings. We can define a function to generate the data, as in the following code block:

import numpy as np
import pandas as pd

def make_data(n_rows):
    df = pd.DataFrame(
        {
            "a": np.random.random(n_rows),
            "b": np.random.random(n_rows),
            "c": np.random.choice(["This is a very long string", "Short one", "123", "345"], n_rows),
            "d": np.random.random(n_rows),
        }
    )
    return df

And for our analysis, we will call this exact function and repeatedly grow the number of rows passed to `n_rows`. By doing that, we are emulating a fixed-schema dataset that grows in observations over time.

To profile the data and write the resulting profile locally, you simply have to call:

profile = why.log(data)
profile.writer("local").write()
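
The measurement loop itself can be a simple sketch like the one below. It assumes the local writer drops a .bin profile file into the working directory, so we pick the newest .bin file to measure; exact file naming may differ between whylogs versions, and the row counts are only illustrative:

import glob
import os
import whylogs as why

profile_sizes = {}
for exponent in range(2, 7):  # 10^2 up to 10^6 rows
    n_rows = 10 ** exponent
    results = why.log(make_data(n_rows))
    results.writer("local").write()  # writes a binary profile locally
    # assumption: the local writer creates a .bin file in the current directory
    newest_profile = max(glob.glob("*.bin"), key=os.path.getmtime)
    profile_sizes[n_rows] = os.path.getsize(newest_profile)

print(profile_sizes)  # profile size in bytes per row count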

The results for this analysis were the following:

It's interesting to see that after reaching a certain plateau, the size only grows a minimal amount, on the order of kilobytes, compared to the growth in the number of rows. This enables us to store extremely small logs containing the information that really matters about our datasets, without having to worry about storage costs or scalability. With whylogs, you can deep-dive into the root causes of potential data drift while not worrying about storage or compute costs, whether your data is small or at scale.

This is possible because we benefit from Apache DataSketches' approach to storing information about each column accurately but lightly. If you wish to learn more, you can check their comparison of KLL Sketches to Doubles Sketches for float-type data.
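
To get an intuition for why the profile size plateaus, you can play with a KLL sketch directly. The following sketch assumes the datasketches Python package and its kll_floats_sketch API; the point is that the serialized sketch stays in the kilobyte range no matter how many values you stream into it:

import numpy as np
from datasketches import kll_floats_sketch

sketch = kll_floats_sketch()
for value in np.random.random(1_000_000):
    sketch.update(value)  # stream one million values into the sketch

print(len(sketch.serialize()))  # serialized size stays small and bounded
print(sketch.get_quantile(0.5))  # approximate median of everything streamed in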

NOTE: After 10^7 rows, the dataset (around 10 GB) no longer fit into my local machine's memory, so I had to parallelize the same work with Dask; that is out of the scope of this blog post. If you want us to show how you could parallelize whylogs profiling, let us know in our community Slack.

Growing the number of columns

Now, using almost the same approach as for growing the number of rows, we will fix the row count and grow the data along the column dimension, going from 10 to 1,000 columns, and see what happens to the size of the profiles.
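
The data generation for this case could look like the hypothetical helper below, which keeps the row count fixed at 100k and only varies the number of random-float columns:

import numpy as np
import pandas as pd
import whylogs as why

def make_wide_data(n_cols, n_rows=100_000):
    # fixed row count, growing number of random-float columns
    return pd.DataFrame({f"col_{i}": np.random.random(n_rows) for i in range(n_cols)})

for n_cols in (10, 100, 1000):
    results = why.log(make_wide_data(n_cols))
    results.writer("local").write()  # measure the written profile sizes as before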

The result might not be intuitive at first, but if we stop and think about it, it makes sense. Our profiling approach lets the data grow extensively, always capturing what is happening in every data point, without overloading the existing infrastructure or compute. Even for a dataset with 1,000 columns, which is not that common for regular tables, the profile is extremely light and captures fluctuations over time smoothly. For this example, we fixed the table at 100k rows, and the largest dataset reached 1.6 GB, which is still fine for local work; if the number of rows increased to 1 million, it would become harder to handle locally. We should also keep in mind that the profile size grows linearly with the number of columns.

Profiling Speed

To give you an idea of how fast whylogs profiles data, we will start with a fairly large but still in-memory dataset: 15 columns and 1M rows, with the DataFrame object taking about 120 MB in memory. To measure speed, we will keep the analysis simple and run a few timing loops with Python's built-in timeit module. We will take the largest time over 100 iterations to see how fast the benchmark machine (a 2020 M1 MacBook Pro) can process this call in the worst-case scenario, competing with other processes.

import timeit
import numpy as np
import pandas as pd
import whylogs as why

def make_data(n_rows):
    # 15 columns of random floats
    data_dict = {}
    for number in range(15):
        data_dict[f"col_{number}"] = np.random.random(n_rows)
    return pd.DataFrame(data_dict)

data = make_data(n_rows=10 ** 6)

def profile_data():
    return why.log(data)

# keep the slowest of 100 timed runs (worst case)
max_time = 0

for i in range(100):
    measure = timeit.timeit(profile_data, number=1)
    if measure > max_time:
        max_time = measure
print(max_time)

>>> 0.667508

It takes at most 667 milliseconds to profile this specific 1M-row dataset on the benchmark laptop, which gives a good idea of how much users can benefit from whylogs without worrying about the compute costs of their entire pipeline.

Profiling the same dataset with only 10 rows took at most 4.5 milliseconds, so for smaller chunks of data profiling can be even faster. After profiling, users can also upload the profile to the WhyLabs platform asynchronously, without making the requests too heavy. If you want to read more about our latest performance improvements, check out our blog post.
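
For reference, uploading to WhyLabs follows the same writer pattern as the local example above. This is a minimal sketch that assumes the WhyLabs writer and these credential environment variable names (replace the placeholders with your own values, and df with the DataFrame you are profiling):

import os
import whylogs as why

# assumption: these environment variable names match your whylogs version
os.environ["WHYLABS_API_KEY"] = "<your-api-key>"
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "<your-org-id>"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "<your-dataset-id>"

results = why.log(df)
results.writer("whylabs").write()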

Get started with data monitoring

In this blog post we talked about the importance of monitoring datasets, and learned how efficient whylogs can be. If you want to learn more or try whylogs for yourself, go to our examples page and check out our different use cases.

