
Data Logging with whylogs: Profiling for Efficiency and Speed

With whylogs, the open standard for data logging, users can capture different aspects of how data and models look through snapshots, which we call data profiles. These profiles are mergeable and efficient, and that efficiency is what we will focus on in this post. If you want to learn more or dive into whylogs before reading on, check out our documentation.

Logging data using profiles

When you log your data with whylogs, you get a data profile: a collection of metrics representing the current state of a dataset. It carries only the metadata you need to answer questions like:

  • What is the maximum value of a column?
  • Is this column's value always between 0 and 100?
  • What is my null percentage for the entire dataset?

By creating lightweight profiles, we can answer these typical questions efficiently, without worrying too much about compute costs or storage.

We create data profiles instead of sampling because profiling looks at every data point when calculating each metric (mean, maximum, and so on). With sampling, you would lose information about the points left out, which could be outliers or contain exactly the values you need to run constraint checks against.

If we instead ran a full scan of the data and copied it somewhere else, it would be too expensive, both in compute and storage costs, and it would also take too long to run our sanity checks regularly. With profiling, we can calculate statistical information about our datasets quickly and create lightweight representations of the data while still maintaining accuracy.

To create a profile, all you have to do is install whylogs and run:

import whylogs as why
 
# df is any pandas DataFrame you want to profile
results = why.log(df)

Once that's done, you can create a Profile View and inspect how the data you've just profiled looks, and even turn it into a pandas DataFrame for a better look into the details:

profile_view = results.view()
profile_view.to_pandas()
>>>        counts/n  counts/null  types/integral  types/fractional  types/boolean  ...  distribution/q_75  distribution/q_90  distribution/q_95  distribution/q_99                type
column                                                                             ...
col_1           100            0               0               100              0  ...           0.741651           0.862725           0.930534           0.987908  SummaryType.COLUMN
col_2           100            0               0               100              0  ...           0.713022           0.889486           0.911135           0.993780  SummaryType.COLUMN
 
[2 rows x 25 columns]

And looking at every column of this pandas DataFrame, we get:

>>> ['counts/n', 'counts/null', 'types/integral', 'types/fractional', 'types/boolean', 'types/string', 'types/object', 'cardinality/est', 'cardinality/upper_1', 'cardinality/lower_1', 'distribution/mean', 'distribution/stddev', 'distribution/n', 'distribution/max', 'distribution/min', 'distribution/q_01', 'distribution/q_05', 'distribution/q_10', 'distribution/q_25', 'distribution/median', 'distribution/q_75', 'distribution/q_90', 'distribution/q_95', 'distribution/q_99', 'type']
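
As a quick illustration, the bullet-point questions from the beginning of this post can be answered directly from this summary DataFrame. A minimal sketch, using the column names listed above and the col_1/col_2 names from the example profile:

summary = profile_view.to_pandas()
 
# What is the maximum value of a column?
max_col_1 = summary.loc["col_1", "distribution/max"]
 
# Is this column's value always between 0 and 100?
in_range = (summary.loc["col_1", "distribution/min"] >= 0) and (summary.loc["col_1", "distribution/max"] <= 100)
 
# What is my null percentage for the entire dataset?
null_pct = 100 * summary["counts/null"].sum() / summary["counts/n"].sum()
 
print(max_col_1, in_range, null_pct)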

These data profiles can also be merged with future versions of your data, whether to capture drift in your streaming ingestion service or to compare static datasets over time. Because profiles can be merged at the end of the calculation steps, you can log data in a distributed architecture.
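
Here is a minimal sketch of what merging looks like; the two DataFrames below are placeholders for, say, two batches arriving on different days or processed by different workers:

import pandas as pd
import whylogs as why
 
batch_1 = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
batch_2 = pd.DataFrame({"a": [4.0, 5.0, 6.0]})
 
# Profile each batch independently (possibly on different machines)...
view_1 = why.log(batch_1).view()
view_2 = why.log(batch_2).view()
 
# ...then merge the resulting profile views into a single profile covering both batches.
merged = view_1.merge(view_2)
print(merged.to_pandas()["counts/n"])  # counts now reflect all six rows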

But you might be asking yourself: if I keep merging these pieces of information, won't I eventually pay the price in heavyweight files? Not really, and that's one of the main reasons whylogs is different. Because profiles are so compact, they won't cause storage issues even as your datasets grow, and profiling itself stays fast. To show off whylogs' performance, in the next section we will look at some benchmarks we ran.

Benchmarking data profiles

To better understand how much a profile grows as the profiled data grows, we ran two analyses that show how profile size responds to changes in the data. First, we will compare how the profile grows when we increase the number of rows and then the number of columns, and after that we will see how fast we can profile a typical dataset.

Growing the number of rows

First, let's see what happens with a rather small dataset of only four columns: three of them random floats and the remaining one a string column. We can define a function to generate this data, as in the following code block:

import numpy as np
import pandas as pd
 
def make_data(n_rows):
    df = pd.DataFrame(
        {
            "a": np.random.random(n_rows),
            "b": np.random.random(n_rows),
            "c": np.random.choice(["This is a very long string", "Short one", "123", "345"], n_rows),
            "d": np.random.random(n_rows),
        }
    )
    return df

For our analysis, we will call this function repeatedly, growing the value passed as `n_rows` each time. By doing that, we emulate a fixed-schema dataset whose number of observations grows over time.

To profile the data and write the resulting profile locally, you simply have to call:

profile = why.log(data)
profile.writer("local").write()
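
A rough sketch of how the row-growth loop can be put together is shown below; the row counts, file handling, and size reporting are illustrative assumptions (the local writer is assumed to drop .bin profile files in the working directory), not the exact harness behind the published numbers:

import glob
import os
import whylogs as why
 
for exponent in range(2, 8):                  # 10^2 up to 10^7 rows
    n_rows = 10 ** exponent
    data = make_data(n_rows)                  # make_data() from the block above
    why.log(data).writer("local").write()     # same call as above
 
    # Report the size of the most recently written profile file
    latest = max(glob.glob("*.bin"), key=os.path.getmtime)
    print(f"{n_rows:>10} rows -> {os.path.getsize(latest) / 1024:.1f} KiB ({latest})")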

The results for this analysis were the following:

It is interesting to see that after reaching a certain "plateau," the profile size grows only minimally (on the order of kilobytes) compared to the growth in the number of rows. This lets us store extremely small logs containing the information that really matters about our datasets, without worrying about storage costs or scalability. With whylogs, you can deep-dive into the root causes of potential data drift without worrying about storage or compute costs, whether your data stays small or scales.

This is possible because we benefit from Apache DataSketches' approach to storing information about each column accurately but compactly. If you wish to learn more, you can check their comparison of KLL Sketches to Doubles Sketches for float-type data.
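
To get a feel for this accuracy trade-off yourself, you can compare a sketched statistic against the exact value computed by numpy. A small illustrative example, reusing the make_data() helper defined above:

import numpy as np
import whylogs as why
 
data = make_data(10 ** 6)
summary = why.log(data).view().to_pandas()
 
approx_median = summary.loc["a", "distribution/median"]
exact_median = np.median(data["a"])
 
# The KLL sketch returns an approximate quantile; the error is typically tiny.
print(f"approximate: {approx_median:.6f}, exact: {exact_median:.6f}")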

NOTE: Beyond 10^7 rows the dataset (around 10 GB) no longer fit into my local machine's memory, so I had to parallelize the same work with Dask, but that is out of scope for this blog post. If you want us to show how to parallelize whylogs profiling, let us know in our community Slack.

Growing the number of columns

Now, using almost the same approach as for the rows, we will fix the number of rows and grow the data along the column dimension instead, going from 10 to 1,000 columns to see what happens to the profile size.
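
The setup could look roughly like the sketch below; the make_wide_data helper is an illustrative assumption rather than the exact code behind the results, but it follows the same pattern as the row experiment, with 100k rows of random floats and only the column count changing:

import numpy as np
import pandas as pd
import whylogs as why
 
def make_wide_data(n_cols, n_rows=100_000):
    # One random float column per index: col_0, col_1, ..., col_{n_cols - 1}
    return pd.DataFrame({f"col_{i}": np.random.random(n_rows) for i in range(n_cols)})
 
for n_cols in (10, 100, 1000):
    wide_df = make_wide_data(n_cols)
    why.log(wide_df).writer("local").write()  # same local-writer pattern as before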

This result may not be obvious at first, but if we stop and think about it, it makes sense. Our profiling approach lets the data grow extensively while still capturing what happens at every data point, without overloading the existing infrastructure or compute. Even for a dataset with 1,000 columns, which is not that common for regular tables, the profile stays extremely light and captures fluctuations over time smoothly. For this example we fixed the table size at 100k rows, and the largest dataset reached 1.6 GB, which is still fine for local work; if the number of rows grew to 1 million, it would become much harder to handle locally. Keep in mind that as the number of columns grows, the profile size also grows linearly.

Profiling Speed

To give you an idea of how fast whylogs profiles data, we will start with a fairly large but still in-memory dataset: 15 columns and 1M rows, with the DataFrame object taking about 120 MB of memory. To measure speed, we will keep the analysis simple and run a timing loop with Python's built-in timeit module, taking the largest time across 100 iterations. This shows how fast the benchmark machine (a 2020 M1 MacBook Pro) can process the call in the worst case, while competing with other processes.

import timeit
import numpy as np
import pandas as pd
import whylogs as why

def make_data(n_rows):
    # 15 columns of random floats
    data_dict = {}
    for number in range(15):
        data_dict[f"col_{number}"] = np.random.random(n_rows)
    return pd.DataFrame(data_dict)

data = make_data(n_rows=10 ** 6)

def profile_data():
    return why.log(data)

# Keep the slowest of 100 single runs as the worst-case measurement
max_time = 0

for i in range(100):
    measure = timeit.timeit(profile_data, number=1)
    if measure > max_time:
        max_time = measure
print(max_time)

>>> 0.667508

It takes at most 667 milliseconds to profile this particular 1M-row dataset on the benchmark laptop, which shows how much users can benefit from whylogs without worrying about the compute cost it adds to their pipeline.

Profiling the same dataset with only 10 rows took at most 4.5 milliseconds, so for smaller chunks of data profiling can be even faster. After profiling, users can also upload the profile to the WhyLabs platform asynchronously, without making the requests too heavy. If you want to read more about our latest performance improvements, check out our blog post.
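
As a rough sketch, uploading uses the same writer pattern shown earlier, pointed at the whylabs writer instead of the local one; this assumes you have configured your WhyLabs API key and org/dataset IDs as described in the whylogs documentation:

import whylogs as why
 
results = why.log(data)                # profiling itself stays fast and in-process
results.writer("whylabs").write()      # then ship the compact profile to the platform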

Get started with data monitoring

In this blog post we talked about the importance of monitoring datasets, and learned how efficient whylogs can be. If you want to learn more or try whylogs for yourself, go to our examples page and check out our different use cases.

