
Data Logging with whylogs: Profiling for Efficiency and Speed

With whylogs, the open standard for data logging, users can capture different aspects of how data and models look through snapshots, which we call data profiles. These profiles are mergeable and efficient - two properties we will focus on in this post. If you want to learn more or dive into whylogs before reading on, check out our documentation.

Logging data using profiles

When you log your data with whylogs, you get a data profile, a collection of metrics representing the current state of a dataset. It will carry only the necessary metadata that you need to answer questions like:

  • What is the maximum value of a column?
  • Is this column's value always between 0 and 100?
  • What is my null percentage for the entire dataset?

By creating lightweight profiles, we can answer these typical questions efficiently, without worrying too much about compute costs or storage.
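As a toy illustration in plain Python (this is not the whylogs API - the metric names and structure below are made up for the example), a profile that keeps only a few per-column metrics is already enough to answer all three questions:

```python
# Hypothetical, simplified profile: per-column metrics only, no raw data.
profile = {
    "age":   {"n": 1000, "null": 20, "min": 18.0, "max": 87.0},
    "score": {"n": 1000, "null": 5,  "min": 0.0,  "max": 99.5},
}

# What is the maximum value of a column?
max_age = profile["age"]["max"]

# Is this column's value always between 0 and 100?
score_in_range = 0 <= profile["score"]["min"] and profile["score"]["max"] <= 100

# What is my null percentage for the entire dataset?
total_n = sum(col["n"] for col in profile.values())
total_null = sum(col["null"] for col in profile.values())
null_pct = 100 * total_null / total_n

print(max_age, score_in_range, null_pct)  # 87.0 True 1.25
```

None of these answers require touching the original rows again - the metrics in the profile are all we need.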

We create data profiles instead of sampling because profiling looks at every data point when updating a metric (such as the mean or maximum). With sampling, you would lose information about the skipped points, which could be outliers or contain exactly the values a constraint check needs to see.

If we decided to run a full scan of the data and copy it over somewhere else, it would get too expensive, both in terms of computing and storage costs, and it'd also take too long to make our sanity checks regularly. With profiling, we can calculate statistical information on our datasets quickly and create light representations of the data while still maintaining accuracy.
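The single-pass idea can be sketched in plain Python (a conceptual stand-in, not how whylogs is implemented - real profiles use mergeable sketches rather than exact state): every data point updates a small, fixed-size summary, so memory stays constant no matter how many rows stream through.

```python
class ColumnSummary:
    """Constant-size summary that is updated one value at a time."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.min = float("inf")
        self.max = float("-inf")

    def update(self, value):
        # Every point is seen, but none is stored.
        self.n += 1
        self.total += value
        self.min = min(self.min, value)
        self.max = max(self.max, value)

    @property
    def mean(self):
        return self.total / self.n


summary = ColumnSummary()
for value in [3.0, 1.0, 4.0, 1.0, 5.0]:  # imagine millions of rows here
    summary.update(value)

print(summary.n, summary.min, summary.max, summary.mean)  # 5 1.0 5.0 2.8
```

The summary's size is independent of the number of rows, which is exactly why profiling stays cheap where a full copy of the data would not.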

To create a profile, all you have to do is install whylogs and run:

import whylogs as why
results = why.log(df)

Once that's done, you can create a Profile View and inspect how the data you've just profiled looks, and even turn it into a pandas DataFrame for a better look into the details:

profile_view = results.view()
profile_view.to_pandas()
>>>        counts/n  counts/null  types/integral  types/fractional  types/boolean  ...  distribution/q_75  distribution/q_90  distribution/q_95  distribution/q_99                type
column                                                                             ...
col_1           100            0               0               100              0  ...           0.741651           0.862725           0.930534           0.987908  SummaryType.COLUMN
col_2           100            0               0               100              0  ...           0.713022           0.889486           0.911135           0.993780  SummaryType.COLUMN

[2 rows x 25 columns]

And looking at every column of this pandas DataFrame, we get:

>>> ['counts/n', 'counts/null', 'types/integral', 'types/fractional', 'types/boolean', 'types/string', 'types/object', 'cardinality/est', 'cardinality/upper_1', 'cardinality/lower_1', 'distribution/mean', 'distribution/stddev', 'distribution/n', 'distribution/max', 'distribution/min', 'distribution/q_01', 'distribution/q_05', 'distribution/q_10', 'distribution/q_25', 'distribution/median', 'distribution/q_75', 'distribution/q_90', 'distribution/q_95', 'distribution/q_99', 'type']

These data profiles can also be merged with future versions of your data, whether to capture drift in a streaming ingestion service or to compare static datasets over time. Because profiles can be merged after the calculation step, you can log data in a distributed architecture and combine the results afterwards.
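The mergeability property can be illustrated with a toy example in plain pandas (again, not the whylogs API): summarize each chunk of a dataset independently - say, on different workers - then merge the partial summaries, and the result matches a summary of the full dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.random(1_000)})

# Summarize each chunk independently, e.g. on different workers.
chunks = [df.iloc[start:start + 250] for start in range(0, len(df), 250)]
partials = [
    {"n": len(chunk), "sum": chunk["x"].sum(), "max": chunk["x"].max()}
    for chunk in chunks
]

# Merging the partial summaries is cheap and order-independent...
merged = {
    "n": sum(p["n"] for p in partials),
    "sum": sum(p["sum"] for p in partials),
    "max": max(p["max"] for p in partials),
}

# ...and matches a summary computed over the full dataset in one pass.
assert merged["n"] == len(df)
assert merged["max"] == df["x"].max()
assert abs(merged["sum"] - df["x"].sum()) < 1e-6
```

whylogs profiles behave the same way, except the merged state includes sketch-based metrics (quantiles, cardinality) rather than only exact counts.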

But you might be asking yourself: if I keep merging these pieces of information, won't I eventually pay the price in heavyweight files? Actually, not really, and that's one of the main reasons whylogs is different. Because it's so efficient, profiles stay small and fast to compute even as your datasets scale, so they won't lead to storage issues. To show off whylogs' performance, in the next section we will look at some benchmarks we ran.

Benchmarking data profiles

To better understand how much a profile grows as the profiled data grows, I ran two analyses that clarify the impact of data changes on overall profile size. First, we will compare how the profile grows as we increase the number of rows and then the number of columns, and finally we will see how fast we can profile a typical dataset.

Growing the number of rows

First, let's see what happens with a rather small dataset of only four columns: three random floats and one string column. We can define a function to make the data, as in the following code block:

import numpy as np
import pandas as pd

def make_data(n_rows):
    df = pd.DataFrame({
        "a": np.random.random(n_rows),
        "b": np.random.random(n_rows),
        "c": np.random.choice(["This is a very long string", "Short one", "123", "345"], n_rows),
        "d": np.random.random(n_rows),
    })
    return df

And for our analysis, we will call this exact function repeatedly, growing the `n_rows` argument each time. By doing that, we emulate a fixed-schema dataset whose number of observations grows over time.
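The loop for this analysis looks roughly like the sketch below. Here we only measure the in-memory size of the raw data with pandas; the actual benchmark presumably profiled each DataFrame with whylogs and measured the serialized profile size instead, so treat this as an illustration of the harness, not the original script.

```python
import numpy as np
import pandas as pd

def make_data(n_rows):
    return pd.DataFrame({
        "a": np.random.random(n_rows),
        "b": np.random.random(n_rows),
        "c": np.random.choice(["This is a very long string", "Short one", "123", "345"], n_rows),
        "d": np.random.random(n_rows),
    })

# Grow the row count by powers of ten and record how big the raw data gets.
raw_sizes = {}
for exponent in range(2, 6):
    n_rows = 10 ** exponent
    df = make_data(n_rows)
    raw_sizes[n_rows] = df.memory_usage(deep=True).sum()

# The raw data grows roughly linearly with the row count; a whylogs
# profile of the same data stays near-constant in size after an
# initial plateau, which is the point of the benchmark.
```

The contrast between the two curves - linear for raw data, nearly flat for profiles - is what the results below show.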

To profile the data and write the resulting profile locally, you simply have to call:

results = why.log(data)
results.writer("local").write()

The results for this analysis were the following:

It is interesting to see that after reaching a certain "plateau," the size grows only minimally - on the order of kilobytes - compared to the growth in the number of rows. This lets us store extremely small profiles containing the information that really matters about our datasets, without worrying about storage costs or scalability. With whylogs, you can deep-dive into the root causes of potential data drift while keeping storage and compute costs low, whether your data is small or at scale.

This is possible because we benefit from Apache DataSketches' approach to storing information about each column accurately but compactly. If you wish to learn more, you can check their comparison of KLL sketches to Doubles sketches for float-type data.

NOTE: After 10^7 rows, the dataset (around 10 GB) wouldn't fit into my local machine's memory, so I had to parallelize the same work with Dask; that is out of scope for this blog post. If you want us to show how to parallelize whylogs profiling, let us know in our community Slack.

Growing the number of columns

Now, using almost the same approach as for growing the number of rows, we will fix the rows and grow the data along the column dimension instead. We will go from 10 to 1,000 columns and see what happens to the size of the profiles.
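A data generator for this dimension could look like the sketch below (the column names and the default row count are my assumptions; the post only states the range of column counts and that rows were fixed at 100k):

```python
import numpy as np
import pandas as pd

def make_wide_data(n_cols, n_rows=100_000):
    """Fixed number of rows, growing number of float columns."""
    return pd.DataFrame(
        {f"col_{i}": np.random.random(n_rows) for i in range(n_cols)}
    )

# Each of these DataFrames would then be profiled with why.log(...)
# and the serialized profile size recorded, as in the row benchmark.
wide = make_wide_data(n_cols=10, n_rows=1_000)
print(wide.shape)  # (1000, 10)
```

Sweeping `n_cols` over 10, 100, and 1,000 reproduces the shape of this analysis.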

This result is not obvious at first, but if we stop and think about it, it makes sense. Our profiling approach lets the data grow extensively, always capturing what happens at every data point, without overloading the existing infrastructure or compute. Even for a dataset with 1,000 columns, which is not that common for regular tables, the profile is extremely light and still captures fluctuations over time smoothly. For this example, we fixed the table at 100k rows; the largest dataset reached 1.6 GB, which is still fine for local work, but with 1 million rows it would start to become hard to handle locally. Keep in mind that as the number of columns grows, the profile size grows linearly with it.

Profiling Speed

To give you an idea of how fast whylogs profiles data, we will start with a fairly large but still in-memory dataset: 15 columns, 1M rows, and a DataFrame object of about 120 MB. To measure speed, we will keep the analysis simple and run a few timing loops with Python's built-in timeit module. We will take the largest time over 100 iterations, to see how fast the benchmark machine (a 2020 M1 MacBook Pro) can process this call in the worst case scenario - competing with other processes.

import timeit
import numpy as np
import pandas as pd
import whylogs as why

def make_data(n_rows):
    data_dict = {}
    for number in range(15):
        data_dict[f"col_{number}"] = np.random.random(n_rows)
    return pd.DataFrame(data_dict)

data = make_data(n_rows=10 ** 6)

def profile_data():
    return why.log(data)

max_time = 0

for _ in range(100):
    measure = timeit.timeit(profile_data, number=1)
    max_time = max(max_time, measure)

print(max_time)
>>> 0.667508

It takes at most 667 milliseconds to profile this specific 1M-row dataset on the benchmark laptop - enough to show that users can benefit from whylogs without having to worry about the compute cost of profiling in their pipeline.

Profiling the same dataset with 10 rows took at most 4.5 milliseconds, so for smaller chunks of data, profiling can be even faster. After profiling, users can also upload the profile to the WhyLabs platform asynchronously, keeping the requests light. If you want to read more about our latest performance improvements, check out our blog post.

Get started with data monitoring

In this blog post we talked about the importance of monitoring datasets, and learned how efficient whylogs can be. If you want to learn more or try whylogs for yourself, go to our examples page and check out our different use cases.

