Data Logging with whylogs: Profiling for Efficiency and Speed
- Whylogs
- Open Source
- ML Monitoring
Aug 31, 2022
With whylogs, the open standard for data logging, users can capture snapshots of how their data and models look, which we call data profiles. These profiles are mergeable and efficient, two properties we will focus on in this post. If you want to learn more or dive into whylogs before reading on, check out our documentation.
Logging data using profiles
When you log your data with whylogs, you get a data profile: a collection of metrics representing the current state of a dataset. It carries only the metadata you need to answer questions like:
- What is the maximum value of a column?
- Is this column's value always between 0 and 100?
- What is my null percentage for the entire dataset?
By creating lightweight profiles, we can answer these typical questions efficiently, without worrying too much about compute costs or storage.
We create data profiles instead of sampling because profiling looks at every data point when deciding what matters for each metric calculation (mean, maximum, and so on). With sampling, you would lose information about the points that were left out, which could be outliers or contain something specific you would want to run constraint logic against.
If we decided to run a full scan of the data and copy it over somewhere else, it would get too expensive, both in terms of computing and storage costs, and it'd also take too long to make our sanity checks regularly. With profiling, we can calculate statistical information on our datasets quickly and create light representations of the data while still maintaining accuracy.
To create a profile, all you have to do is install whylogs and run:
import whylogs as why

# df is any pandas DataFrame you want to profile
results = why.log(df)
Once that's done, you can create a Profile View to inspect the data you've just profiled, and even turn it into a pandas DataFrame for a closer look at the details:
profile_view = results.view()
profile_view.to_pandas()
>>>        counts/n  counts/null  types/integral  types/fractional  types/boolean  ...  distribution/q_75  distribution/q_90  distribution/q_95  distribution/q_99                type
column                                                                             ...
col_1           100            0               0               100              0  ...           0.741651           0.862725           0.930534           0.987908  SummaryType.COLUMN
col_2           100            0               0               100              0  ...           0.713022           0.889486           0.911135           0.993780  SummaryType.COLUMN

[2 rows x 25 columns]
And looking at every column of this pandas DataFrame, we get:
>>> ['counts/n', 'counts/null', 'types/integral', 'types/fractional', 'types/boolean', 'types/string', 'types/object', 'cardinality/est', 'cardinality/upper_1', 'cardinality/lower_1', 'distribution/mean', 'distribution/stddev', 'distribution/n', 'distribution/max', 'distribution/min', 'distribution/q_01', 'distribution/q_05', 'distribution/q_10', 'distribution/q_25', 'distribution/median', 'distribution/q_75', 'distribution/q_90', 'distribution/q_95', 'distribution/q_99', 'type']
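As a quick illustration of how these metrics map back to the questions at the top of this post, here is a minimal sketch that reads the answers straight out of the to_pandas() summary (col_1 here is just the example column from the output above):

summary = profile_view.to_pandas()

# What is the maximum value of a column?
max_col_1 = summary.loc["col_1", "distribution/max"]

# Is this column's value always between 0 and 100?
always_in_range = (
    summary.loc["col_1", "distribution/min"] >= 0
    and summary.loc["col_1", "distribution/max"] <= 100
)

# What is my null percentage for the entire dataset?
null_percentage = summary["counts/null"].sum() / summary["counts/n"].sum()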
These data profiles can also be merged with future versions of your data, whether to capture drift in your streaming ingestion service or to compare static datasets over time. Because profiles can be merged at the end of the calculation steps, you can log data in a distributed architecture.
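As a rough sketch of what merging looks like, assuming df_monday and df_tuesday are two batches of the same dataset:

import whylogs as why

view_monday = why.log(df_monday).view()
view_tuesday = why.log(df_tuesday).view()

# combine the two snapshots into a single profile covering both batches
merged_view = view_monday.merge(view_tuesday)
merged_view.to_pandas()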
But you might be asking yourself: if I keep merging these pieces of information, won't I eventually pay the price of the heavyweight files I create? Actually, not really, and that's one of the main reasons whylogs is different. Because it's incredibly efficient, even if your datasets scale to larger sizes, the profiles won't lead to storage issues, and profiling will remain fast. To show off whylogs' performance, in the next section we will look at some benchmarks we ran.
Benchmarking data profiles
To better understand how much a profile grows as the profiled data grows, I ran two analyses that clarify how changes to the data affect overall profile size. First, we will compare how the profile grows when we increase the number of rows and columns, and then we will see how fast we can profile a typical dataset.
Growing the number of rows
First, let's see what happens with a rather small dataset of only four columns: three of them random floats and the remaining one a string column. We can define a function that builds this data, as in the following code block:
import numpy as np
import pandas as pd

def make_data(n_rows):
    df = pd.DataFrame(
        {
            "a": np.random.random(n_rows),
            "b": np.random.random(n_rows),
            "c": np.random.choice(
                ["This is a very long string", "Short one", "123", "345"], n_rows
            ),
            "d": np.random.random(n_rows),
        }
    )
    return df
And for our analysis, we will call this exact function and repeatedly grow the number of rows passed as `n_rows`. By doing that, we emulate a fixed-schema dataset whose number of observations grows over time.
To profile and write the profile locally, you simply have to call:
profile = why.log(data)
profile.writer("local").write()
The results for this analysis were the following:
It is interesting to see that, after reaching a certain "plateau," the profile size grows only minimally, on the order of kilobytes, compared to the growth in the number of rows. This enables us to store extremely small logs containing the information that really matters about our datasets, without having to worry about storage costs or scalability. With whylogs, you can deep-dive into the root causes of potential data drift while not having to worry about storage or compute costs, whether your data is small or scales up.
This is possible because we benefit from Apache DataSketches' approach to storing information about each column of our datasets accurately but lightly. If you wish to learn more, you can check their comparison of KLL Sketches to Doubles Sketches for float-type data.
NOTE: Beyond 10^7 rows the dataset, at around 10 GB, no longer fit into my local machine's memory, so I had to parallelize the same work with Dask, but that is out of the scope of this blog post. If you want us to show how you could parallelize whylogs profiling, let us know in our community Slack.
Growing the number of columns
Now, with almost the same approach we used for growing the number of rows, we will fix the row count and grow the data along the column dimension, going from 10 to 1,000 columns to see what happens to the size of the profiles. A sketch of this experiment follows below.
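This is a minimal sketch of the setup, assuming a hypothetical make_wide_data helper that builds a DataFrame of random float columns with the row count fixed at 100k:

import numpy as np
import pandas as pd
import whylogs as why

def make_wide_data(n_cols, n_rows=100_000):
    # every column is a random float; only the column count varies
    return pd.DataFrame({f"col_{i}": np.random.random(n_rows) for i in range(n_cols)})

for n_cols in (10, 100, 1000):
    results = why.log(make_wide_data(n_cols))
    results.writer("local").write()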
The result may not be intuitive at first, but if we stop and think about it, it makes sense. Our profiling approach allows the data to grow extensively, always capturing what is happening at every data point, without overloading our existing infrastructure or processing power. Even for a dataset with 1,000 columns, which is not that common for regular tables, the profile is extremely light and captures fluctuations over time very smoothly. For this example we fixed the table at 100k rows, and the largest dataset reached 1.6 GB, which is still fine for local work; if the number of rows grew to 1 million, it would start to become harder to handle locally. We must also keep in mind that as the number of columns grows, the profile size grows roughly linearly with it.
Profiling Speed
To give you an idea of how fast whylogs profiles data, we will start with a fairly large but still in-memory dataset: 15 columns, 1M rows, and a DataFrame object of about 120 MB. To measure speed, we will keep the analysis simple and run a couple of timing loops with Python's built-in timeit module. We will take the largest time over 100 iterations to see how fast the benchmark machine (a 2020 M1 MacBook Pro) can process this call in the worst-case scenario, competing with other processes.
import timeit

import numpy as np
import pandas as pd
import whylogs as why

def make_data(n_rows):
    data_dict = {}
    for number in range(15):
        data_dict[f"col_{number}"] = np.random.random(n_rows)
    return pd.DataFrame(data_dict)

data = make_data(n_rows=10 ** 6)

def profile_data():
    return why.log(data)

max_time = 0
for i in range(100):
    measure = timeit.timeit(profile_data, number=1)
    if measure > max_time:
        max_time = measure

print(max_time)

>>> 0.667508
It takes at most 667 milliseconds to profile this specific 1M-row dataset on the benchmark laptop, which gives a good sense of how much users can benefit from whylogs without having to worry about compute costs for their entire pipeline.
Profiling the same dataset trimmed down to 10 rows took at most 4.5 milliseconds, so for smaller chunks of data profiling can be even faster. After profiling, users can also upload the profile to the WhyLabs platform asynchronously, without making their requests heavy. If you want to read more about our latest performance improvements, check out our blog post.
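Uploading looks roughly like the snippet below. This is a sketch that assumes the WHYLABS_API_KEY, WHYLABS_DEFAULT_ORG_ID, and WHYLABS_DEFAULT_DATASET_ID environment variables are set so the WhyLabs writer can find your account and dataset:

import whylogs as why

# assumes the WhyLabs credentials above are available in the environment
results = why.log(df)
results.writer("whylabs").write()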
Get started with data monitoring
In this blog post we talked about the importance of monitoring datasets, and learned how efficient whylogs can be. If you want to learn more or try whylogs for yourself, go to our examples page and check out our different use cases.