Data Logging with whylogs: Profiling for Efficiency and Speed
- Whylogs
- Open Source
- ML Monitoring
Aug 31, 2022
With whylogs, the open standard for data logging, users can capture snapshots of how their data and models look, which we call data profiles. These profiles are mergeable and efficient, two properties we will focus on in this post. If you want to learn more or dive into whylogs before reading on, check out our documentation.
Logging data using profiles
When you log your data with whylogs, you get a data profile: a collection of metrics representing the current state of a dataset. It carries only the metadata you need to answer questions like:
- What is the maximum value of a column?
- Is this column's value always between 0 and 100?
- What is my null percentage for the entire dataset?
By creating lightweight profiles, we can answer these typical questions efficiently, without worrying too much about compute costs or storage.
We create data profiles instead of sampling because profiling looks at every data point when deciding what matters for each metric calculation (mean, maximum, and so on). With sampling, you would lose information about the points that were left out, which could be outliers or contain something specific you would want to run constraint logic against.
If we decided to run a full scan of the data and copy it over somewhere else, it would get too expensive, both in terms of computing and storage costs, and it'd also take too long to make our sanity checks regularly. With profiling, we can calculate statistical information on our datasets quickly and create light representations of the data while still maintaining accuracy.
To create a profile, all you have to do is install whylogs and run:
import whylogs as why

# df is any pandas DataFrame you want to profile
results = why.log(df)
Once that's done, you can create a Profile View to inspect the data you've just profiled, and even turn it into a pandas DataFrame for a closer look at the details:
profile_view = results.view()
profile_view.to_pandas()
>>>        counts/n  counts/null  types/integral  types/fractional  types/boolean  ...  distribution/q_75  distribution/q_90  distribution/q_95  distribution/q_99                type
column                                                                             ...
col_1           100            0               0               100              0  ...           0.741651           0.862725           0.930534           0.987908  SummaryType.COLUMN
col_2           100            0               0               100              0  ...           0.713022           0.889486           0.911135           0.993780  SummaryType.COLUMN

[2 rows x 25 columns]
And looking at every column of this pandas DataFrame, we get:
>>> ['counts/n', 'counts/null', 'types/integral', 'types/fractional', 'types/boolean', 'types/string', 'types/object', 'cardinality/est', 'cardinality/upper_1', 'cardinality/lower_1', 'distribution/mean', 'distribution/stddev', 'distribution/n', 'distribution/max', 'distribution/min', 'distribution/q_01', 'distribution/q_05', 'distribution/q_10', 'distribution/q_25', 'distribution/median', 'distribution/q_75', 'distribution/q_90', 'distribution/q_95', 'distribution/q_99', 'type']
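As a quick illustration of how these metrics map back to the questions at the top of this post, here is a minimal sketch that reads the answers straight out of the to_pandas() summary (col_1 here is just the example column from the output above):

summary = profile_view.to_pandas()

# What is the maximum value of a column?
max_col_1 = summary.loc["col_1", "distribution/max"]

# Is this column's value always between 0 and 100?
always_in_range = (
    summary.loc["col_1", "distribution/min"] >= 0
    and summary.loc["col_1", "distribution/max"] <= 100
)

# What is my null percentage for the entire dataset?
null_percentage = summary["counts/null"].sum() / summary["counts/n"].sum()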
These data profiles can also be merged with future versions of your data, whether to capture drift in your streaming ingestion service or to compare static datasets over time. Because profiles can be merged at the end of the calculation steps, you can log data in a distributed architecture.
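As a rough sketch of what merging looks like, assuming df_monday and df_tuesday are two batches of the same dataset:

import whylogs as why

view_monday = why.log(df_monday).view()
view_tuesday = why.log(df_tuesday).view()

# combine the two snapshots into a single profile covering both batches
merged_view = view_monday.merge(view_tuesday)
merged_view.to_pandas()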
But you might be asking yourself: if I keep merging these pieces of information, won't I eventually pay the price of the heavyweight files I create? Actually, not really, and that's one of the main reasons whylogs is different. Because it's incredibly efficient, even if your datasets scale to larger sizes, the profiles won't lead to storage issues, and profiling will remain fast. To show off whylogs' performance, in the next section we will look at some benchmarks we ran.
Benchmarking data profiles
To better understand how much a profile grows as the profiled data grows, I ran two analyses that clarify how changes to the data affect overall profile size. First, we will compare how the profile grows when we increase the number of rows and columns, and then we will see how fast we can profile a typical dataset.
Growing the number of rows
First, let's see what happens with a rather small dataset of only four columns: three of them random floats and the remaining one a string column. We can define a function that builds this data, as in the following code block:
import numpy as np
import pandas as pd

def make_data(n_rows):
    df = pd.DataFrame(
        {
            "a": np.random.random(n_rows),
            "b": np.random.random(n_rows),
            "c": np.random.choice(
                ["This is a very long string", "Short one", "123", "345"], n_rows
            ),
            "d": np.random.random(n_rows),
        }
    )
    return df
And for our analysis, we will call this exact function and repeatedly grow the number of rows passed as `n_rows`. By doing that, we emulate a fixed-schema dataset whose number of observations grows over time.
To profile and write the profile locally, you simply have to call:
profile = why.log(data)
profile.writer("local").write()
The results for this analysis were the following:
It is interesting to see that, after reaching a certain "plateau," the profile size grows only minimally, on the order of kilobytes, compared to the growth in the number of rows. This enables us to store extremely small logs containing the information that really matters about our datasets, without having to worry about storage costs or scalability. With whylogs, you can deep-dive into the root causes of potential data drift while not having to worry about storage or compute costs, whether your data is small or scales up.
This is possible because we benefit from Apache DataSketches' approach to storing information about each column of our datasets accurately but lightly. If you wish to learn more, you can check their comparison of KLL Sketches to Doubles Sketches for float-type data.
NOTE: Beyond 10^7 rows the dataset, at around 10 GB, no longer fit into my local machine's memory, so I had to parallelize the same work with Dask, but that is out of the scope of this blog post. If you want us to show how you could parallelize whylogs profiling, let us know in our community Slack.
Growing the number of columns
Now, with almost the same approach we used for growing the number of rows, we will fix the row count and grow the data along the column dimension, going from 10 to 1,000 columns to see what happens to the size of the profiles. A sketch of this experiment follows below.
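This is a minimal sketch of the setup, assuming a hypothetical make_wide_data helper that builds a DataFrame of random float columns with the row count fixed at 100k:

import numpy as np
import pandas as pd
import whylogs as why

def make_wide_data(n_cols, n_rows=100_000):
    # every column is a random float; only the column count varies
    return pd.DataFrame({f"col_{i}": np.random.random(n_rows) for i in range(n_cols)})

for n_cols in (10, 100, 1000):
    results = why.log(make_wide_data(n_cols))
    results.writer("local").write()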
The result may not be intuitive at first, but if we stop and think about it, it makes sense. Our profiling approach allows the data to grow extensively, always capturing what is happening at every data point, without overloading our existing infrastructure or processing power. Even for a dataset with 1,000 columns, which is not that common for regular tables, the profile is extremely light and captures fluctuations over time very smoothly. For this example we fixed the table at 100k rows, and the largest dataset reached 1.6 GB, which is still fine for local work; if the number of rows grew to 1 million, it would start to become harder to handle locally. We must also keep in mind that as the number of columns grows, the profile size grows roughly linearly with it.
Profiling Speed
To give you an idea of how fast whylogs profiles data, we will start with a fairly large but still in-memory dataset: 15 columns, 1M rows, and a DataFrame object of about 120 MB. To measure speed, we will keep the analysis simple and run a couple of timing loops with Python's built-in timeit module. We will take the largest time over 100 iterations to see how fast the benchmark machine (a 2020 M1 MacBook Pro) can process this call in the worst-case scenario, competing with other processes.
import timeit

import numpy as np
import pandas as pd
import whylogs as why

def make_data(n_rows):
    data_dict = {}
    for number in range(15):
        data_dict[f"col_{number}"] = np.random.random(n_rows)
    return pd.DataFrame(data_dict)

data = make_data(n_rows=10 ** 6)

def profile_data():
    return why.log(data)

max_time = 0
for i in range(100):
    measure = timeit.timeit(profile_data, number=1)
    if measure > max_time:
        max_time = measure

print(max_time)

>>> 0.667508
It takes at most 667 milliseconds to profile this specific 1M-row dataset on the benchmark laptop, which gives a good sense of how much users can benefit from whylogs without having to worry about compute costs for their entire pipeline.
Profiling the same dataset trimmed down to 10 rows took at most 4.5 milliseconds, so for smaller chunks of data profiling can be even faster. After profiling, users can also upload the profile to the WhyLabs platform asynchronously, without making their requests heavy. If you want to read more about our latest performance improvements, check out our blog post.
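Uploading looks roughly like the snippet below. This is a sketch that assumes the WHYLABS_API_KEY, WHYLABS_DEFAULT_ORG_ID, and WHYLABS_DEFAULT_DATASET_ID environment variables are set so the WhyLabs writer can find your account and dataset:

import whylogs as why

# assumes the WhyLabs credentials above are available in the environment
results = why.log(df)
results.writer("whylabs").write()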
Get started with data monitoring
In this blog post we talked about the importance of monitoring datasets, and learned how efficient whylogs can be. If you want to learn more or try whylogs for yourself, go to our examples page and check out our different use cases.