Understanding Kolmogorov-Smirnov (KS) Tests for Data Drift on Profiled Data
- Data Quality
- ML Monitoring
Dec 21, 2022
TLDR: We experimented with statistical tests, Kolmogorov-Smirnov (KS) specifically, applied to full datasets as well as dataset profiles and compared results. The results allow us to discuss the limitations of data profiling for KS drift detection and the pros and cons of the KS algorithm for different scenarios. We also provide the code for you to reproduce the experiments yourself.
Data drift is a well-known issue in ML applications. If unaddressed, it can significantly degrade your model's performance and eventually render it unusable. The first step toward addressing it is being able to detect and monitor data drift.
There are multiple approaches to monitoring data drift in production. It is very common to use statistical tests to compute a drift detection value and monitor it over time. Traditional drift detection algorithms usually need the full original data to calculate these values, but for large-scale systems, keeping complete access to historical data might be infeasible due to storage or privacy concerns. A possible alternative is to sample your data beforehand, which also has disadvantages: you might lose important information, such as rare events and outliers, in the sampling process, hurting the result's reliability.
A third approach is to profile your data before applying your drift detection algorithm. Profiles capture key statistical properties of data, such as distribution metrics, frequent items, missing values, and much more. We can then use those statistical properties to apply adapted versions of drift detection techniques. Of course, since there is no such thing as a free lunch, this strategy has its downsides. A profile is an estimate of the original data and, as such, using it for drift detection will generate approximations of the actual drift detection value that you would get if you had used the complete data.
But how exactly does the profiling process work with the drift detection algorithms, and how much do we lose by doing it?
In this blog post, we’ll limit ourselves to numerical univariate distributions, and choose one specific algorithm to run the experiment: the Kolmogorov-Smirnov (KS) test. We’ll also get to have some nice insights into the suitability of the KS test itself for different scenarios.
Here’s what we’ll cover in this blog post:
- What is the KS test?
- What is data profiling?
- Experiment Design
- The Experiments
- Experiment #1 — Data volume
- Experiment #2 — No. of Buckets
- Experiment #3 — Profile size
You can check the code used in this blog post or even run the experiment yourself by accessing the experiment’s Google Colab notebook.
The Kolmogorov-Smirnov Test
The KS test is a test of the equality between two one-dimensional probability distributions. It can be used to compare a sample with a reference probability distribution or compare two samples. Right now, we are interested in the latter. When comparing two samples, we are trying to answer the following question:
“What is the probability that these two sets of samples were drawn from the same probability distribution?”
The KS test is nonparametric, which means we don’t need to rely on assumptions that the data are drawn from a given family of distributions. This is good, since we often won’t know the underlying distribution beforehand in the real world.
The statistic
The KS statistic can be expressed as:
D = supₓ|F₁(x) − F₂(x)|
where F₁ and F₂ are the empirical cumulative distribution functions of the first and second sample, respectively.
Another way to put it is that the KS statistic is the maximum absolute difference between the two cumulative distributions.
The image below shows an example of the statistic, depicted as a black arrow.
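To make this concrete, here is a minimal sketch of the two-sample test using scipy's ks_2samp; the sample sizes, seed, and the small mean shift are arbitrary choices for illustration:

```python
# A minimal two-sample KS test with scipy; sample sizes and shift are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)
target = rng.normal(loc=0.2, scale=1.0, size=5_000)  # slightly shifted mean

result = stats.ks_2samp(reference, target)
print(f"KS statistic D = {result.statistic:.4f}, p-value = {result.pvalue:.3g}")
```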
The Null Hypothesis
The null hypothesis used in this experiment is that both samples are drawn from the same distribution. A small p-value indicates that the observed data would be unlikely if all of the assumptions defining our statistical model (including our test hypothesis) were true. In other words, we can interpret the p-value as a measure of compatibility between the data and the underlying assumptions that define our statistical model, with 0 representing complete incompatibility and 1 representing complete compatibility*[2].
To calculate this value, the KS statistic is taken into account along with the sample size of both distributions. Typical thresholds for rejecting the null hypothesis are 1% and 5%, implying that any p-value less than or equal to these values would lead to the rejection of the null hypothesis.
(*) Errata by the author — The original sentence in the "The Null Hypothesis" section was: "For example, a p-value of 0.05 would mean a 5% probability of both samples being from the same distribution." This is a misconception and does not reflect the correct definition of the p-value, as explained in the paper Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations[2].
Data Profiling
Profiling a dataset means collecting statistical measurements of the data. This enables us to generate statistical fingerprints, or summaries, of our data in a scalable, lightweight, and flexible manner. Rare events and outlier-dependent metrics can be accurately captured.
To profile our data, we'll use the open-source data logging library whylogs. Profiling with whylogs is done in a streaming fashion, requiring a single pass over the data, and allows for parallelization. Profiles are also mergeable, letting you inspect your data across multiple computing instances, time periods, or geographic locations. This is made possible by a technique called data sketching, pioneered by Apache DataSketches.
For this example specifically, we'll leverage the profile's distribution metrics: to calculate the KS statistic, we need an approximation of each sample's cumulative distribution function, which the underlying data sketches provide.
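As a rough idea of what profiling looks like in practice, here is a minimal sketch using whylogs' why.log API (assuming the whylogs v1 API; the exact summary columns returned by to_pandas may differ between versions):

```python
# A minimal profiling sketch, assuming the whylogs v1 API.
import numpy as np
import pandas as pd
import whylogs as why

rng = np.random.default_rng(0)
batch_1 = pd.DataFrame({"feature": rng.normal(size=10_000)})
batch_2 = pd.DataFrame({"feature": rng.normal(size=10_000)})

# Each batch is profiled in a single pass over the data.
view_1 = why.log(batch_1).view()
view_2 = why.log(batch_2).view()

# Profiles are mergeable, e.g. across time periods or compute instances.
merged_view = view_1.merge(view_2)
print(merged_view.to_pandas())  # summary statistics, including distribution metrics
```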
Experiment Design
First, we need the data. For this experiment, we will take two samples of equal size from the following distributions:
- Normal: Broad class of data. Unskewed and peaked around the center
- Pareto: Skewed data with long tail/outliers
- Uniform: Evenly sampled across its domain
In this blog post, we’ll show the results for normal distribution only, but you can find the same experiments for Pareto and Uniform distributions directly in the example notebook here. The overall conclusions drawn from the normal distribution case can also be applied to the remaining distributions.
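For reference, generating the samples is straightforward with numpy; the parameters below are illustrative and not necessarily the ones used in the notebook:

```python
# Illustrative sample generation for the three distribution families.
import numpy as np

rng = np.random.default_rng(7)
sample_size = 10_000

samples = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=sample_size),
    "pareto": rng.pareto(a=3.0, size=sample_size),          # skewed, long tail
    "uniform": rng.uniform(low=0.0, high=1.0, size=sample_size),
}
```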
Drift Injection
Next, we'll inject drift into one sample (which we'll call the target distribution) and compare it to the unaltered reference distribution.
We will inject drift artificially by simply shifting the data's mean by a chosen fraction of the distribution's interquartile range (IQR). Here's what it looks like for the normal distribution case:
The idea is to have four different scenarios: no drift, small drift, medium drift, and large drift. The magnitude classification and the ideal process of detecting/alerting for drifts can be very subjective, depending on the desired sensitivity for your particular application. In this case, we are assuming that the small-drift scenario is small enough for it to be safely ignored. We are also expecting that the medium and large drift scenarios should result in a drift alert since both would be cases for further inspection.
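A minimal sketch of this drift-injection idea is shown below. The IQR fractions are stand-ins: 0.4 and 0.75 are the drift sizes mentioned later in the post, while the small-drift value here is purely an assumption for illustration.

```python
# A sketch of drift injection: shift the target's mean by a fraction of the
# reference distribution's interquartile range (IQR). The fractions below are
# illustrative stand-ins for the no/small/medium/large scenarios.
import numpy as np

def inject_drift(sample: np.ndarray, reference: np.ndarray, iqr_fraction: float) -> np.ndarray:
    q75, q25 = np.percentile(reference, [75, 25])
    return sample + iqr_fraction * (q75 - q25)

rng = np.random.default_rng(7)
reference = rng.normal(size=10_000)
scenarios = {"no": 0.0, "small": 0.1, "medium": 0.4, "large": 0.75}
targets = {
    name: inject_drift(rng.normal(size=10_000), reference, frac)
    for name, frac in scenarios.items()
}
```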
Applying the KS test
As the ground truth, we will use scipy’s implementation of the two-sample KS test with the complete data from both samples. We will then compare those results with the profiled version of the test. To do so, we’ll use whylogs’ approximate implementation of the same test, which uses only the statistical profile of each sample.
The distribution metrics contained in the profiles are obtained from a process called sketching, which gives them many useful properties but adds some amount of error to the result. For this reason, the KS test result can be different each time a profile is generated. We’ll profile the data 10 times for every scenario, and compare the ground truth to statistics such as the mean, maximum, and minimum of those runs.
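The snippet below sketches the idea behind a profile-based test using Apache DataSketches' KLL quantile sketches directly: build one sketch per sample, read both approximate CDFs on a shared grid of bucket edges, take the maximum absolute difference, and convert it to an asymptotic p-value. It illustrates the mechanism rather than whylogs' exact internals, and the bucket count and K value simply mirror the defaults discussed below.

```python
# A sketch-based approximation of the two-sample KS test, built on KLL sketches.
# This shows the general idea, not whylogs' exact implementation.
import numpy as np
from datasketches import kll_floats_sketch
from scipy import stats
from scipy.special import kolmogorov

def sketch_ks(reference, target, k: int = 1024, n_buckets: int = 100):
    ref_sk, tgt_sk = kll_floats_sketch(k), kll_floats_sketch(k)
    for value in reference:
        ref_sk.update(value)
    for value in target:
        tgt_sk.update(value)
    # Evaluate both approximate CDFs on a shared, equally spaced grid of bucket edges.
    lo = float(min(reference.min(), target.min()))
    hi = float(max(reference.max(), target.max()))
    edges = [float(x) for x in np.linspace(lo, hi, n_buckets)]
    d = max(abs(r - t) for r, t in zip(ref_sk.get_cdf(edges), tgt_sk.get_cdf(edges)))
    # Asymptotic two-sample p-value from the Kolmogorov limiting distribution.
    n, m = len(reference), len(target)
    p_value = float(kolmogorov(d * np.sqrt(n * m / (n + m))))
    return d, p_value

rng = np.random.default_rng(0)
reference = rng.normal(size=10_000)
target = rng.normal(loc=0.1, size=10_000)

print("ground truth :", stats.ks_2samp(reference, target))
print("sketch-based :", sketch_ks(reference, target))  # varies slightly between runs
```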
Experiment Variables
Our main goal is to answer:
“How does whylogs’ KS implementation compare to scipy’s implementation?”
However, the answer depends on several variables. We will run three separate experiments to better understand the effect of each one: data volume, number of buckets, and profile size. The first relates to the number of data points in each sample, whereas the last two relate to whylogs' internal, tunable parameters.
Experiment #1 — Data Volume
The number of data points in a sample affects not only the KS test in general but also the profiling process itself. It is reasonable, then, to investigate how it affects the results.
We compared the p-values for both implementations with varying sample sizes (for both target and reference distributions): 500, 1k, 5k, 10k, and 50k.
You’ll notice that we don’t have error bars for the ground truth. For a given sample size and drift magnitude, scipy’s result is deterministic, since we’re always using the complete data, whereas, for whylogs, the error bars represent the maximum and minimum values found in the 10 runs.
Note that, for the medium and large drift cases, the p-values on the y-axis are very close to 0, so even for a sample size of 500, both implementations result in a p-value of effectively 0, indicating that our data is highly incompatible with the null hypothesis. For the no-drift and small-drift scenarios, both implementations yield very similar results when comparing the mean p-value of the sketch-based implementation, with some differences for specific runs, especially for larger sample sizes. However, in almost all cases, the ground truth lies within the range of the profiled runs. It is also worth noting that, at a 95% confidence level, both implementations would lead to the same conclusion for all points in all scenarios.
The KS test is very sensitive, and its sensitivity increases with sample size: in the small-drift scenario, for sample sizes greater than or equal to 5k, we reject the null hypothesis. Even though this is not technically wrong, we initially considered this drift small enough to be safely ignored.
At this point, we should ask ourselves whether this test is actually telling us what we care about. A p-value smaller than 0.05 would lead to rejecting the null hypothesis, but it doesn’t tell us anything about the effect size. In other words, it tells us that there is a difference, but not how much of a difference there is. There might be statistical significance, but not an actual practical significance to it.
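You can see this sensitivity directly with scipy alone, without any profiling involved: keep a small, fixed mean shift and grow the sample size, and the p-values tend to shrink toward rejection. The shift of 0.05 standard deviations below is an arbitrary illustration.

```python
# Sensitivity check: a fixed small shift tends to become "statistically significant"
# as the sample size grows, even if it is practically negligible.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
for n in [500, 1_000, 5_000, 10_000, 50_000]:
    reference = rng.normal(size=n)
    target = rng.normal(loc=0.05, size=n)  # small, arguably ignorable shift
    p = stats.ks_2samp(reference, target).pvalue
    print(f"n={n:>6}: p-value = {p:.4f}")
```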
Experiment #2 — No. of Buckets
To get a discrete cumulative distribution, we first need to define the number of buckets. The sketch-based KS test will then use those buckets to calculate the statistic. We will run experiments with equally spaced buckets, using 5, 10, 50, and 100 of them. For each of the 10 runs, we will calculate the absolute error between the exact implementation and whylogs' sketch-based one, and plot the mean along with error bars representing the minimum and maximum errors found. We will show those errors according to sample size and drift magnitude, just like in the previous experiment.
whylogs’ current version has 100 as the default number of buckets, and that is also the value used in the previously shown results.
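Reusing the sketch_ks helper from the earlier snippet, the sweep itself can be written as a small loop; again, this mirrors the structure of the experiment rather than reproducing whylogs' implementation.

```python
# Sweep the number of buckets and measure the absolute p-value error against
# scipy's result on the full data (reuses sketch_ks, reference, and target from above).
import numpy as np
from scipy import stats

ground_truth_p = stats.ks_2samp(reference, target).pvalue

for n_buckets in [5, 10, 50, 100]:
    # Re-profile on every run: sketching is randomized, so each run differs slightly.
    errors = [
        abs(sketch_ks(reference, target, n_buckets=n_buckets)[1] - ground_truth_p)
        for _ in range(10)
    ]
    print(f"buckets={n_buckets:>3}: "
          f"mean={np.mean(errors):.4f}, min={np.min(errors):.4f}, max={np.max(errors):.4f}")
```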
Since some of the values in the graphs are much higher than the rest, we break the y-axis in some cases to better visualize all the bars in the plot. Even so, some of the bars are still too small to be seen: the errors for the medium and large drift scenarios are very close to 0, meaning that both implementations get similar results.
Overall, the error’s mean seems to decrease when increasing the number of buckets. However, the variance of the errors increases for higher sample sizes, which is due to the increasing estimation errors in the profiling process.
The experiments so far show some degree of randomness for the no-drift scenario, for both implementations. Since the KS test relies solely on the maximum absolute difference between distributions, any slight changes resulting from the sampling process will affect the no-drift scenario.
Experiment #3 — Profile Size
As previously stated, a profile contains an approximate distribution in the form of a data sketch. A data sketch is configured with a parameter K, which dictates the profile's size and its estimation error [3]. The higher this parameter, the lower the estimation error. All of the previous experiments were run with K=1024, but now we want to see how the errors are affected by varying K.
This time, we will fix the sample size to 100k and the number of buckets to 100 and vary the K parameter to the following values: 256, 512, 1024, 2048, and 4096.
We have omitted the charts for drift sizes 0.4 and 0.75 because their errors were consistently small, making visualization unnecessary.
The x-axis shows the profile's size when serialized: K values of 256, 512, 1024, 2048, and 4096 yield approximate profile sizes of 6KB, 11KB, 22KB, 43KB, and 83KB, respectively.
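To get a feel for the size side of this trade-off, you can serialize bare KLL sketches built with different K values and compare byte counts (assuming the datasketches Python bindings' serialize() method; the numbers will not match the profile sizes above exactly, since a whylogs profile stores more than a single sketch).

```python
# Rough size check: how the KLL parameter K affects the serialized sketch size.
import numpy as np
from datasketches import kll_floats_sketch

rng = np.random.default_rng(0)
data = rng.normal(size=100_000)

for k in [256, 512, 1024, 2048, 4096]:
    sk = kll_floats_sketch(k)
    for value in data:
        sk.update(value)
    print(f"K={k:>4}: ~{len(sk.serialize()) / 1024:.1f} KB serialized")
```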
As seen before, any drifted scenario simply shows how sensitive the KS test is. The bars can't be seen for the medium and large drift scenarios because their values are effectively 0, but in the no-drift scenario we can see that the error is inversely related to the profile size, and by extension to the K parameter. Increasing K reduces the errors due to profiling, bringing the two implementations' results closer together.
We can also verify that, for this scenario, the errors are quite small. But if we are interested in minimizing them further, we can trade additional profile space for better results.
Conclusion
Let’s summarize some key takeaways from these experiments:
- Performing the KS test on data profiles is possible, and the results are very close to the standard implementation. However, the results are non-deterministic.
- The KS test is very sensitive, and it tends to become even more sensitive with larger sample sizes. Testing under the null hypothesis tells us whether there is a difference between distributions, but not how large that difference is.
- We can tune whylogs' internal parameters for better results. In particular, we can increase the profile size to get results closer to the ground truth.
We hope this helps build intuition on how the KS test works together with data profiling, as well as a better understanding of the KS test's limitations. Motivated by that, we are already implementing additional similarity measures in whylogs! For instance, the Hellinger distance is already available, so stay tuned for more experiments and benchmarks!
Thank you for reading, and feel free to reach out if you have any questions/suggestions! If you’re interested in exploring whylogs in your projects, consider joining our Slack community to get support and also share feedback!
“Understanding Kolmogorov-Smirnov (KS) Tests for Data Drift on Profiled Data” was originally published by Towards Data Science.
References
[1] — Kolmogorov–Smirnov test. (2022, October 29). In Wikipedia. https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
[2] — Greenland, S., Senn, S.J., Rothman, K.J. et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 31, 337–350 (2016).
[3] — Karnin, Z., Lang, K., & Liberty, E. (2016). Optimal Quantile Approximation in Streams. arXiv. https://doi.org/10.48550/arXiv.1603.05346