Data Drift Monitoring and Its Importance in MLOps
- ML Monitoring
- Data Quality
- WhyLabs
- Whylogs
Aug 29, 2023
Machine Learning (ML) is now an essential tool in most modern businesses, driving everything from predictive analytics to AI-enhanced applications. However, to ensure the effectiveness of your models, it's important to continuously monitor and manage ML performance; this process is known as Machine Learning Operations (MLOps). One crucial aspect of MLOps is managing "data drift." But what is data drift, and why is it so important to monitor it in your MLOps pipeline?
This post covers:
- What is data drift?
- The consequences of ignoring data drift
- Data drift monitoring in MLOps
- Mitigating data drift
- How to detect data drift with whylogs
- Conclusion
What is data drift?
Data drift refers to the change or variation in the input data of your ML model over time. It can occur for a variety of reasons: the data might change naturally over time with the seasons, the patterns and behaviors of users might evolve, or the business environment itself might shift, altering the data being fed into the model.
Simply put, the model's predictions are only as good as the data it is trained on. If the data that the model is seeing in the production environment starts to drift outside the distribution of the data it was trained on, the model's performance could decrease substantially.
In this blog we’ll focus on covariate drift, the form of data drift that occurs when the statistical properties of the input features in production change over time. We’ll cover other types of model drift in future blog posts.
The consequences of ignoring data drift
Depending on your ML application, ignoring data drift can have serious consequences. The performance of your ML models can decline without your knowledge, leading to inaccurate predictions and suboptimal decisions. This could also lead to a loss of trust in the models or product, making stakeholders and customers reluctant to rely on them.
For example, consider a credit card fraud detection model. The patterns of fraudulent transactions may change over time as fraudsters adapt their strategies. If the model is not adjusted to reflect these changing patterns, the number of false positives and false negatives can increase, potentially resulting in financial loss or even damage to the company's reputation.
Data drift monitoring in MLOps
Given the potential consequences, integrating data drift monitoring into your MLOps pipeline is important. Continuous monitoring can help you detect and address any data drift to maintain your ML models' performance and reliability.
To implement data drift detection, you first need to define what constitutes a significant drift for each feature in your model. Then, by continuously comparing the distribution of the training data with that of the data in production, you can detect any significant drifts.
Different statistical tests can be used for comparison, such as the Kullback-Leibler (KL) divergence or the Kolmogorov-Smirnov (KS) test. These tests give you a measure of how much the data distributions differ, which can be used to trigger alerts if the drift exceeds a certain threshold.
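As a minimal, framework-agnostic sketch of this comparison, the snippet below runs a two-sample KS test from SciPy on a single feature and flags possible drift when the p-value falls below an assumed 0.05 threshold; the synthetic arrays stand in for a training sample and a shifted production sample.
# A minimal sketch of a per-feature drift check using SciPy's two-sample KS test.
# The synthetic arrays and the 0.05 p-value threshold are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
training_feature = rng.normal(loc=0.0, scale=1.0, size=1000)    # stand-in for training data
production_feature = rng.normal(loc=0.5, scale=1.0, size=1000)  # stand-in for drifted production data

result = ks_2samp(training_feature, production_feature)
if result.pvalue < 0.05:  # assumed alerting threshold
    print(f"Possible drift: KS statistic={result.statistic:.3f}, p-value={result.pvalue:.3g}")
else:
    print("No significant drift detected for this feature.")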
Mitigating data drift
Once data drift is detected, the next step is mitigation. A common approach is to annotate the new data and retrain the model. You may want to compare model performance between models before deploying to production.
A well-structured MLOps pipeline can help automate these steps, minimizing the manual effort required to retrain models and ensuring faster response times by triggering workflows when data drift is detected, as sketched below. At a minimum, ML monitoring should be configured to send an alert when data drift occurs so you can take action.
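As a rough sketch of what that automation might look like, the function below ties drift detection to an alert and a retrain-and-compare step. The retrain, evaluate, deploy, and alert callables are hypothetical hooks standing in for your own training, evaluation, deployment, and alerting tooling, and the comparison assumes an evaluation metric where higher is better.
# A minimal sketch of an automated mitigation step (not a whylogs API).
# retrain, evaluate, deploy, and alert are hypothetical hooks into your own
# training, evaluation, deployment, and alerting tooling.
def mitigate_drift(drift_detected, current_model, new_data,
                   retrain, evaluate, deploy, alert):
    if not drift_detected:
        return current_model
    alert("Data drift detected - retraining candidate model")
    candidate = retrain(new_data)  # retrain on freshly annotated data
    # Compare model performance before deploying (assumes higher metric is better)
    if evaluate(candidate) >= evaluate(current_model):
        deploy(candidate)
        return candidate
    return current_model  # keep the existing model otherwise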
How to detect data drift with whylogs
Fortunately, MLOps is a quickly maturing field and many tools now exist to help make ML pipelines robust and responsible! We’ll take a quick look at how you can use the open source library, whylogs.
Once you install whylogs in any Python environment using `pip`, profiles of your dataset can be created with just a few lines of code! These data profiles only contain summary statistics about your dataset and can be used to monitor for data drift and data quality issues without compromising your raw data.
import whylogs as why
import pandas as pd

# Profile a pandas dataframe (for example, the training/reference data)
df = pd.read_csv("path/to/file.csv")
profile_view1 = why.log(df).view()
# Profile a second dataframe (for example, recent production data)
df2 = pd.read_csv("path/to/production_file.csv")
profile_view2 = why.log(df2).view()
Next, we can get a data drift report between profiles using the built-in `NotebookProfileVisualizer`. By default, whylogs uses the KS test to calculate the drift distance between the profiles, but other popular drift metrics can be configured instead.
# Measure Data Drift with whylogs
from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=profile_view1, reference_profile_view=profile_view2)
# Render the summary drift report in the notebook
visualization.summary_drift_report()
In this example, we can see that data drift has been detected for the “petal length” feature in the iris dataset, and the drift score has been calculated using the KS test.
To get a better visualization of the data drift on an individual feature, we can use the `double_histogram` method to overlay the histograms of the petal length feature for each profile.
visualization.double_histogram(feature_name="petal length (cm)")
In this example, we can see that the distributions of the two profiles hardly overlap, indicating a very large distribution drift.
To return the data drift metrics, use `calculate_drift_scores` from whylogs. This will return a Python dictionary containing the data drift metric, scores, and thresholds for each feature. Learn more about adjusting these parameters in this example.
from whylogs.viz.drift.column_drift_algorithms import calculate_drift_scores
scores = calculate_drift_scores(target_view=profile_view1, reference_view=profile_view2, with_thresholds=True)
print(scores)
Returned data drift metrics in a Python dictionary.
{'sepal length (cm)': {'algorithm': 'ks',
'pvalue': 0.2694519362228452,
'statistic': 0.11333333333333329,
'thresholds': {'NO_DRIFT': (0.15, 1),
'POSSIBLE_DRIFT': (0.05, 0.15),
'DRIFT': (0, 0.05)},
'drift_category': 'NO_DRIFT'},
'sepal width (cm)': {'algorithm': 'ks',
'pvalue': 0.9756502052466759,
'statistic': 0.05333333333333334,
'thresholds': {'NO_DRIFT': (0.15, 1),
'POSSIBLE_DRIFT': (0.05, 0.15),
'DRIFT': (0, 0.05)},
'drift_category': 'NO_DRIFT'},
'petal length (cm)': {'algorithm': 'ks',
'pvalue': 0.9993989748100714,
'statistic': 0.04000000000000001,
'thresholds': {'NO_DRIFT': (0.15, 1),
'POSSIBLE_DRIFT': (0.05, 0.15),
'DRIFT': (0, 0.05)},
'drift_category': 'NO_DRIFT'},
'petal width (cm)': {'algorithm': 'ks',
'pvalue': 0.9756502052466759,
'statistic': 0.053333333333333344,
'thresholds': {'NO_DRIFT': (0.15, 1),
'POSSIBLE_DRIFT': (0.05, 0.15),
'DRIFT': (0, 0.05)},
'drift_category': 'NO_DRIFT'}}
You can use these values to monitor for data drift between two profiles directly in your Python environment.
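For example, a small loop over the returned dictionary can flag every feature whose drift category is anything other than `NO_DRIFT`; what you do with the flagged features (alerting, retraining, and so on) is up to your pipeline.
# Flag features whose drift category is anything other than NO_DRIFT,
# using the scores dictionary returned by calculate_drift_scores above.
drifted_features = [
    feature for feature, result in scores.items()
    if result["drift_category"] != "NO_DRIFT"
]
if drifted_features:
    print(f"Drift detected for: {drifted_features}")
else:
    print("No drift detected between the two profiles.")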
We can go further in ML monitoring for data drift by using the WhyLabs Observatory. The WhyLabs Observatory makes it easy to store, visualize, and monitor profiles created with whylogs.
In order to write profiles to WhyLabs, we’ll create an account and grab our `Org-ID`, `Access token`, and `Project-ID` to set them as environment variables in our project.
# Set WhyLabs access keys
import os

os.environ["WHYLABS_DEFAULT_ORG_ID"] = 'YOURORGID'
os.environ["WHYLABS_API_KEY"] = 'YOURACCESSTOKEN'
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = 'PROJECTID'
Once the access keys are set up, we can easily create a profile of the dataset and write it to WhyLabs. This allows us to monitor input data and model predictions with just a few lines of code!
# Initialize WhyLabs writer, create whylogs profile, write profile to WhyLabs
from whylogs.api.writer.whylabs import WhyLabsWriter

writer = WhyLabsWriter()
profile = why.log(df)
writer.write(file=profile.view())
Now we can enable a pre-configured monitor with just one click (or create a custom one) to detect anomalies in our data profiles. This makes it easy to set up common monitoring tasks, such as detecting data drift, data quality issues, and model performance degradation.
Once a monitor is configured, it can be previewed while inspecting the feature it's set to monitor.
When data drift is detected, notifications can be sent via email, Slack, or trigger a workflow using PagerDuty. Set notification preferences in Settings > Global Notification Actions.
That’s it! We have gone through all the steps needed to monitor for data drift in ML pipelines to get notified or trigger a workflow if drift occurs.
If you’d like to follow along with a full example in a notebook, check out the WhyLabs onboarding guide.
Data drift conclusion
As we’ve seen, data drift is a critical consideration in the life cycle of ML models. As the world and the data we collect continually evolve, our models must adapt to stay relevant and reliable. Integrating data drift monitoring into your MLOps pipeline is necessary to ensure the continuous delivery of high-performing ML models.
By understanding, monitoring, and mitigating data drift, you can increase the longevity of your ML models, maximize their value, and keep stakeholders confident in the insights they produce. The ultimate goal is to make your ML systems robust, reliable, and resilient in the face of change, a principle that lies at the core of effective MLOps.
Learn more about how to detect data drift with these resources:
- ML Monitoring in Under 5 Minutes
- Understanding Kolmogorov-Smirnov (KS) Tests for Data Drift on Profiled Data
Ready to implement data and ML monitoring in your own applications?
- Check out whylogs for open-source data logging
- Create a free WhyLabs account for data and ML monitoring
- Join our Community Slack channel to ask questions and learn more