Data Drift vs. Concept Drift and Why Monitoring for Them is Important
- ML Monitoring
- Data Quality
Jan 1, 2024
As the volume of data generated and used in organizations grows, maintaining the accuracy and relevance of the data used for modeling and analysis becomes increasingly important. Since machine learning (ML) models rely on data for training, their accuracy may degrade over time due to data distribution changes triggered by environmental conditions, user behavior, or data inconsistencies from upstream sources. Data drift and concept drift are two common challenges that can impact ML models in production. Their underlying causes are complex, and the two are easily confused.
This blog explores:
- What concept drift and data drift are, including their causes and mathematical representations
- The major approaches and strategies for detecting and managing each form of drift
- Why monitoring for them is important
By the end of this article, you will understand these drift variants and how you can detect them.
Concept drift vs. data drift
The key difference between data (or feature) drift and concept drift is that data drift refers to changes in the underlying distribution of the input data used for modeling. In contrast, concept drift refers to changes in the relationships between the input features and the target variable that a model is trying to predict.
Each dataset ingested into a model for training or inference consists of input features and, for supervised learning problems, a target variable. The features are points in a multi-dimensional space, and the target variable is modeled as a function of the relationships among those feature values.
Understanding data drift
With data drift, the core relationship between these inputs and the output variable remains constant, but the characteristics of the inputs themselves evolve. This dynamism, resulting from real-world changes, can disrupt the model's performance because the weights assigned to each feature during training no longer align with the new input distributions encountered in production.
Consider this analogy: The model you train to predict weather patterns based on historical data may struggle if climate change alters these patterns significantly. Although the relationship between weather features and outcomes hasn't changed, the features themselves have.
In mathematical terms, think of it as a discrepancy between the input feature distribution during training, let's call it Ptrain(x), and the distribution during inference, Pinference(x). When Ptrain(x) differs significantly from Pinference(x), the model's predictions become less accurate. This happens because the knowledge the model encoded from the initial data distribution no longer reflects the current reality.
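Putting the two drift types side by side in this notation: data drift is the case where Ptrain(x) ≠ Pinference(x) while the relationship P(y|x) between inputs and output stays the same; concept drift, covered later in this article, is the opposite case, in which P(y|x) itself changes.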
Data drift is a subtle yet impactful form of bias in machine learning, stemming from changes in input feature distributions rather than direct alterations in the output or the relationship between input and output (as with concept drift). Recognizing and addressing data drift is crucial for maintaining the accuracy and reliability of predictive models over time.
In the next section, let’s look at some of the causes of data drift.
The causes of data drift
Shifts in the statistical properties of the data, specifically in the distribution of input features, primarily cause data drift. The deviation in feature data can occur due to various factors:
- Changes in user behavior: Shifts in consumer preferences or habits, or certain demographics becoming more active than others in the application.
- Market dynamics: New trends, competition, or economic factors influencing the data environment.
- Policy changes: Regulatory updates or company policy changes affecting data collection or user interaction.
- Changes in upstream data systems: Alterations in how data is gathered, stored, or processed upstream, for instance, introducing new categories or features to the data pipeline or database.
- Data quality issues: Poor data quality, errors, or missing data from the processing or pipeline code can also cause drift.
- System updates: Modifications in the operational systems that generate or handle data.
These factors can subtly alter the data dynamics over time, making the changes difficult to detect and control. The most significant challenge with data drift, though, is its unpredictable nature and the gradual way it can degrade the performance of a model.
Models trained on historical data may not account for these unforeseen shifts, inevitably leading to a decline in accuracy and relevance. Therefore, it is crucial to implement continuous monitoring and adaptive strategies to identify and mitigate the impacts of data drift on your production models.
Examples of data drift
- Seasonal variability:
- A retail company's model, trained on holiday season sales data, may perform poorly during off-season months due to different consumer buying patterns. Continuous monitoring and periodic model retraining can help adjust for these predictable seasonal shifts.
- Technological changes:
- A manufacturing plant's predictive maintenance model might degrade in accuracy when new sensors are installed or old ones are replaced. The new sensors produce data patterns that differ from the old ones, altering the machinery's “data signature” (the unique pattern or profile of data that characterizes normal operation) and leading to inaccurate predictions from models trained on data collected from the old sensors.
- You can mitigate this drift by implementing a system to recalibrate or retrain the model when sensor updates occur.
- Geographical expansion:
- A business expanding from its initial market into new regions may find that user behavior and demographics in these areas differ, leading to data drift. You can maintain accuracy by adapting the model to include region-specific data or developing separate models for each market.
- Cultural evolution:
- A large language model (LLM) used for sentiment analysis in online reviews may become less reliable as language evolves, including changes in slang, emojis, and idioms. Regularly fine-tuning the model on fresh data from production or using a newer, updated model, incorporating the latest language trends, can help maintain its relevancy and accuracy.
These examples underscore the importance of understanding the context and dynamics of your data. Proactive strategies, including model monitoring, observability, and maintenance, can help address the challenges posed by data drift and ensure models are updated over time to adapt to changes in the real world and avoid training-serving skew (a mismatch between the training data distribution and the production data).
How to detect data drift
Detecting data drift involves continuously monitoring the distribution of your model's input and output over time and identifying when it deviates significantly from the original training data. Here are some methods and techniques you can use to detect data drift:
Data visualization
Visual methods can provide an intuitive sense of drift:
- Histograms or density plots: Visually compare data distribution at different time points.
- Box plots: Show changes in the distribution's quartiles and can indicate drift in the median and range.
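As a quick illustration of the visual approach, here is a minimal sketch (simulated data, hypothetical feature name) that overlays histograms of a single feature from the training set and from recent production data; a visible offset between the two curves is often the first hint of drift.

```python
# Minimal sketch: overlay training vs. production histograms for one feature.
# The data and the feature name "transaction_amount" are simulated placeholders.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=50, scale=10, size=5_000)   # distribution at training time
prod_feature = rng.normal(loc=58, scale=14, size=5_000)    # shifted production distribution

plt.hist(train_feature, bins=50, alpha=0.5, density=True, label="training")
plt.hist(prod_feature, bins=50, alpha=0.5, density=True, label="production")
plt.xlabel("transaction_amount")
plt.ylabel("density")
plt.title("Feature distribution: training vs. production")
plt.legend()
plt.show()
```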
Distribution metrics
These metrics give you a sense of how much the distributions in your training and production sets differ:
- Population Stability Index (PSI): Compares the distribution of a feature in your training set to the same feature in the new data. This article goes into detail on how you can calculate and interpret it.
- Bhattacharyya distance: A measure of the similarity between two probability distributions.
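Here is a minimal sketch of one common way to compute PSI for a numeric feature; the quantile binning and the usual 0.1 / 0.25 interpretation thresholds are conventions, not fixed rules.

```python
# Minimal PSI sketch: bin edges come from the reference (training) distribution,
# and the index sums the weighted log-ratio of production vs. reference bin shares.
import numpy as np

def population_stability_index(reference, production, n_bins=10, eps=1e-6):
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf      # catch out-of-range production values

    ref_share = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_share = np.histogram(production, bins=edges)[0] / len(production)

    # Clip to avoid division by zero / log(0) in sparse bins
    ref_share = np.clip(ref_share, eps, None)
    prod_share = np.clip(prod_share, eps, None)
    return float(np.sum((prod_share - ref_share) * np.log(prod_share / ref_share)))

rng = np.random.default_rng(0)
train = rng.normal(50, 10, 5_000)
prod = rng.normal(55, 12, 5_000)               # shifted production sample
print(f"PSI = {population_stability_index(train, prod):.3f}")
# Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift
```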
Statistical tests
These tests compare changes in the distribution of your input data at inference with the one at training:
- Kolmogorov-Smirnov Test (KS Test): Compares the inference data with a reference probability distribution (the training set) to determine if they differ significantly. It's useful for detecting changes in the cumulative distribution function of continuous data.
- Chi-Square Test: Suited to categorical features; it compares the observed frequencies in each category to the frequencies you would expect if the distribution had not changed over time (both tests are sketched below).
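Here is a minimal sketch of both tests using scipy.stats on simulated data; the 0.05 significance threshold and the feature values are illustrative.

```python
# Minimal sketch: KS test for a continuous feature, chi-square test for a categorical one.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 2_000)                 # continuous feature at training time
prod = rng.normal(0.3, 1.2, 2_000)                  # shifted production values
train_cat = pd.Series(rng.choice(["A", "B", "C"], 2_000, p=[0.6, 0.3, 0.1]))
prod_cat = pd.Series(rng.choice(["A", "B", "C"], 2_000, p=[0.4, 0.4, 0.2]))

# Continuous feature: two-sample Kolmogorov-Smirnov test
ks_stat, ks_pvalue = stats.ks_2samp(train, prod)
if ks_pvalue < 0.05:
    print(f"Possible drift in continuous feature (KS statistic={ks_stat:.3f})")

# Categorical feature: chi-square test on the train vs. production frequency table
values = pd.concat([train_cat, prod_cat], ignore_index=True)
source = pd.Series(["train"] * len(train_cat) + ["prod"] * len(prod_cat))
chi2, chi_pvalue, _, _ = stats.chi2_contingency(pd.crosstab(values, source))
if chi_pvalue < 0.05:
    print(f"Possible drift in categorical feature (chi-square={chi2:.1f})")
```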
Windowing techniques
These methods involve comparing recent data to historical data:
- Fixed windowing: Compare a rolling window of recent production data against a fixed reference window (such as the training set), using a statistical test like the KS Test to measure how far the two distributions have diverged, as sketched below. Also see the Adaptive Windowing (ADWIN) technique, which adjusts the window size automatically.
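Here is a minimal fixed-window sketch, assuming the training data serves as the fixed reference window and production values are scanned in rolling batches; the window size and significance threshold are illustrative choices.

```python
# Minimal fixed-window drift check: compare each rolling batch of production data
# against the fixed reference window (the training data) with a KS test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, 5_000)              # fixed window: training data
production_stream = rng.normal(0.4, 1.0, 20_000)     # incoming production values

window_size = 1_000
for start in range(0, len(production_stream), window_size):
    window = production_stream[start:start + window_size]
    stat, pvalue = stats.ks_2samp(reference, window)
    if pvalue < 0.01:
        print(f"Drift flagged in window starting at index {start} (KS statistic={stat:.3f})")
```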
Steps to implement data drift detection
- Define baseline: Establish the data distribution when the model was trained as your baseline.
- Set thresholds: Determine what level of change or difference is significant enough to consider as drift.
- Monitor continuously: Regularly compare new data against the baseline using one or more of the above methods.
- Alert and respond: Set up a system to alert you when these thresholds are crossed and plan how to respond, which might include retraining the model, gathering more data, or adjusting the model's parameters.
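Here is a minimal sketch tying the four steps together for batch scoring, using the KS test as the drift metric (PSI or any of the other measures above could be substituted); the feature names, threshold, and alerting hook are hypothetical placeholders.

```python
# Minimal sketch of baseline -> thresholds -> monitoring -> alerting for two features.
import numpy as np
from scipy import stats

P_VALUE_THRESHOLD = 0.01                                  # step 2: set thresholds

def send_alert(feature, pvalue):
    # step 4: plug in your real alerting channel (email, Slack, pager, ...)
    print(f"[ALERT] drift suspected for '{feature}' (p={pvalue:.4f})")

rng = np.random.default_rng(2)
baseline = {                                              # step 1: distributions at training time
    "age": rng.normal(35, 8, 5_000),
    "income": rng.normal(60_000, 15_000, 5_000),
}
new_batch = {                                             # latest production batch
    "age": rng.normal(42, 10, 1_000),                     # drifted
    "income": rng.normal(60_500, 15_000, 1_000),          # roughly stable
}

for feature, reference in baseline.items():               # step 3: monitor continuously
    _, pvalue = stats.ks_2samp(reference, new_batch[feature])
    if pvalue < P_VALUE_THRESHOLD:
        send_alert(feature, pvalue)
```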
Understanding concept drift
Concept drift is the change in the statistical relationship between input features (X) and the output variable (Y) in a predictive model. In other words, even if the input features look the same, how they relate to the output changes, leading to different predictions over time. This might occur due to evolving trends, societal changes, or alterations in the system the model is trying to predict.
If we consider P(Y|X) as the probability of the output given the input, concept drift means that P(Y|X) changes over time even if P(X), the probability of the input features, remains the same. In practical terms, the rules the model learned during training no longer apply because the underlying reality or context of the use case has shifted.
To clarify the mathematical representation of concept drift for the prediction Y given X in dynamic environments:
P(Y|X) = P(X, Y) / P(X)
Where:
- P(Y|X) is the conditional probability of the output given the input, which is what changes in the presence of concept drift.
- P(X, Y) is the joint probability of observing the input features together with the output.
- P(X) is the marginal probability of observing the input, irrespective of the output.
This representation helps you understand how changes in the relationship between input and output (concept drift) affect the model's predictions. Managing concept drift is challenging because it requires continuous monitoring and updating of the model to maintain its quality and relevance to your use case.
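To make this concrete, consider a hypothetical churn model with illustrative numbers. At training time, 40% of users are on a discount plan, P(X = discount user) = 0.40, and their churn rate is P(Y = churn | X = discount user) = 0.10. Months later, the share of discount users is still 0.40, but their churn rate has risen to 0.30: P(X) is unchanged, yet P(Y|X) has tripled, so a model that learned the old conditional will systematically under-predict churn. That gap is concept drift.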
The causes of concept drift
After training a model, it operates with fixed weights (W) and biases (b) based on the training data. When the actual relationship between inputs and outputs changes in the real world (concept drift), the model's performance can degrade because it's still operating under the old rules—it's essentially out of sync with the current reality.
Many things can cause concept drift, such as:
- Changes in customer behavior or preferences: As societal trends or personal preferences evolve, how inputs relate to outputs can change. For instance, a new sales campaign with discounts could increase conversion rates.
- Seasonal variations: There could be cyclic and seasonal variations that are not covered in the training data.
- Economic shifts or market dynamics: New regulations, competitive products, or economic conditions can alter the dynamics the model is trying to predict.
- Technological advances: Changes in the technology stack might change workflows or data collection, affecting the relationship between inputs and outputs.
Let’s look at some examples of concept drift to aid you in spotting them.
Examples of concept drift
- Marketing impact on customer behavior:
- A new marketing campaign significantly alters customer engagement and retention strategies after a model is trained to predict customer churn. As a result, the previous churn patterns no longer apply, requiring the model to be updated to reflect the new customer behavior dynamics.
- Economic influence on credit scoring:
- A model predicting loan defaults based on credit scores might become less accurate if economic conditions change the implications of certain credit scores. For instance, a previously reliable credit score might no longer accurately predict default risk during an economic downturn.
- Technological and environmental changes in equipment monitoring:
- In predictive maintenance, a model's understanding of the relationship between sensor readings and equipment failure might degrade due to wear and tear on equipment, updates or changes in sensor technology, or operating conditions.
- Medical advancements and seasonality in healthcare models:
- A language model used in healthcare might become outdated when new medications are introduced or if diseases exhibit seasonal variations in prevalence or treatment effectiveness. This can lead to the model making less accurate or outdated recommendations, highlighting the need for continuous updates to the knowledge base and model parameters.
Detecting concept drift
In concept drift, the input distribution may look the same while its relationship to the output changes. Because that relationship is what drives the model's predictions, it's essential to continuously monitor for performance changes or prediction errors caused by shifts in the underlying data relationships. Concept drift can be detected in several ways, including:
Performance monitoring
Regularly track the performance of your model using relevant metrics. If you have ground truth labels, useful metrics include:
- Accuracy, precision, recall, and F1-score: Monitor these metrics over time. A significant drop might indicate concept drift.
- Confusion matrix analysis: Changes in the confusion matrix pattern over time can signal concept drift.
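As an example, here is a minimal sketch that tracks precision, recall, and F1 over weekly batches of labeled production data and flags a drop relative to a validation baseline; the baseline value, tolerance, and simulated batches are hypothetical.

```python
# Minimal sketch: compare weekly production metrics against a validation-time baseline.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

BASELINE_F1 = 0.88            # F1 measured on the held-out validation set
MAX_RELATIVE_DROP = 0.10      # alert if F1 falls more than 10% below that baseline

rng = np.random.default_rng(3)
for week in range(1, 5):
    # Labels and predictions collected this week (simulated with a growing error rate)
    y_true = rng.integers(0, 2, 1_000)
    y_pred = np.where(rng.random(1_000) < 0.10 + 0.05 * week, 1 - y_true, y_true)

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary"
    )
    print(f"week {week}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
    if f1 < BASELINE_F1 * (1 - MAX_RELATIVE_DROP):
        print(f"[ALERT] F1 below tolerance in week {week}; possible concept drift")
```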
Change detection algorithms
Specific algorithms are designed to detect changes in data streams, especially if there are no ground truths.
- Drift Detection Method (DDM): Monitors the error rate and its standard deviation to detect increases over time.
- Early Drift Detection Method (EDDM): An extension of DDM that focuses on the distribution of distances between errors to detect drift earlier.
- ADaptive WINdowing (ADWIN): An algorithm that automatically adjusts the window size and cuts it in case of detected change.
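To show the core idea behind these detectors, here is a simplified, from-scratch sketch in the spirit of DDM (not the reference implementation; streaming libraries such as river ship production-ready versions of these detectors). It monitors the running error rate plus its standard deviation and raises a warning or a drift signal when that quantity rises two or three standard deviations above the lowest level seen so far.

```python
# Simplified DDM-style detector: feed it 1 for each wrong prediction, 0 for each correct one.
import math

class SimpleDDM:
    def __init__(self, warmup=30):
        self.n = 0
        self.p = 0.0                    # running error rate
        self.p_min = float("inf")       # error rate at the best point seen so far
        self.s_min = float("inf")       # its standard deviation
        self.warmup = warmup

    def update(self, error: int) -> str:
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1 - self.p) / self.n)
        if self.n < self.warmup:
            return "ok"
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s >= self.p_min + 3 * self.s_min:
            return "drift"
        if self.p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "ok"

# Usage: ~10% error rate for 500 predictions, then ~50% after the change point
detector = SimpleDDM()
error_stream = ([0] * 9 + [1]) * 50 + [0, 1] * 250
for i, err in enumerate(error_stream):
    if detector.update(err) == "drift":
        print(f"Drift signalled at observation {i} (change point was at 500)")
        break
```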
Error rate analysis
Track the error rates of predictions:
- Moving average error: Use a windowed moving average of the model's error rate. A significant increase could indicate drift.
- Page-Hinkley Test: A sequential change-detection test applied to the residuals of a model (the difference between the predicted values and actual values) or any other metric that reflects the model's performance. As the relationships between variables change, these residuals or performance metrics will shift, changing their mean value.
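Here is a similarly simplified sketch of the Page-Hinkley idea applied to a stream of residuals; the delta tolerance, lambda threshold, and simulated residuals are illustrative. The detector flags a change when the cumulative deviation rises far above its running minimum.

```python
# Simplified Page-Hinkley sketch for detecting an upward shift in the mean of a stream.
import random

class SimplePageHinkley:
    def __init__(self, delta=0.01, lambda_=20.0):
        self.delta = delta        # tolerance for normal fluctuations
        self.lambda_ = lambda_    # alarm threshold
        self.n = 0
        self.mean = 0.0           # running mean of the monitored value
        self.cum = 0.0            # cumulative deviation m_t
        self.cum_min = 0.0        # minimum of m_t seen so far

    def update(self, value: float) -> bool:
        self.n += 1
        self.mean += (value - self.mean) / self.n
        self.cum += value - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.lambda_    # True = alarm

# Usage: residuals hover around 1.0, then the error level roughly doubles
random.seed(1)
ph = SimplePageHinkley()
residuals = [abs(random.gauss(1.0, 0.2)) for _ in range(1_000)] + \
            [abs(random.gauss(2.0, 0.2)) for _ in range(1_000)]
for i, r in enumerate(residuals):
    if ph.update(r):
        print(f"Page-Hinkley alarm at observation {i} (change point was at 1000)")
        break
```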
Data distribution analysis
Although concept drift focuses on relationship changes, analyzing data distributions can provide indirect insights:
- Two-sample tests: Use statistical tests to compare distributions of residuals or prediction errors over time.
- Feature importance tracking: Monitor changes in feature importance over time, which might indicate that the relationship between features and the target is changing.
Steps to implement concept drift detection
- Choose appropriate methods: Select one or more methods suited to your model, data, and operational constraints.
- Establish baseline behavior: Understand and document the normal behavior or distribution of the chosen indicators during a period where the model performs well (if there is no ground truth) or evaluate the model on new data and track the chosen metrics (when there is ground truth).
- Continuous monitoring: Regularly monitor the indicators for significant deviations from the baseline.
- Set alerting thresholds: Define what constitutes a significant change and set up alerting mechanisms.
- Update mechanism: Have a strategy for responding to detected drift, such as model retraining or recalibration.
Why is monitoring for drift important?
For businesses to make informed decisions, they must ensure that ML models remain accurate and reliable over time. Monitoring for both types of drift is essential and should be a crucial part of your post-deployment strategy.
Improved model performance
Monitoring for drift allows you to detect and correct it promptly, keeping your models up-to-date and accurate. Based on the insights from monitoring, you can update your models to better capture changes in the production data or in the underlying relationships between variables, leading to more accurate predictions.
Avoiding bias
Drift can often introduce bias into a model, where particular groups may be unfairly disadvantaged. Monitoring for drift allows you to detect and correct such bias, helping models produce fairer, more ethical outcomes.
Compliance
In industries like finance and healthcare, staying compliant with changing regulations often requires models to reflect current realities accurately. Monitoring for drift is not just good practice; it's often a regulatory necessity.
Model transparency and reliability
Keeping an eye on drift might help you understand the reasoning behind a model's predictions. In doing so, you can better explain the logic behind the model and build user trust. Stakeholders will be more inclined to trust the model's predictions when they are transparent, accountable, and subject to continuous monitoring.
Implementing a machine learning monitoring system
WhyLabs provides a comprehensive set of monitoring and alerting features that can help detect drift proactively, prevent model failures, and improve model performance. This includes recommendations for retraining your models on new data, adjusting model parameters, and refining or updating features to improve performance.
Schedule a call with us to learn how WhyLabs can help detect, identify, and fix drift in your ML models, or sign up for a free account to start monitoring your models right away.
Data drift vs. concept drift: key takeaways
Understanding and detecting drift promptly ensures that your models remain robust and reliable in production. Among the various existing methods for drift detection, the choice depends on specific aspects of your use case, including the nature of the data, the anticipated type of drift, computational constraints, and whether you're working in real-time or batch-processing use cases.
Navigating data and concept drift is a joint effort between data scientists, domain experts, and the dynamic machine learning environment itself, and research in this area continues to expand. The ultimate aim is an adaptable ecosystem in which data pipelines, infrastructure, and decision-making processes are interconnected: adaptive systems detect drift, trigger automated responses, and adjust the entire system to maintain model integrity and operational continuity.