How to Distinguish User Behavior and Data Drift in LLMs
- LLMs
- Generative AI
May 7, 2024
Large language models (LLMs) rarely provide consistent responses for the same prompts over time. This could be due to changes in the model’s performance, but it can also result from changes in user behavior, system behavior, or phenomena in the real world.
When you find changes in your production data, you might ask yourself the following:
Were the changes caused by user behavior?
Were the changes caused by system behavior?
Were the changes due to fundamental changes in how things relate to one another in the real world?
Were the changes reflective of a difference in data quality?
Scenario Overview
Let’s use a few scenarios to demonstrate how these issues may present themselves and how to monitor them. For all four scenarios below, let’s assume we work at an international online clothing retailer that sells adult clothing. The machine learning and data science team seeks to recommend the one clothing type with the highest sales potential for each customer in the database. We’ll assume we have the following data:
Inputs (X): Customer information (demographics, web page impressions, past purchase history)
Outputs (Y): Model prediction of preferred clothing type
Scenario A: User behavior change
Let’s say our business is adding new customers and opening up to a new customer segment: children and teens. In this situation, the distribution of our customer input data is certainly changing – children and teens have different characteristics and web behaviors than most adults. This is a clear example of covariate drift, or a change in the input distribution (X).
Detecting changes in inputs can be fairly simple. Depending on what data is being collected, we’ll see shifts in customer age, the types of clothing being viewed, more page views during the back-to-school season in the fall, and so on.
But notice that while the set of possible clothing types (Y) doesn’t change, the distribution across them may indeed look different after the input data drifts. This is what makes distinguishing user behavior drift from other types of change difficult without strong monitoring tools. We’ve built tools in the WhyLabs Platform to distinguish this tricky drift even in large and complex datasets, for example by viewing how changes in output features correlate over time with changes in various input features.
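As a minimal sketch of this kind of input monitoring, the snippet below compares a production window of a single feature (customer age) against a baseline window using a two-sample Kolmogorov–Smirnov test from SciPy. The synthetic data, feature choice, and alert threshold are illustrative assumptions, not part of the retailer’s actual pipeline.

```python
import numpy as np
from scipy.stats import ks_2samp

# Baseline window: ages observed before opening up to the new segment (synthetic, illustrative)
baseline_ages = np.random.default_rng(0).normal(loc=38, scale=9, size=5_000)

# Current window: now includes children and teens, pulling the distribution toward younger ages
current_ages = np.concatenate([
    np.random.default_rng(1).normal(loc=38, scale=9, size=4_000),
    np.random.default_rng(2).normal(loc=14, scale=3, size=1_000),
])

# Two-sample KS test: a small p-value means the two windows are unlikely
# to come from the same distribution
result = ks_2samp(baseline_ages, current_ages)

P_VALUE_THRESHOLD = 0.01  # alert threshold, an arbitrary choice for this sketch
if result.pvalue < P_VALUE_THRESHOLD:
    print(f"Covariate drift detected on 'age' (KS={result.statistic:.3f}, p={result.pvalue:.2e})")
else:
    print("No significant drift detected on 'age'")
```

In practice, the same comparison would run per feature and per time window, with the distance metric and threshold tuned to each feature rather than fixed globally.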
Scenario B: System behavior change
Let’s say that instead of a new customer segment, there is a change in how we categorize clothing items at the company. In this scenario, the customer information (inputs) is the same, the model (relationship between inputs and outputs) is the same, and only the categories (outputs) differ. The technical term for this phenomenon in data is prior probability shift, also known as label shift.
Looking at the graphs above, it may seem that detecting and identifying system behavior change is trivial: the only data distribution that changed is our output distribution. In our particular scenario, we’d see that our model is now suggesting “pants” as a clothing category for the exact input data that previously produced “bottoms”. But there are underlying questions about the relationship and relative distribution of X and Y that are important to answer.
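One lightweight way to surface an output-only shift like this is to compare how often each clothing category is predicted in a current window against a baseline window, for example with a chi-square goodness-of-fit test. The category names and counts below are made up for illustration.

```python
import numpy as np
from scipy.stats import chisquare

categories = ["tops", "bottoms", "pants", "outerwear", "accessories"]

# Predicted-label counts per window (illustrative numbers)
baseline_counts = np.array([400, 350, 0, 150, 100])   # "pants" was never predicted before
current_counts = np.array([410, 10, 340, 140, 100])   # "bottoms" largely replaced by "pants"

# Add-one smoothing so categories unseen in the baseline still get a nonzero
# expected count, then scale to the current window's total
smoothed = baseline_counts + 1
expected = smoothed / smoothed.sum() * current_counts.sum()

result = chisquare(f_obs=current_counts, f_exp=expected)
print(f"chi-square={result.statistic:.1f}, p={result.pvalue:.2e}")
# A tiny p-value indicates the output (label) distribution has shifted,
# even if the input distribution is unchanged.
```

Answering the deeper questions about X and Y, such as whether the inputs that used to produce “bottoms” now consistently produce “pants”, requires looking at the input and output distributions together rather than either one alone.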
Scenario C: Change in predictive model
For our third scenario, let’s imagine that the endpoint connected to our predictive model was unknowingly changed to point to an old model version. Alternatively, our organization may have been relying on a third-party machine learning solution that was retrained or improved in some way. The data is said to have undergone concept drift, or a change in the relationship between the inputs and outputs.
When analyzing data with the same input distribution, we’ll see a different output distribution. The graphs this produces might look familiar because they can be identical to those in Scenario B! But realistically, the input distribution can change alongside the model due to changes in the features selected, data transformations, and dimensionality reduction steps. In this more complex scenario, we can see changes in both the inputs (positions) and the outputs (colors).
Changes in predictive model performance are further diagnosable using model evaluation metrics. Accuracy, precision, recall, F1 score, squared error, and discounted cumulative gain are common examples, among many others. These values can be calculated quickly and privately by pairing the model’s predictions with ground truth labels – the values we consider to be the correct predictions.
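As a rough sketch of that pairing, the snippet below joins model predictions with ground truth labels and computes a few of the classification metrics mentioned above with scikit-learn; the label values are illustrative.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Ground-truth preferred clothing type vs. the model's prediction,
# joined upstream on a customer or request ID (values are illustrative)
y_true = ["tops", "bottoms", "outerwear", "tops", "accessories", "bottoms"]
y_pred = ["tops", "pants", "outerwear", "tops", "tops", "pants"]

print("accuracy :", accuracy_score(y_true, y_pred))
# Macro averaging weights every clothing category equally, which helps when
# some categories are rare
print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("f1       :", f1_score(y_true, y_pred, average="macro", zero_division=0))
```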
Paired with ground truth in this way, the WhyLabs Platform makes it quick to identify and diagnose model performance drift.
Scenario D: Change in underlying relationships
Finally, let’s consider what happens when fundamental relationships in the real world change, not just the model. We experienced this scenario starting in 2020: a global pandemic forced many people to shelter in place and caused a fundamental change in shopping behaviors and needs. Suddenly, less office and vacation attire was needed, while comfortable home attire was preferred.
Without additional context, the type of data drift in this scenario (concept drift) is identical to that in Scenario C above, but it is caused by real-world differences instead of model differences. Detecting such a change requires freshly labeled data.
Obtaining new labels can be expensive and time-consuming, but only a small, representative sample of data needs to be relabeled. Effective monitoring solutions for production scenarios must allow you to send partially labeled data and associate those labels with different time periods.
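As a sketch of how a small relabeled sample can be used, the snippet below scores predictions against fresh labels per monthly window with pandas. The timestamps, predictions, and labels are hypothetical; the signal of interest is a sustained drop in accuracy after the change point.

```python
import pandas as pd

# Hypothetical relabeled sample: each row pairs a model prediction with a
# freshly assigned ground-truth label and the time the prediction was served
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-10", "2020-01-24", "2020-02-07",
        "2020-03-13", "2020-03-27", "2020-04-10",
    ]),
    "prediction": ["office", "office", "vacation", "office", "office", "vacation"],
    "label":      ["office", "office", "vacation", "loungewear", "office", "loungewear"],
})

# Accuracy per monthly window; a sustained drop after March 2020 points to
# concept drift in the real world rather than a change in the inputs alone
monthly_accuracy = (
    sample.assign(correct=sample["prediction"] == sample["label"])
          .groupby(sample["timestamp"].dt.to_period("M"))["correct"]
          .mean()
)
print(monthly_accuracy)
```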
Wrap up
There are several different causes of changes in production data, each with its own underlying statistical characteristics. Monitoring solutions built to identify and distinguish between changes in users, systems, models, and the outside world rely on many different tools to do so.
To start exploring this in your data, get started with a free WhyLabs account or schedule a call with us to learn more!