Best Practices for Monitoring and Securing RAG Systems in Production
- Retrieval-Augmented Generation (RAG)
- LLM Security
- Generative AI
- ML Monitoring
- LangKit
Oct 8, 2024
TL;DR
- Efficient system design is crucial for RAG applications to ensure scalability and manage costs. This involves implementing optimized storage and retrieval solutions like vector databases and developing caching strategies to reduce retrieval times and system load.
- Maintaining data quality is essential for RAG systems to deliver accurate and relevant responses. Key activities include regularly cleansing the knowledge base to remove outdated content, enriching it with up-to-date information, and collecting and validating new data to enable future improvements.
- Real-time reranking of retrieved documents based on relevance to the user's query can significantly improve the RAG system's performance. The system prioritizes and delivers the most pertinent information to users by extracting features, applying a reranking model, and reordering results.
- Tracking key metrics like context adherence, completeness, and resource utilization using tools like LangKit is critical for maintaining a high-performing RAG system.
- Proactive monitoring with tools like the WhyLabs Control Center lets you step in at the right time, setting limits to make sure AI and data-driven decisions are made responsibly for ongoing development and performance fine-tuning.
Introduction
Retrieval-augmented generation (RAG) systems combine advanced retrieval techniques with large language models (LLMs) to improve the responses they generate. These systems retrieve relevant information from external knowledge sources and use it to augment the model's input, producing accurate, context-rich answers. This makes them valuable for LLM applications that require up-to-date information.
However, to keep the application performing at its best in production, you must manage the complexity of the RAG architecture and its retrieval and generative components. Typical challenges include maintaining data freshness, mitigating bias, monitoring for model drift, and adapting the system to evolving user needs.
Without continuous monitoring and maintenance, a RAG system can become outdated, limiting its usefulness and producing wrong or misleading results.
The previous article showed how to evaluate your RAG system before deployment. This article will teach you the core aspects of monitoring and maintaining a RAG application after deployment.
You will learn:
- Strategies for maintaining data quality
- How to monitor key metrics for insights on functional and operational performance with WhyLabs
- How to continually maintain your production RAG systems based on key monitoring insights.
Ready? Let’s jump right in! 🚀
Deployment must-haves for RAG systems
In the previous article, we looked at how RAG systems work. Here’s a high-level recap of the process:
- User query: A user sends a natural-language prompt or question to the system.
- Embedding: The system encodes the user query (typically with an embedding model) into a high-dimensional vector, capturing its semantic meaning.
- Vector search: This vector is compared against other vectors, typically representing documents, passages, or specific knowledge pieces stored in the vector database.
- Retrieval: The system identifies and retrieves the most relevant data based on vector similarity calculations.
- Response generation: The LLM incorporates this retrieved information into its response generation process, blending its internal, trained knowledge with the external data.
This integration ensures the LLM produces more accurate and contextually relevant responses.
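To make that flow concrete, here is a minimal, framework-agnostic sketch; `embed_query`, `vector_db`, and `llm` are hypothetical stand-ins for your embedding model, vector store, and LLM client:

def answer_with_rag(query: str, embed_query, vector_db, llm, top_k: int = 4) -> str:
    # 1. Embedding: encode the user query into a vector
    query_vector = embed_query(query)

    # 2-3. Vector search + retrieval: fetch the most similar chunks from the knowledge base
    retrieved_chunks = vector_db.search(query_vector, k=top_k)

    # 4. Response generation: blend retrieved context with the model's trained knowledge
    context = "\n\n".join(chunk.text for chunk in retrieved_chunks)
    prompt = f"Answer the question using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm.generate(prompt)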
The lifecycle of a generative AI (GenAI) application spans several key phases:
- Development
- Deployment
- Ongoing operations and maintenance
- Continuous improvement
While these phases share similarities with the traditional ML lifecycle, GenAI applications require tools specifically designed to evaluate the usefulness of their generative outputs.
During the development phase, the focus is on rapid feedback for faster and safer iterations. However, post-deployment operations shift toward serving real-time user needs, necessitating robust monitoring, automated alerting, and the implementation of guardrails.
These safeguards enable swift responses to potential issues, ensuring the application remains reliable and effective. By monitoring the right metrics, you can make data-driven decisions to continually improve the application.
Data collection and quality assurance
The effectiveness of a RAG system hinges on the quality and relevance of its underlying data. Ensuring high standards involves:
- Data cleansing and updates: To maintain accuracy and relevance, regularly remove outdated or irrelevant content and enrich the knowledge base with up-to-date information.
- New information collection: Implement automated pipelines or collect human feedback to gather new data for potential fine-tuning. Validate this data before incorporating it into the system (a simplified refresh sketch follows this list).
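Here is a simplified sketch of what such a refresh step might look like; the document fields (`text`, `source`, `last_updated`) and the freshness and length thresholds are illustrative assumptions, not a prescribed schema:

from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=180)   # treat older documents as stale (illustrative)
MIN_LENGTH = 50                 # reject near-empty or truncated documents (illustrative)

def refresh_knowledge_base(existing_docs: list[dict], new_docs: list[dict]) -> list[dict]:
    """Drop stale documents and validate new ones before re-embedding and indexing."""
    now = datetime.now(timezone.utc)

    # Data cleansing: keep only documents whose timezone-aware last_updated is still fresh
    fresh = [d for d in existing_docs if now - d["last_updated"] <= MAX_AGE]

    # New information collection: basic validation before ingestion
    validated = [d for d in new_docs if len(d["text"]) >= MIN_LENGTH and d.get("source")]

    return fresh + validated  # re-embed and index this set downstream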
Scalability and cost optimization: maintaining efficiency
Efficient system design is crucial for managing costs and ensuring scalability in RAG systems. Here are key considerations:
- Efficient data solutions: Use vector databases optimized for specific query types to enable rapid access and scalability.
- Caching: Use caching strategies to temporarily store frequently requested data, reducing retrieval times and system load (a minimal caching sketch follows this list).
- Token tracking: Monitor token counts and LLM usage to ensure you keep costs under control. A guardrails policy for LLM costs can be beneficial.
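Here is a minimal sketch of the caching idea from the list above; `embed_fn` and `search_fn` are hypothetical hooks for your embedding model and vector store:

from functools import lru_cache
from typing import Callable, Sequence

class CachedRetriever:
    """Cache retrieval results so repeated queries skip the vector search."""

    def __init__(self, embed_fn: Callable, search_fn: Callable, top_k: int = 4):
        self.embed_fn = embed_fn      # hypothetical: text -> embedding vector
        self.search_fn = search_fn    # hypothetical: (vector, k) -> list of chunk texts
        self.top_k = top_k
        # Per-instance in-memory cache keyed by the normalized query string
        self._cached = lru_cache(maxsize=1024)(self._retrieve)

    def _retrieve(self, normalized_query: str) -> tuple:
        vector = self.embed_fn(normalized_query)
        return tuple(self.search_fn(vector, self.top_k))

    def retrieve(self, query: str) -> Sequence[str]:
        # Light normalization improves cache hit rates for trivially different queries
        return self._cached(query.strip().lower())

An in-process `lru_cache` is the simplest starting point; a shared cache such as Redis is a common next step when several application instances serve traffic.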
Evaluation and Monitoring
Monitoring your RAG system is crucial for maintaining its optimal performance and delivering accurate, contextually enriched responses. You can identify areas where the system struggles and gain insights into where it needs development by regularly evaluating it.
Furthermore, monitoring helps detect issues in real-time to prevent inappropriate prompts and responses, such as those used to jailbreak an LLM system or reveal personal information.
Comparing your data with user inputs and the LLM’s output can help measure the following dimensions of your RAG system:
- Context adherence (precision): Measures how faithfully responses are grounded in the retrieved context, yielding relevant and contextually appropriate answers.
- Completeness (recall): Evaluates the thoroughness of responses in covering all aspects of the query for comprehensive and detailed answers.
- Chunk attribution: Identifies which data chunks (segments or pieces of information from your knowledge base) were used to generate a response.
- Chunk utilization: Measures how effectively the retrieved chunks are used in the final response.
Monitoring specific metrics provides deeper insights into the system's performance across these dimensions, allowing for timely interventions and optimizations.
You can easily record and log metrics from your RAG-enabled LLM application using tools like LangKit (see previous article), which you can then send to the WhyLabs platform for monitoring.
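As a rough sketch of that workflow (assuming a recent LangKit and whylogs; the `llm_metrics` module and the `writer("whylabs")` call may differ across versions), profiling a single prompt/response pair could look like this:

import whylogs as why
from langkit import llm_metrics  # bundles LangKit's language metrics into a whylogs schema

# Profile a prompt/response pair and send it to WhyLabs. Assumes the
# WHYLABS_API_KEY, WHYLABS_DEFAULT_ORG_ID, and WHYLABS_DEFAULT_DATASET_ID
# environment variables are set (see the walkthrough below).
schema = llm_metrics.init()
results = why.log(
    {"prompt": "What does your refund policy cover?",
     "response": "Refunds are available within 30 days of purchase."},
    schema=schema,
)
results.writer("whylabs").write()

Each profile that lands in WhyLabs becomes a data point you can attach monitors and alerts to.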
The following are the most crucial metrics for measuring and monitoring your RAG system, and how LangKit can assist:
Toxicity:
- Monitoring toxicity in RAG systems is vital for user safety, brand reputation, and ethical compliance. It helps identify and address harmful language, ensuring the system operates safely and aligns with organizational values.
- Recommended WhyLabs monitor:
- The prompt monitor (stddev) compares the user prompt to a trailing window baseline to detect prompts with high toxicity scores.
- The response monitor (stddev) compares the RAG response to a trailing window baseline to detect responses with high toxicity scores.
Sentiment:
- In RAG systems, monitoring sentiment is critical for understanding how users perceive responses and for improving engagement. By grouping responses by sentiment, you can identify areas where the system produces neutral or negative responses and use those insights to improve its training data and algorithms so it generates more positive, engaging responses.
- Recommended WhyLabs monitor:
- You can configure sentiment monitors in WhyLabs to track changes based on statistical values (mean, median, p95, etc.).
Text quality and relevance:
- Monitoring text quality and relevance ensures RAG systems deliver high-quality, informative responses. Evaluate output quality, readability, relevance, and appropriateness to identify areas for improvement and align with user expectations.
- Recommended WhyLabs monitor:
- A response monitor (stddev) detects RAG responses with low relevance to the user’s query.
- WhyLabs makes it simple to configure text quality and relevance metrics to track changes based on statistical values (mean, median, p95, etc.).
PII (personally identifiable information):
- Monitoring PII in RAG systems is crucial for protecting user privacy and complying with data privacy regulations. By identifying and redacting any PII that appears in prompts or responses, you keep sensitive information out of the wrong hands, reduce legal and reputational risk, and build user trust.
- Recommended WhyLabs monitor:
- WhyLabs can detect and flag or block PII in both prompts and responses through WhyLabs’ customer experience and misuse policy rulesets.
Jailbreaks, prompt injections, hallucinations, and refusals:
- Monitoring for jailbreaks, prompt injections, hallucinations, and refusals is crucial for safeguarding the security and integrity of RAG systems. Malicious actors can use these techniques to manipulate the system's outputs, potentially leading to harmful or misleading information. By detecting and mitigating these risks, you can protect user trust and stop malicious intent from exploiting the system.
- Recommended WhyLabs monitor:
- WhyLabs tracks metrics that enable automated monitoring for jailbreaks, prompt injections, hallucinations, and refusals.
LangKit gives you access to several modules you can use to track the security health of your RAG system:
Guardrails
Implement real-time guardrails to protect your production LLM-based applications. These safeguards use prompt and response data to flag or block harmful or inappropriate interactions, ensuring adherence to responsible AI practices.
The WhyLabs AI Control Center is a platform that tracks changes in AI behavior patterns. Through WhyLabs Secure, it helps you set up real-time guardrails that steer the application toward safe behavior.
WhyLabs Secure is a comprehensive offering that enhances the security and reliability of AI applications. It helps extract metrics, detect issues, and measure their severity at low latency.
Key features include the WhyLabs Secure policy, which defines the rules and actions that apply to your LLM applications, and the Guardrails API, a RESTful API that allows you to interact with the WhyLabs Guardrails service.
WhyLabs Secure Policies are stored in a policy document in YAML or JSON format. WhyLabs centrally manages and versions the policy, which the WhyLabs Guardrails deployment uses to enforce the rules and actions.
By using these features, organizations can enhance security, improve reliability, increase compliance, and gain greater control over their AI applications.
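To illustrate the general pattern rather than the WhyLabs Guardrails API itself, here is a generic sketch of a guardrail layer wrapping an LLM call; `llm_fn`, `score_injection`, and `score_toxicity` are hypothetical hooks for your model and whatever scoring service your policy defines:

BLOCKED_MESSAGE = "Sorry, I can't help with that request."

def guarded_generate(prompt: str, llm_fn, score_injection, score_toxicity,
                     injection_threshold: float = 0.8, toxicity_threshold: float = 0.8) -> str:
    # Check the prompt before it reaches the model (jailbreaks, prompt injections)
    if score_injection(prompt) > injection_threshold:
        return BLOCKED_MESSAGE

    response = llm_fn(prompt)

    # Check the response before it reaches the user (toxic or inappropriate output)
    if score_toxicity(response) > toxicity_threshold:
        return BLOCKED_MESSAGE
    return response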
System resources
Monitoring system resources is also crucial for ensuring optimal operational performance and identifying bottlenecks in the overall RAG system.
Here are the operational metrics you should consider monitoring post-deployment:
- Resource utilization: Track CPU, memory, disk, and network usage to ensure optimal resource allocation.
- Latency: Measure the system's response time to ensure timely and responsive interactions.
- Error rates: Track the frequency and types of errors to identify and resolve issues that may impact the user experience or data integrity (a simple tracking sketch follows this list).
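As referenced above, a simple sketch of tracking latency and error rates around an LLM call might look like this; in production these counters would feed your metrics pipeline rather than a module-level dictionary, and `llm_fn` is a hypothetical callable wrapping your model:

import time
from typing import Optional

stats = {"requests": 0, "errors": 0, "latencies_ms": []}

def timed_generate(prompt: str, llm_fn) -> Optional[str]:
    # Count every request; error rate = stats["errors"] / stats["requests"]
    stats["requests"] += 1
    start = time.perf_counter()
    try:
        return llm_fn(prompt)
    except Exception:
        stats["errors"] += 1
        return None
    finally:
        # Record latency for this call in milliseconds
        stats["latencies_ms"].append((time.perf_counter() - start) * 1000)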
By systematically tracking and analyzing these metrics and implementing robust safeguards, you can maintain a high-performing RAG system that is reliable, fair, and responsive to user needs.
Walkthrough: monitoring prompt-response sentiment of a LangChain RAG system with LangKit and WhyLabs
Overview
In this walkthrough, we’ll guide you through monitoring your LangChain-based RAG application using LangKit for out-of-the-box language metrics and WhyLabs for real-time production monitoring.
By the end of this guide, you will be able to track the sentiment of user prompts and the responses generated by your application, helping you ensure that your LLM interactions meet user expectations.
Let’s walk through the process of installing the necessary tools, configuring credentials, generating responses, and finally monitoring sentiment metrics in WhyLabs.
Step 1: install LangKit and LangChain
First, install the necessary libraries to generate language metrics and work with your LangChain RAG system. Use the following commands in your notebook environment (or the equivalent pip commands in a terminal):
%pip install langkit[all]==0.0.2
%pip install langchain==0.0.205
Step 2: set OpenAI and WhyLabs credentials
To send LangKit profiles to WhyLabs, you will need the following:
- API token
- Organization ID
- Dataset ID (or model-id)
Getting credentials
- Go to WhyLabs Free Account and sign up.
- Create a new project and note its ID (if it's a model project, it will look like model-xxxx).
- Create an API token from the "Access Tokens" tab.
- Copy your organization ID from the same "Access Tokens" tab.
- Obtain your OpenAI API key from your OpenAI account.
Note: Your API keys are sensitive information, so avoid sharing them publicly.
Setting up credentials in your code
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
os.environ["WHYLABS_DEFAULT_ORG_ID"] = getpass.getpass("Enter your WhyLabs org ID: ")
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = getpass.getpass("Enter your WhyLabs model ID: ")
os.environ["WHYLABS_API_KEY"] = getpass.getpass("Enter your WhyLabs API key: ")
Step 3: Import necessary modules
Import the required modules from LangChain, LangKit, and WhyLabs to handle callbacks, generate responses, and collect sentiment metrics.
from langchain.callbacks import WhyLabsCallbackHandler
from langchain.llms import OpenAI
# Import additional language metrics
import langkit.sentiment
import langkit.topics
Step 4: initialize the WhyLabs callback and GPT model
To start monitoring your LangChain RAG system, initialize the WhyLabs callback handler and the OpenAI GPT-3.5 model. This will allow us to track the prompts and responses in real time and push metrics to WhyLabs.
# Initialize WhyLabs Callback & GPT model with LangChain
whylabs = WhyLabsCallbackHandler.from_params()
llm = OpenAI(temperature=0, callbacks=[whylabs], model_name="gpt-3.5-turbo")
Step 5: generate responses and log sentiment metrics
Let’s now generate responses based on some example prompts and log the metrics to WhyLabs. The rolling logger for whylogs will write profiles every five minutes or when `.flush()` or `.close()` is called.
This example uses positive prompts to demonstrate how LangKit tracks sentiment.
# Generate responses to positive prompts
# Use a loop to iterate over the prompts and generate responses individually
for prompt in [
    "I love nature, it's beautiful and amazing!",
    "This product is awesome. I really enjoy it.",
    "Chatting with you has been a great experience! You're very helpful.",
]:
    result = llm.generate([prompt])
    print(result)
# Close WhyLabs session to push profiles
whylabs.close()
Step 6: view sentiment metrics in WhyLabs
Once the WhyLabs session is over, the platform automatically receives the sentiment profiles. Navigate to the Profiles tab in your WhyLabs dashboard, and click on "View Details" under the `prompt.sentiment_nltk` metric.
Here, you’ll see a distribution of sentiment scores for the prompts and generated responses:
In this example, you should see a high percentage of positive sentiment scores, with most prompts registering above 90% (0.9). Now, let's experiment by generating responses to negative prompts to observe how sentiment changes.
Step 7: generate responses to negative prompts
Run the following code to generate responses to negative prompts and track the changes in sentiment scores.
# Generate responses to negative prompts
for prompt in [
    "I hate nature, it's ugly.",
    "This product is bad. I hate it.",
    "Chatting with you has been a terrible experience!",
    "I'm terrible at saving money, can you give me advice?",
]:
    result = llm.generate([prompt])
    print(result)
# Close WhyLabs session to push profiles
whylabs.close()
After running the above code, go back to the sentiment metrics in WhyLabs. You should now see a shift in sentiment scores, with some prompts registering much lower scores (e.g., around 40%, or 0.4, and below).
To learn more about language metrics for prompts and responses, click on the "View Insights" button.
Step 8: configure monitors and alerts for sentiment changes
You can set up monitors to notify you when sentiment values change. WhyLabs provides both preset and custom monitors to track different metrics, ensuring you're alerted about significant shifts in sentiment or any other monitored aspect of your RAG system.
- Go to the "Monitor Manager" tab: Here, you can configure sentiment change alerts.
- Select a preset monitor or create a custom one: Set your threshold values for automatic notifications.
For more detailed guidance on configuring notifications and actions, refer to the WhyLabs documentation.
By using LangKit and WhyLabs, you gain valuable insights into how your LangChain-based RAG system performs in real-world scenarios. Monitoring metrics such as sentiment provides actionable data to ensure your application remains responsive, reliable, and aligned with user expectations.
Continually maintaining the RAG system
After deploying your RAG system, you continue to improve and fine-tune it. Post-deployment maintenance involves analyzing collected data and insights to make necessary adjustments and enhancements.
This stage is critical for fine-tuning LLMs with new data and updating retrieval mechanisms to ensure your RAG system remains effective, relevant, and adaptable to evolving user needs.
Monitoring as a catalyst for improvement
Continuous monitoring plays a pivotal role in this ongoing development process. Monitoring the key performance indicators, such as text relevance, context adherence, and completeness, can provide valuable insights into the following:
- Areas for improvement: Identify specific aspects of your system that require attention, such as improving retrieval accuracy or addressing response biases.
- Optimal timing for updates: Determine the right moments to fine-tune your LLM, update your knowledge base, or adjust retrieval parameters.
- Model drift detection: Identify when your model's performance degrades due to changes in data distribution or user behavior (see the sketch after this list).
- Understanding user behavior: Gain insights into how users interact with your system, which will help you tailor future improvements to their needs.
- Fairness and bias mitigation: Monitor your system's outputs for potential biases and take corrective action to ensure fairness.
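As a toy illustration of the drift check referenced above, the sketch below compares a monitored score distribution (for example, daily response sentiment) against a baseline window using a two-sample Kolmogorov-Smirnov test; the metric, window, and threshold are illustrative assumptions:

from scipy.stats import ks_2samp

def detect_drift(baseline_scores: list[float], current_scores: list[float],
                 p_threshold: float = 0.05) -> bool:
    # Two-sample KS test: a small p-value suggests the two distributions differ
    _statistic, p_value = ks_2samp(baseline_scores, current_scores)
    return p_value < p_threshold

# Example with made-up sentiment scores for a baseline week vs. the current week
baseline = [0.82, 0.91, 0.77, 0.88, 0.85, 0.90, 0.79]
current = [0.41, 0.52, 0.38, 0.47, 0.55, 0.44, 0.50]
if detect_drift(baseline, current):
    print("Possible drift: review recent prompts, retrieval quality, and data sources.")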
Fine-tuning for domain-specific expertise
Fine-tuning LLMs for specific domains remains crucial even after deployment, and monitoring is essential for determining the right time to do it. RAG systems fetch information dynamically from outside sources, but fine-tuning bakes stable, domain-specific knowledge into the model, reducing runtime computation and improving response accuracy.
Monitoring your RAG system’s performance is key to knowing when fine-tuning becomes necessary. For instance, if your system's responses begin to deviate or become less relevant to user queries, it could signal that it’s time to integrate the relevant domain knowledge into the LLM’s weights through fine-tuning.
RAG retrieves information on the fly, whereas a fine-tuned model stores that information internally, letting it respond faster and more accurately while using less computing power.
For example:
- If monitoring reveals declining response quality, a medical LLM might benefit from fine-tuning with updated clinical guidelines or new research data.
- In finance, if RAG responses start failing to align with current regulations or market conditions, it might be time to fine-tune the LLM to reflect these stable changes in the domain.
- In legal applications, fine-tuning can keep the LLM up to date with new case law, reducing the need for dynamic retrieval when static incorporation is more efficient.
Think of it as the difference between someone constantly referencing a textbook (the RAG approach) versus someone who has fully absorbed the content and can reason, connect ideas, and respond in a more nuanced and cohesive way (a fine-tuned model).
Monitoring will allow you to identify when the dynamic retrieval approach is insufficient and when fine-tuning is a better solution to maintain optimal performance.
WhyLabs Optimize: your partner in continuous improvement
Tools such as the WhyLabs Optimize module can streamline the continuous development process. With WhyLabs Optimize, you can:
- Build evaluation and red team datasets to assess the performance of your RAG system and identify potential vulnerabilities or biases.
- Continuously tune policies based on monitoring insights and performance metrics.
- Choose an optimal model based on predefined criteria and automate model updates.
- Monitor and analyze the impact of fine-tuning and adjustments on system performance.
Addressing these challenges and looking ahead
While post-deployment maintenance is essential, it's important to acknowledge potential challenges like data labeling costs and model retraining time.
By proactively planning for these complexities and using tools like WhyLabs Optimize, you can ensure your RAG system continues to evolve, adapt, and deliver value long into the future.
Conclusion: how to monitor and secure RAG applications in production
Applications that use RAG evolve quickly. It is important to keep your system current, perform thorough maintenance, and make changes based on new information and user needs.
By implementing the strategies discussed in this article, from monitoring key metrics and fine-tuning your LLM to optimizing retrieval and incorporating human feedback, you can ensure your RAG system continues to perform optimally and deliver accurate, valuable insights.
Continuous monitoring entails implementing best practices and being vigilant about responding to issues as soon as possible. Tools like WhyLabs and LangKit can help you stay proactive, enabling you to address issues in real-time while adapting to shifts in data and user behavior.
Review your current RAG system for gaps in monitoring effectiveness, retrieval quality, model accuracy, security vulnerabilities, or feedback incorporation. Specifically, WhyLabs lets you set up alerts that trigger automatically when anomalies or performance issues arise.
As mentioned, you can start by defining LangKit profiles using the metrics you want to monitor. After creating these profiles, upload them to the WhyLabs platform and set up the necessary alerts to receive immediate notifications for action. This real-time alerting system catches any deviation from expected behavior early, enabling swift action to maintain optimal system performance.
1. Create a free WhyLabs account to begin tracking your LLM application.
2. Check out more examples of LangKit and WhyLabs LLM application monitoring.
Explore our additional resources below to learn more about maintaining your RAG applications.
Additional Resources
- 7 Ways to Monitor Large Language Model Behavior.
- Uploading profiles to WhyLabs.
- WhyLabs Learning Center.
- How to Evaluate and Improve RAG Applications for Safe Production Deployment
- AI Observability is Dead; Long Live AI Observability! Introducing WhyLabs AI Control Center for Generative and Predictive AI.
- Monitoring LLMs in Production using Hugging Face & WhyLabs.