How to Evaluate and Improve RAG Applications for Safe Production Deployment
- AI Observability
- LLMs
- LLM Security
- LangKit
- RAG
- Open Source
Jul 17, 2024
TL;DR
- Retrieval-augmented generation (RAG) combines information retrieval systems with large language models (LLMs) to make AI-generated text more accurate and reliable by accessing the latest, relevant data.
- To develop RAG systems, you must carefully consider the retrieval quality, LLM behavior, data quality, security, and how the LLMs and retrieval mechanisms integrate.
- Many RAGs today are evaluated using LLMs; even the datasets used to evaluate the RAG are created by LLMs! Using tools such as LangKit allows you to assess RAG components.
- LangKit also natively integrates with WhyLabs AI Control Center to provide a platform with dashboards that allow you to visualize and monitor your LLM applications for accuracy issues, malicious prompts, and more.
Today, we admire large language models (LLMs) like GPT-4 because they are so good at writing text that looks like it could have been written by a human. However, these models often struggle with accessing real-time, real-world knowledge, which can lead to potential factual inaccuracies commonly referred to as "hallucinations".
To overcome these problems, most developers have built retrieval-augmented generation (RAG) systems. These systems integrate the generative power of LLMs with robust information retrieval systems (e.g., dense or sparse retrieval).
This allows the models to pull relevant data from external sources, improving accuracy, contextual understanding, and output reliability.
Developing RAG systems is a complex process for developers. They must navigate challenges such as selecting appropriate data sources, optimizing retrieval algorithms, and ensuring seamless communication between the LLM and retrieval components.
But the development phase is only half the battle. Rigorous evaluation is equally critical before transitioning a RAG system to production. You must assess the system's performance, accuracy, and robustness under various scenarios to ensure it meets the desired standards.
By the end of this article, you will have learned how to evaluate RAG systems thoroughly to identify and address potential issues, making the systems more reliable and effective for real-world applications.
How do RAGs work?
RAGs enhance the capabilities of LLMs by integrating them with external data typically stored in vector databases. These databases store large datasets as high-dimensional vectors to efficiently retrieve contextually relevant data by mathematically calculating prompt similarity.
As a brief explanation, here's what happens when you send a prompt to a RAG-enabled LLM:
- Query formulation: The LLM receives a natural language prompt converted into an encoded vector.
- Information retrieval: The vector is compared to others in the vector database to find the most relevant data to retrieve based on the query's intent and context.
- Data integration and response generation: The retrieved information is integrated into the LLM's process of predicting the next word in its response, combining its trained knowledge with the knowledge extracted from the vector database for the final output.
Challenges to address before deploying RAG systems
Deploying RAG systems to production is challenging for many reasons, mostly because combining generative LLMs with complex retrieval mechanisms is complex.
Before even considering deployment, you must address a handful of issues while developing RAG systems:
1. Non-deterministic LLM behavior
LLMs can produce unpredictable outputs. Fine-tuning with diverse datasets and implementing uncertainty-handling mechanisms can promote consistency.
Non-deterministic behavior refers to LLMs' tendency to generate different responses to the same prompt due to the inherent randomness in their training and inference processes.
2. Quality of retrieval
A RAG system's effectiveness heavily depends on its ability to fetch relevant and diverse documents in response to user queries. Inaccurate retrieval can result in irrelevant or misleading responses, severely impacting user trust and system utility in prod.
3. Data quality issues
Most RAG systems that work well in production depend on the quality of their data sources. Outdated or erroneous data can lead to misleading answers, compromising the system's utility.
Ensuring high-quality data involves rigorous cleaning, validation processes, and continuous updates and monitoring of the knowledge base to keep the information current and relevant.
4. Integration nuances
Technical challenges in integrating retrieval and generative components can affect output quality. Retrieval latency, document selection accuracy, and efficient information passage between system components must be meticulously managed.
5. Security, safety, and compliance
Robust security, safety, and compliance measures like SOC 2 are essential to ensure the responsible development of RAG systems.
This involves protecting data privacy through anonymization techniques, adhering to data privacy regulations like GDPR or CCPA, and implementing content filtering and bias detection algorithms to prevent misuse.
Human oversight mechanisms and adversarial training are crucial for reviewing outputs and improving the system's resilience against manipulation attempts.
The Open Web Application Security Project (OWASP) Top 10 list for LLMs provides valuable guidance on potential vulnerabilities, such as prompt injection, insecure output handling, and training data poisoning.
You’ve seen some challenges you might encounter before deploying the RAG system. However, deploying it in production could also involve the following challenges:
- Scalability and robustness: The system must handle varying demands and maintain functionality under high loads, requiring a scalable architecture that doesn't sacrifice speed or accuracy.
- Predicting user interactions: Anticipating real-world user interactions is difficult. Continuous monitoring and adaptive modifications based on user data are crucial.
Evaluating the performance of RAG systems with LangKit
Much testing and experimenting are done to assess RAG performance, including model versioning, prompt versioning, frameworks, vector databases, retrieval techniques, generation techniques, etc.
LangKit is an open-source, out-of-the-box telemetry tool for LLM prompts and responses that tracks quality, relevance, sentiment, and security metrics. It can collect metrics and allow users to compare them against each strategy to make data-driven decisions about improving their assessment pipeline.
LangKit lets you:
- Evaluate whether the LLM behavior is compliant with policy
- Validate and safeguard individual prompts and responses
- Compare and A/B test across different LLM and prompt versions
- Monitor user interactions inside LLM-powered applications
Logging metrics and comparing the outputs of responses from different models and assessment strategies will also allow for quantitative and unbiased assessments for faster deployments with higher confidence.
In this section, you will go through a sample assessment pipeline using an LLM to assess a RAG. Throughout this process, you’ll see how LangKit provides assessment metrics for your RAG system.
Step 1: Generate a synthetic evaluation dataset
A synthetic evaluation dataset is a collection of artificially generated questions and corresponding answers (Q&A) used to assess the performance of a RAG system.
A robust synthetic evaluation dataset is critical for assessing your RAG's performance. It allows you to test the system's ability to handle various questions and scenarios.
However, manually crafting such a dataset can be time-consuming and limit the variety of questions. Here's where LangKit shines.
Although this can be done with any LLM with the correct prompt, LangKit can be leveraged to streamline this process and explore the effectiveness of different LLMs for generating Q&A pairs.
LangKit will allow you to quickly assess and evaluate different strategies during the dataset creation process, such as:
- Quickly assess various LLMs: Connect LangKit to different LLM models, such as OpenAI’s GPT-4o, or open-source models, such as Mixtral. If the RAG database contains domain-specific data, other fine-tuned LLMs focused on those domains can also be assessed in this step.
- Assess various prompts: Experiment with different prompts to guide the LLMs in generating relevant and informative questions. LangKit stores and manages these variations, allowing you to test which prompts yield the most effective question-answer pairs for your RAG system evaluation.
For instance, the code snippet below shows how to log the metrics of prompts and responses from two OpenAI LLMs (one is a DaVinci-based model, and the other is the default, GPT-3.5-turbo model) using LangKit:
Step 2: Critique agent evaluation
After generating the set of Q&A samples, you will require additional filtering to remove unsuitable and inappropriate questions from the dataset. To accomplish this task, use critique agents to assess the quality of the LLM-generated questions based on three critical criteria:
- Groundedness: Does the question have a clear basis in the provided context (e.g., surrounding text)?
- Relevance: Is the question helpful or informative to the user?
- Stand-alone: Can someone with domain knowledge and internet access understand the question independently without needing the specific context of the article from which it was generated?
Critique agents assign scores to each question based on these criteria, typically on a scale of 1-5. For example, a question highly grounded in the context, relevant to the user, and can stand alone would receive a score of 5. In contrast, a question that lacks context, is irrelevant, and requires specific article knowledge might receive a score of 1.
You can filter out low-scoring questions to ensure the evaluation dataset is populated with high-quality questions that effectively test the RAG system's abilities.
You can use LangKit for critique evaluation with an evaluation check like `response_hallucination` to assess the consistency between the target response and a group of additional response samples:
from langkit import response_hallucination
from langkit.openai import OpenAIDefault
response_hallucination.init(llm=OpenAIDefault(), num_samples=1)
result = response_hallucination.consistency_check(
prompt="who was Marie Curie?",
response="Marie Curie was an English barrister and politician who served as Member of Parliament for Thetford from 1859 to 1868.",
)
print(result)
Here’s the output for running that code snippet that includes the semantic similarity score between the prompt, the generated target response, and other generated responses:
LangKit can help detect private personal information (PII), which may be useful if your RAG system contains a lot of PII:
from langkit import extract, pii
data = {"prompt": prompts[0],
"response": prompts[-1]}
result = extract(data)
Our output is a JSON file that displays our original prompt, then the sensitive words we found and their scores:
Step 3: Building the RAG system
This step is where the magic happens! When building your RAG system, you integrate two key components: the Preprocessor and the Retriever-Reader System.
- Preprocessor: Organizes the knowledge base by dividing documents into smaller units (e.g., passages) for efficient retrieval.
- Retriever-reader system: Analyzes user queries, searches the preprocessed knowledge base, and extracts/summarizes relevant information for the LLM.
Together, these components provide the LLM (powering a third component—the generator) with the information needed for accurate and comprehensive responses, enhancing the quality of the generated text.
Additionally, you can change the RAG system’s design (e.g., modify the chunk size in the preprocessor or change the retrieval system). Using LangKit, you can quickly assess the effects of these changes on the LLM.
With LangKit, you can also create a feedback loop to see how changes in the retrieval system affect the overall output when evaluating the RAG using a Judge LLM in the next step.
LangKit provides quantifiable data points (toxicity scores, semantic scores, etc.), allowing you to make data-driven decisions on the most effective architectural choice.
Step 4: Judge LLM evaluation
The final step involves evaluating the RAG system's performance using an LLM as a "judge." Here, an LLM like GPT-4 assesses the usefulness of the RAG system's outputs on a prepared evaluation dataset.
While large and reputable LLMs can provide valuable insights, a robust evaluation process requires a broader scope. Here's how you can integrate LangKit to achieve a more comprehensive assessment:
- Detecting jailbreaks and prompt injections: Identifying attempts to manipulate the system through malicious techniques like jailbreaks (attempts to bypass safety restrictions) or prompt injections (inserting malicious instructions into prompts). Here’s a notebook you can follow to detect jailbreaks, prompt injections, and malicious attacks with LangKit.
- Monitoring sentiment and toxicity: Ensuring responses are informative, unbiased, and neutral in tone. Here’s a notebook you can follow to understand how to track the sentiment and toxicity of LLM-generated responses with LangKit.
- Response consistency: Evaluating whether responses are consistent with retrieved information and past interactions. Here’s a notebook you can follow to understand how to analyze LLM-generated responses for consistency with LangKit.
- Monitoring LLM behavior over time: Tracking key metrics to identify shifts in performance or bias for continuous improvement. Here’s a notebook you can follow to understand how to keep track of the LLM’s behavior with LangKit.
You create a more comprehensive and reliable evaluation process for your RAG system by incorporating the full scope of LangKit's features. This should, of course, lead to a more secure, reliable, and trustworthy system for your users.
Using LangKit with WhyLabs AI Control Center to identify areas for improving RAG applications
Once you’ve assessed our LLM metrics, how can you visualize and compare our results?
WhyLabs AI Control Center is a centralized platform for monitoring and managing AI applications, including LLMs. It provides a comprehensive view of AI system performance, enabling teams to identify issues, optimize models, and ensure responsible AI development.
As a plus, LangKit natively integrates with the WhyLabs AI Control Center so that you can send the extracted metrics directly to the platform. These metrics can include relevance scores, sentiment analysis, and other key performance indicators (KPIs) specific to your RAG system.
Visualizing these critical metrics on clear dashboards lets you quickly identify trends and potential LLM performance problems over time. For example, if the sentiment analysis scores from LangKit consistently indicate high negativity in the generated responses, this could signal an issue with the LLM's training data or prompt design.
WhyLabs dashboards allow you to spot trends early and adjust the model or prompts to improve the tone and quality of content.
In addition to visualization, WhyLabs AI Control Center allows you to define thresholds for your LLM's performance metrics. If metrics fall outside these predefined ranges, WhyLabs can trigger alerts to notify you of potential issues.
This proactive approach enables you to address concerns before they escalate to ensure a consistently high-quality user experience.
WhyLabs also enables you to combine LangKit data with data from other monitoring tools, creating a holistic view of your LLM's behavior. Integrating data from multiple sources helps you understand your LLM and make better optimization and use decisions.
Combining WhyLabs AI Control Center and LangKit creates a powerful toolkit for building secure, reliable, and trustworthy RAG systems. WhyLabs is the central hub for security, explainability, and responsible AI development, while LangKit provides the tools to monitor and measure each process step.
These complementary platforms can help your RAG system become an effective, responsible AI tool that provides exceptional user value and experience.
Conclusion:
Retrieval Augmented Generation (RAG) systems combine the power of LLMs with robust information retrieval systems. By giving LLMs access to and using real-time, domain-specific data, RAG systems can produce more accurate, useful, and reliable results than LLMs that work alone.
However, developing and deploying RAG systems comes with its own set of challenges. These include ensuring high-quality data retrieval, seamless integration between the retrieval and generation components, handling non-deterministic LLM behavior, maintaining data quality, and addressing security, safety, and compliance concerns.
A rigorous evaluation process is essential to overcome these challenges and realize the full potential of RAG systems. Tools like LangKit and WhyLabs AI Control Center play a crucial role in this process, assessing performance, identifying issues, and making data-driven improvements at every stage of development.
Following these development, evaluation, and monitoring practices, organizations can use RAG systems to drive innovation, improve decision-making, and deliver exceptional user value while prioritizing responsible and ethical AI development.
Additional Resources
- LangKit Github Repository
- Sending LLM Metrics to WhyLabs
- WhyLabs AI Control Center
- HuggingFace: Technical Guide on Building a RAG Evaluator
- ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent
- A portion of this paper discusses the metrics used to assess a critique agent.
- Prometheus on HuggingFace
- This is comparable to GPT-4 when it comes to Q&A assessments. It also discusses the importance of clearly defining each numeric score when generating your RAG Q&A dataset.
Other posts
Understanding and Implementing the NIST AI Risk Management Framework (RMF) with WhyLabs
Dec 10, 2024
- AI risk management
- AI Observability
- AI security
- NIST RMF implementation
- AI compliance
- AI risk mitigation
Best Practicies for Monitoring and Securing RAG Systems in Production
Oct 8, 2024
- Retrival-Augmented Generation (RAG)
- LLM Security
- Generative AI
- ML Monitoring
- LangKit
WhyLabs Integrates with NVIDIA NIM to Deliver GenAI Applications with Security and Control
Jun 2, 2024
- AI Observability
- Generative AI
- Integrations
- LLM Security
- LLMs
- Partnerships
OWASP Top 10 Essential Tips for Securing LLMs: Guide to Improved LLM Safety
May 21, 2024
- LLMs
- LLM Security
- Generative AI
7 Ways to Evaluate and Monitor LLMs
May 13, 2024
- LLMs
- Generative AI
How to Distinguish User Behavior and Data Drift in LLMs
May 7, 2024
- LLMs
- Generative AI